Gene Prediction

1. Evidence Based Gene Prediction

Overview of steps:

The steps for evidence based prediction are as follows:

Map RNAseq reads to the genome
Assemble transcripts using various transcript assemblers
Run Mikado to consolidate transcripts and run programs to identify splice junctions, ORFs and full length matches to SwissProt plants proteins.
Pick transcripts for each locus and finalize evidence based annotations.
Post-processing of evidence based predictions.

Steps in detail:

1. Map RNAseq reads:

The RNAseq reads generated for Ab10 were as follows:

ID	PlantStage	Tissue	Type	TotalReads	AvgLength
MN02011	8DAS	root	Paired	29,818,892	152
MN02012	8DAS	root	Paired	32,543,108	152
MN02021	8DAS	shoot	Paired	26,420,582	152
MN02022	8DAS	shoot	Paired	32,247,423	152
MN02031	V11	leaf-base	Paired	22,147,809	151
MN02032	V11	leaf-base	Paired	27,560,950	151
MN02041	V11	leaf-middle	Paired	22,146,843	151
MN02042	V11	leaf-middle	Paired	22,466,821	151
MN02051	V11	leaf-tip	Paired	26,407,717	151
MN02052	V11	leaf-tip	Paired	21,923,046	151
MN02061	V18	tassel	Paired	21,278,567	151
MN02062	V18	tassel	Paired	22,868,224	151
MN02071	V18	ear	Paired	30,154,390	151
MN02072	V18	ear	Paired	20,563,411	151
MN02081	R1	anther	Paired	17,818,298	151
MN02082	R1	anther	Paired	24,882,545	151
MN02091	16DAP	endosperm	Paired	26,834,976	151
MN02092	16DAP	endosperm	Paired	26,182,541	151
MN02101	16DAP	embryo	Paired	14,833,469	151
MN02102	16DAP	embryo	Paired	26,947,220	151

Reads were mapped using STAR mapping program as follows:

# index
runSTARmap_index.sh B73Ab10.pseudomolecules-v2.fasta
# first round mapping
for read1 in *R1.fq.gz; do
  runSTARmap_round1.sh B73Ab10.pseudomolecules-v2.fasta ${read1}
done
# consolidate splice info
awk -f sjCollapseSamples.awk *_SJ.out.tab | sort -k1,1V -k2,2n -k3,3n > SJ.all
# second round mapping
for read1 in *R1.fq.gz; do
  runSTARmap_round2.sh B73Ab10.pseudomolecules-v2.fasta ${read1}
done
# merge BAM files
merge-bam-files.sh B73_AB10

2. Transcript Assembly

Five different transcript assemblers were ran on the merged BAM file (merged_B73_AB10.bam) as follows:

runStrawberry.sh merged_B73_AB10.bam
runStringtie.sh merged_B73_AB10.bam
runTrinity.sh merged_B73_AB10.bam
runClass2.sh merged_B73_AB10.bam
runCufflinks.sh merged_B73_AB10.bam

Since Trinity generates a fasta file, we will use GMAP to map them and create a GFF3 file.

runGMAPcdna.sh B73Ab10.pseudomolecules-v2.fasta

We will also need splice junctions, that we can generate using PortCullis:

runPortcullis.sh merged_B73_AB10.bam

3. Consolidate transcripts and prepare files

Here, we will use all the transcript assemblies generated in the previous step, along with splice junctions and pool them together. The redundant transcripts are also removed. Using consolidated transcripts, we will find full length trasncritps by BLAST searching against SwissProt proteins, and detect ORFs using TransDecoder.

This is all accomplished using this script:

copy-files-for-mikado.sh B73_Ab10

4. Finalize evidence-based predictions

As a last step, we will run Mikado to finalize the predictions. It is done using this script:

finalize-files-and-pick.sh B73_Ab10

5. Post processing of evidence-based predictions

Kapeel's section

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Gene Prediction

1. Evidence Based Gene Prediction

1. Map RNAseq reads:

2. Transcript Assembly

3. Consolidate transcripts and prepare files

4. Finalize evidence-based predictions

5. Post processing of evidence-based predictions

Files

README.md

Latest commit

History

README.md

File metadata and controls

Gene Prediction

1. Evidence Based Gene Prediction

1. Map RNAseq reads:

2. Transcript Assembly

3. Consolidate transcripts and prepare files

4. Finalize evidence-based predictions

5. Post processing of evidence-based predictions