Overview of steps:
The steps for evidence based prediction are as follows:
- Map RNAseq reads to the genome
- Assemble transcripts using various transcript assemblers
- Run Mikado to consolidate transcripts and run programs to identify splice junctions, ORFs and full length matches to SwissProt plants proteins.
- Pick transcripts for each locus and finalize evidence based annotations.
- Post-processing of evidence based predictions.
Steps in detail:
The RNAseq reads generated for Ab10 were as follows:
ID | PlantStage | Tissue | Type | TotalReads | AvgLength |
---|---|---|---|---|---|
MN02011 | 8DAS | root | Paired | 29,818,892 | 152 |
MN02012 | 8DAS | root | Paired | 32,543,108 | 152 |
MN02021 | 8DAS | shoot | Paired | 26,420,582 | 152 |
MN02022 | 8DAS | shoot | Paired | 32,247,423 | 152 |
MN02031 | V11 | leaf-base | Paired | 22,147,809 | 151 |
MN02032 | V11 | leaf-base | Paired | 27,560,950 | 151 |
MN02041 | V11 | leaf-middle | Paired | 22,146,843 | 151 |
MN02042 | V11 | leaf-middle | Paired | 22,466,821 | 151 |
MN02051 | V11 | leaf-tip | Paired | 26,407,717 | 151 |
MN02052 | V11 | leaf-tip | Paired | 21,923,046 | 151 |
MN02061 | V18 | tassel | Paired | 21,278,567 | 151 |
MN02062 | V18 | tassel | Paired | 22,868,224 | 151 |
MN02071 | V18 | ear | Paired | 30,154,390 | 151 |
MN02072 | V18 | ear | Paired | 20,563,411 | 151 |
MN02081 | R1 | anther | Paired | 17,818,298 | 151 |
MN02082 | R1 | anther | Paired | 24,882,545 | 151 |
MN02091 | 16DAP | endosperm | Paired | 26,834,976 | 151 |
MN02092 | 16DAP | endosperm | Paired | 26,182,541 | 151 |
MN02101 | 16DAP | embryo | Paired | 14,833,469 | 151 |
MN02102 | 16DAP | embryo | Paired | 26,947,220 | 151 |
Reads were mapped using STAR mapping program as follows:
# index
runSTARmap_index.sh B73Ab10.pseudomolecules-v2.fasta
# first round mapping
for read1 in *R1.fq.gz; do
runSTARmap_round1.sh B73Ab10.pseudomolecules-v2.fasta ${read1}
done
# consolidate splice info
awk -f sjCollapseSamples.awk *_SJ.out.tab | sort -k1,1V -k2,2n -k3,3n > SJ.all
# second round mapping
for read1 in *R1.fq.gz; do
runSTARmap_round2.sh B73Ab10.pseudomolecules-v2.fasta ${read1}
done
# merge BAM files
merge-bam-files.sh B73_AB10
Five different transcript assemblers were ran on the merged BAM file (merged_B73_AB10.bam
) as follows:
runStrawberry.sh merged_B73_AB10.bam
runStringtie.sh merged_B73_AB10.bam
runTrinity.sh merged_B73_AB10.bam
runClass2.sh merged_B73_AB10.bam
runCufflinks.sh merged_B73_AB10.bam
Since Trinity generates a fasta file, we will use GMAP to map them and create a GFF3 file.
runGMAPcdna.sh B73Ab10.pseudomolecules-v2.fasta
We will also need splice junctions, that we can generate using PortCullis:
runPortcullis.sh merged_B73_AB10.bam
Here, we will use all the transcript assemblies generated in the previous step, along with splice junctions and pool them together. The redundant transcripts are also removed. Using consolidated transcripts, we will find full length trasncritps by BLAST searching against SwissProt proteins, and detect ORFs using TransDecoder.
This is all accomplished using this script:
copy-files-for-mikado.sh B73_Ab10
As a last step, we will run Mikado to finalize the predictions. It is done using this script:
finalize-files-and-pick.sh B73_Ab10
Kapeel's section