Reat prediction protein alignment improvements #33

swarbred · 2022-06-27T14:31:41Z

@gemygk (as discussed)

Not a new issue, as this was always present in the initial spaln / exonerate tests.

spaln protein-to-genome alignments often have misaligned terminal exons (often the
penultimate exon should simply be extended to the next in-frame stop). I've played around with various options, including

"-LS (local similarity) option is sometimes useful for trimming off weakly matched terminal regions in the alignment."

with no positive effect. We also still get non-canonical splicing, even when setting

"-ya N: Dinucleotide pairs at the ends of an intron (0)

N=0: Accept only the canonical pairs (GT..AG,GC..AG,AT..AC)"
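To get a feel for how often this happens, the introns in the alignments could be checked against the three pairs listed above. A minimal sketch, assuming the intron sequence has already been extracted in transcript orientation (the function name is hypothetical and not part of the pipeline):

```python
# Dinucleotide pairs accepted by spaln with -ya 0, as quoted above.
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def is_canonical_intron(intron_seq: str) -> bool:
    """Return True if the intron starts/ends with a canonical donor/acceptor pair.

    Assumes intron_seq is given 5'->3' on the transcript strand.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False
    return (seq[:2], seq[-2:]) in CANONICAL_PAIRS
```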

While some of these issues are mitigated by how we use the alignments, both in the pipeline and downstream, I'm not completely happy, as we don't have these specific issues in exonerate alignments of the same data (though overall the original test results were in favour of spaln).

I think we can identify and tag/remove some of the problematic terminal exons.

In call-AlignProteins, post alignment, remove terminal exons where the terminal exon size is below X bp (say 30 bp) AND the terminal junction (intron) is over Y bp. For genomes with small introns, where introns over Y bp will be uncommon, this would remove many alignment artefacts with little risk of removing "valid" terminal exons.
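The two criteria could be sketched roughly as follows. This is only an illustration, assuming exons are 1-based inclusive (start, end) pairs as in GFF; the function and parameter names are hypothetical, and Y has no default because no value is proposed above.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, GFF style


def questionable_terminal_exons(
    exons: List[Exon], min_intron_bp: int, max_exon_bp: int = 30
) -> List[int]:
    """Return indices (in coordinate order) of terminal exons that are
    shorter than max_exon_bp (X) AND flanked by an intron longer than
    min_intron_bp (Y)."""
    exons = sorted(exons)
    if len(exons) < 2:
        return []  # single-exon alignments have no terminal junction

    flagged = []
    # Leftmost exon and the intron that follows it
    first_len = exons[0][1] - exons[0][0] + 1
    first_intron = exons[1][0] - exons[0][1] - 1
    if first_len < max_exon_bp and first_intron > min_intron_bp:
        flagged.append(0)
    # Rightmost exon and the intron that precedes it
    last_len = exons[-1][1] - exons[-1][0] + 1
    last_intron = exons[-1][0] - exons[-2][1] - 1
    if last_len < max_exon_bp and last_intron > min_intron_bp:
        flagged.append(len(exons) - 1)
    return flagged
```

Working in coordinate order keeps the check strand-agnostic: both ends are tested, so it does not matter which one is the 5' or 3' terminal exon.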

By default we would just add a tag to the mRNA in the GFF, e.g. QTERMEXO=1, if the terminal exon meets both the terminal exon size and intron criteria (we would need to check whether we need to set QTERMEXO=0 for those not meeting the criteria to allow use in mikado). The non-default option would be to trim off these terminal exons (updating mRNA and gene spans accordingly). There may be some cases where, after trimming, the new terminal exon would also meet the criteria and could be trimmed too (this would happen fairly rarely though).
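A rough sketch of the non-default trimming mode, including the repeat pass for a newly exposed terminal exon, might look like this. The helper names are hypothetical, exons are assumed to be 1-based inclusive (start, end) pairs, and writing QTERMEXO=0 explicitly is an assumption pending the mikado check mentioned above.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive


def trim_terminal_exons(
    exons: List[Exon], max_exon_bp: int, min_intron_bp: int
) -> Tuple[List[Exon], Tuple[int, int]]:
    """Repeatedly drop short terminal exons flanked by long introns.

    Returns the kept exons and the updated (start, end) span to use
    for the mRNA (and gene) feature.
    """
    kept = sorted(exons)
    if not kept:
        raise ValueError("at least one exon is required")

    def is_short(exon: Exon) -> bool:
        return exon[1] - exon[0] + 1 < max_exon_bp

    def intron_len(left: Exon, right: Exon) -> int:
        return right[0] - left[1] - 1

    changed = True
    while changed and len(kept) > 1:
        changed = False
        if is_short(kept[-1]) and intron_len(kept[-2], kept[-1]) > min_intron_bp:
            kept.pop()  # drop rightmost exon
            changed = True
        if (len(kept) > 1 and is_short(kept[0])
                and intron_len(kept[0], kept[1]) > min_intron_bp):
            kept.pop(0)  # drop leftmost exon
            changed = True
    return kept, (kept[0][0], kept[-1][1])


def with_qtermexo(attributes: str, flagged: bool) -> str:
    """Append the QTERMEXO tag to a GFF attribute string (default mode)."""
    return f"{attributes.rstrip(';')};QTERMEXO={1 if flagged else 0}"
```

The loop never removes the last remaining exon, so an alignment is shortened but not deleted outright.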

Removing those terminal exons and accepting incomplete protein alignments is likely preferable. Even if they are not removed and are just marked with the GFF tag, we can use that tag in the mikado step for scoring / filtering.

The obvious place to do this would be the existing spaln2gff script, but it could be done in a separate script.

Other than terminal exon / junction length, the exon identity (third col of the .s file we generate) could set a max value over which terminal exons are not tagged/removed, i.e. require identity under 80%. The alternative would be to set a threshold relative to the average exon identity of the aligned protein or across all aligned proteins. Given I've seen "incorrect small terminal exons" with 100% identity, the use of just exon and junction length might be sufficient.
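The two identity options could be sketched as below. This is illustrative only: the function name and the `margin` parameter (how far below the mean counts as "low") are hypothetical, and the mean here is taken per aligned protein rather than across all proteins.

```python
from statistics import mean
from typing import Sequence


def terminal_exon_is_low_identity(
    terminal_identity: float,
    exon_identities: Sequence[float],
    max_identity: float = 80.0,
    relative: bool = False,
    margin: float = 10.0,
) -> bool:
    """Decide whether a terminal exon's identity is low enough to tag/remove.

    Absolute mode: identity must be under max_identity (80% as suggested above).
    Relative mode: identity must be more than `margin` percentage points
    below the mean exon identity of the same aligned protein (assumption).
    """
    if relative:
        return terminal_identity < mean(exon_identities) - margin
    return terminal_identity < max_identity
```

Note that a filter like this would keep the 100% identity small terminal exons mentioned above, which is the argument for relying on exon and junction length alone.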

The script already calculates id and cov and places these in the note, e.g. Note=rna-XM_002439085.2|cov:100.00|id:68.80
Adding separate tags for ID and COV in addition would mean we can use these in mikado (which would be useful for the coverage at least).
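Pulling the two values out of the existing note into their own attributes could look something like this. The tag names `prot_cov` and `prot_id` are placeholders, not agreed names, and the regular expression assumes the note always follows the `|cov:...|id:...` layout shown above.

```python
import re

# Matches the "|cov:100.00|id:68.80" suffix the script already writes.
NOTE_COV_ID = re.compile(r"\|cov:(?P<cov>[0-9.]+)\|id:(?P<id>[0-9.]+)")


def add_cov_id_tags(attributes: str) -> str:
    """Append separate coverage/identity tags to a GFF attribute string.

    The original Note is left untouched; attributes without a matching
    note are returned unchanged.
    """
    match = NOTE_COV_ID.search(attributes)
    if match is None:
        return attributes
    return (
        f"{attributes.rstrip(';')}"
        f";prot_cov={match['cov']};prot_id={match['id']}"
    )
```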

Above is a crude hack but would help.

The more substantial improvement would be to add exonerate alignment as an option in a limited way. As discussed, I've experimented with running exonerate on the proteins and regions pulled out from the spaln alignments, i.e. realigning specific proteins to specific regions. This has value and appears practical in terms of runtime/resource (about 28 mins for 81056 alignments of 29456 query proteins). It would be an option in addition to the spaln alignments and would need to feed into the downstream scoring / filtering.
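As a sketch of the region step, each spaln alignment span could be padded and clamped to the sequence before the protein is realigned to it. The padding size, function names and file handling are all assumptions here, not what was used in the experiment; the exonerate options shown (`--model protein2genome`, `--query`, `--target`, `--showtargetgff`) are standard ones, but the actual command line from the test runs is not recorded in this issue.

```python
from typing import List, Tuple


def padded_region(start: int, end: int, seq_len: int, pad: int = 2000) -> Tuple[int, int]:
    """Extend a 1-based inclusive alignment span by `pad` bp on each side,
    clamped to the bounds of the target sequence. `pad` is a hypothetical
    default."""
    return max(1, start - pad), min(seq_len, end + pad)


def exonerate_command(query_fasta: str, target_fasta: str) -> List[str]:
    """Build an exonerate call for one protein against one extracted region."""
    return [
        "exonerate",
        "--model", "protein2genome",
        "--query", query_fasta,
        "--target", target_fasta,
        "--showtargetgff", "yes",
    ]
```

Coordinates reported against an extracted region would then need shifting back by the region start minus one to return to genome coordinates.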

#33

swarbred added the enhancement New feature or request label Jun 29, 2022

gemygk added a commit that referenced this issue Feb 27, 2023

feat: add method to filter questionable terminal cds/intron coordinates

394130b

#33

gemygk mentioned this issue May 26, 2023

Issue 33 #43

Merged
