Danderson123 changed the title from "Incorrect annotation of one gene as another (with minimal example)" to "Incorrect annotation of one gene as another due to tight clustering of minimizer hits (with minimal example)" on Jun 29, 2023
Hey Leandro!

I have been taking a look at cases where Pandora incorrectly annotates one gene as another. I took the first allele in each MSA used to build my panRG and ran Pandora map with the `--debugging-files` argument so I could see the minimizer hits to different PRGs. Using a subsample of 5000 genes, I estimate that Pandora correctly annotates genes ~90.7% of the time when the queried allele is identical to one in the MSA.

A common cause of misannotation is when a large number of minimizers map in a tight cluster to a long gene, rather than spanning the full length of the gene as you would expect if the gene were correctly called. You can see this in the .png file in the gzipped directory attached. The plot shows k-mers for each gene on the x-axis, with the colours showing the minimizer hits from the Pandora `.minimatches` file (red shows hits in the incorrect gene and blue shows hits in the correct gene). The true gene is "group_4445", but Pandora annotates it as "group_25558" because, weirdly, more minimizers map to the incorrect gene than to the true gene.

I have compiled a minimal example for these genes in the directory attached. This includes the MSAs, the panRG, the query sequences and the Pandora map output. I used `make_prg` v0.4.0 and `pandora_b19d26`.

Best wishes,
Daniel
tightly_clustered_minimizers.tar.gz
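To make the "tight cluster vs. full-length span" distinction concrete, here is a minimal sketch of how candidate genes could be compared by how much of their length the hits cover, rather than by hit count alone. It assumes hit positions have already been extracted as integer k-mer offsets per gene. The function names, the `min_span` threshold and the example numbers are all hypothetical: they are not part of Pandora, and nothing here is taken from the attached data.

```python
def hit_span_fraction(hit_positions, gene_length):
    """Fraction of the gene lying between the first and last minimizer hit."""
    if not hit_positions or gene_length <= 0:
        return 0.0
    span = max(hit_positions) - min(hit_positions) + 1
    return min(span / gene_length, 1.0)


def pick_gene(candidates, min_span=0.5):
    """Choose a gene from {name: (hit_positions, gene_length)}.

    Genes whose hits span at least `min_span` of their length are preferred;
    among those, the gene with the most hits wins. If no gene passes the
    span filter, fall back to plain hit count over all candidates.
    """
    spanning = {
        name: value
        for name, value in candidates.items()
        if hit_span_fraction(*value) >= min_span
    }
    pool = spanning or candidates
    return max(pool, key=lambda name: len(pool[name][0]))


# Made-up illustration: a long gene with many hits packed into 30 positions,
# and a shorter gene with fewer hits spread along its whole length.
candidates = {
    "long_gene": (list(range(100, 130)), 3000),
    "short_gene": (list(range(0, 900, 45)), 900),
}
chosen = pick_gene(candidates)
```

With these invented inputs, a pure hit count would favour `long_gene` (30 hits against 20), whereas the span filter selects `short_gene`. Whether a filter like this suits Pandora's actual clustering logic is an open question; it is only meant to illustrate the pattern described above.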