Handle when ASV is longer than reference database sequence (BLAST) #26
Comments
Confirming this was a bug in the full_pident calculation, and I've now updated the taxreturn function to handle this. But I'd suggest we move the blast_top_hit function from taxreturn over to freyr, and run blastn from the command line in a separate process rather than calling it through R. Ideally we should migrate all taxreturn code that freyr depends on over to this repo, for ease of updating without having to rebuild the Docker container. I think the PHMM stuff is the only other code that currently requires taxreturn.

The reasoning behind the full_pident calculation is that BLAST may only align a portion of the sequence, leaving gaps at the ends, which overestimates % identity. The full_pident calculation corrects this by calculating the % identity across the full length of the query sequence. The problem was that if the query was longer than the subject, the calculation underestimated the % identity, because part of the query sequence is inherently unalignable.

Create test cases
Blast results are 0 indexed
| Old function | New function |
| --- | --- |
| Correct for cases when whole or part of the query aligns (and the query is the same length as the subject), but fails for cases when the query is longer than the subject | Correct for all cases |
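The updated taxreturn code is not shown in this thread, so the following is only a minimal sketch of one way to get the "correct for all cases" behaviour. It assumes the BLAST table carries qstart, qend, qlen, sstart, send and slen columns, that hits are on the plus strand, and that coordinates are 1-based and inclusive (adjust the offsets if yours are 0-based). The function name `full_pident_sketch` is hypothetical. The idea is to penalise only those unaligned query bases that have subject sequence opposite them, so query overhang past the end of the reference is ignored.

```r
# Hypothetical sketch, not the taxreturn implementation.
# Assumes 1-based inclusive coordinates and plus-strand hits (sstart <= send).
full_pident_sketch <- function(pident, length, qstart, qend, qlen,
                               sstart, send, slen) {
  # Query bases left unaligned at each end
  q_left  <- qstart - 1
  q_right <- qlen - qend
  # Subject bases left unaligned at each end
  s_left  <- sstart - 1
  s_right <- slen - send
  # Only unaligned query bases that face subject sequence could have aligned,
  # so only those are counted as mismatches
  penalty <- pmin(q_left, s_left) + pmin(q_right, s_right)
  (pident * length) / (length + penalty)
}

# Query (200 bp) longer than subject (180 bp), subject matched end to end
long_query <- full_pident_sketch(pident = 100, length = 180,
                                 qstart = 11, qend = 190, qlen = 200,
                                 sstart = 1, send = 180, slen = 180)

# Query and subject both 200 bp, only the first 180 bp aligned
partial_hit <- full_pident_sketch(pident = 100, length = 180,
                                  qstart = 1, qend = 180, qlen = 200,
                                  sstart = 1, send = 180, slen = 200)
```

Because it uses `pmin()`, the function is vectorised and could be dropped into a `dplyr::mutate()` call on a table of hits. Under these assumptions the first hit keeps its full identity, while the second is still penalised for the 20 query bases that had reference sequence available but did not align.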
The `taxreturn::blast_top_hit` function seems to have an issue that underestimates % identity when the ASV query is longer than the reference subject. Probably related to this code: `dplyr::mutate(full_pident = (pident * length)/(length - q_align + qlen))`
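To make the underestimate concrete, here is a small worked example of the formula quoted above. It assumes `q_align` is the number of aligned query bases (`qend - qstart + 1`); that definition is not shown in this issue, and the hit values are made up for illustration.

```r
# The full_pident expression quoted in this issue, wrapped in a function.
# Assumption: q_align = qend - qstart + 1 (aligned query bases).
full_pident_old <- function(pident, length, qstart, qend, qlen) {
  q_align <- qend - qstart + 1
  (pident * length) / (length - q_align + qlen)
}

# Hypothetical hit: a 200 bp ASV against a 180 bp reference. The reference is
# matched end to end at 100% identity (query positions 11 to 190), leaving
# 10 bp of query overhang on each side that has nothing to align to.
old_result <- full_pident_old(pident = 100, length = 180,
                              qstart = 11, qend = 190, qlen = 200)
old_result
```

With these numbers the denominator becomes 180 - 180 + 200 = 200, so a perfect match over the whole reference is reported as 90% identity. The 20 overhanging query bases are counted as mismatches even though the reference has no sequence there.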
Effects of this can be seen in `TAX_SUMMARY_MERGE`'s taxonomic assignment summary file, but it likely affects other BLAST-based assignment throughout the pipeline too. Most likely to be an issue when using Nanopore data with primers that sit outside the end of the reference database sequence.

Could be worth re-implementing a modified version of the function in the `bin/functions.R` script.