Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle when ASV is longer than reference database sequence (BLAST) #26

Open
jackscanlan opened this issue Aug 22, 2024 · 1 comment
Open
Assignees
Labels
bug Something isn't working

Comments

@jackscanlan
Copy link
Collaborator

jackscanlan commented Aug 22, 2024

taxreturn::blast_top_hit function seems to have an issue that underestimates % identity when the ASV query is longer than reference subject. Probably related to this code: dplyr::mutate(full_pident = (pident * length)/(length - q_align + qlen))

Effects of this can be seen in TAX_SUMMARY_MERGE's taxonomic assignment summary file but likely affects other BLAST-based assignment throughout pipeline too. Most likely to be an issue when using Nanopore data with primers that sit outside the end of reference database sequence.

Could be worth re-implementing a modified version of the function in the bin/functions.R script.

@jackscanlan jackscanlan added the bug Something isn't working label Aug 22, 2024
@alexpiper
Copy link

alexpiper commented Aug 30, 2024

Confirming this was a bug in the full_pident calculation, and ive now updated the taxreturn function to handle this. But id suggest we move the blast_top_hit function from taxreturn over to freyr, and run blastn from the command line in a seperate process rather than calling it through R.

Ideally we should migrate all taxreturn code that freyr depends on over to this repo for ease of updating without having to rebuild the docker container. I think the PHMM stuff is the only other code that currently requires taxreturn.

The reasoning behind the full_pident calcultion is that blast may only align a portion of the sequence leaving gaps at the end, overestimating % identity. The full_pident calculation corrects this by calculating the % identityy across full length of query sequence. The problem was that if the query was longer than the subject, the calculation underestimates the % identity, because part of the query sequence is inherently unalignable.

Create test cases

Blast results are 0 indexed

result <- tribble(
  ~pident, ~length, ~qstart, ~qend, ~qlen, ~slen, ~case, ~expected_pident,
  0.99, 100, 0, 99, 100, 100, "Query completely overlaps", 0.99,
  0.99, 90, 0, 89, 100, 100, "Part of query overlaps", 0.891,
  0.99, 90, 0, 99, 110, 100, "Query longer than subject", 0.99
)

# A tibble: 3 × 8
  pident length qstart  qend  qlen  slen case                      expected_pident
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                               <dbl>
1   0.99    100      0    99   100   100 Query completely overlaps           0.99 
2   0.99     90      0    89   100   100 Part of query overlaps              0.891
3   0.99     90      0    99   110   100 Query longer than subject           0.99 

Old function

top_hit <- result %>%
  dplyr::mutate(q_align = qend - qstart + 1#, # Length of alignment between query and reference
  ) %>%
  dplyr::mutate(full_pident = (pident * length)/(length - q_align + qlen)) 

# Check result is expected
top_hit %>%
  mutate(full_pident = round(full_pident, 3)) %>% # Round to 3rd digit for comparison
  mutate(correct = expected_pident == full_pident)

# A tibble: 3 × 11
  pident length qstart  qend  qlen  slen case                    expected_pident q_align full_pident correct
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                             <dbl>   <dbl>       <dbl> <lgl>  
1   0.99    100      0    99   100   100 query completely overl…           0.99      100       0.99  TRUE   
2   0.99     90      0    89   100   100 Part of query overlaps            0.891      90       0.891 TRUE   
3   0.99     90      0    99   110   100 Query longer than subj…           0.99      100       0.891 FALSE 

Correct for cases when whole or part of the query aligns (and the query is same length as subject), but fails for cases when query is longer than subject

New function

top_hit2 <- result %>%
  dplyr::mutate(q_align = qend - qstart + 1, # Length of alignment between query and reference
                q_len_adj = ifelse(qlen > slen, slen, qlen) # Handle case when query is longer than subject
  ) %>%
  dplyr::mutate(full_pident = (pident * length)/(length - q_align + q_len_adj)) 

# Check result is expected
top_hit2 %>%
  mutate(full_pident = round(full_pident, 3)) %>% # Round to 3rd digit for comparison
  mutate(correct = expected_pident == full_pident)

# A tibble: 3 × 12
  pident length qstart  qend  qlen  slen case          expected_pident q_align q_len_adj full_pident correct
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                   <dbl>   <dbl>     <dbl>       <dbl> <lgl>  
1   0.99    100      0    99   100   100 query comple…           0.99      100       100       0.99  TRUE   
2   0.99     90      0    89   100   100 Part of quer…           0.891      90       100       0.891 TRUE   
3   0.99     90      0    99   110   100 Query longer…           0.99      100       100       0.99  TRUE

Correct for all cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants