Handle when ASV is longer than reference database sequence (BLAST) #26

jackscanlan · 2024-08-22T01:02:26Z

taxreturn::blast_top_hit function seems to have an issue that underestimates % identity when the ASV query is longer than reference subject. Probably related to this code: dplyr::mutate(full_pident = (pident * length)/(length - q_align + qlen))

Effects of this can be seen in TAX_SUMMARY_MERGE's taxonomic assignment summary file but likely affects other BLAST-based assignment throughout pipeline too. Most likely to be an issue when using Nanopore data with primers that sit outside the end of reference database sequence.

Could be worth re-implementing a modified version of the function in the bin/functions.R script.

The text was updated successfully, but these errors were encountered:

alexpiper · 2024-08-30T00:39:36Z

Confirming this was a bug in the full_pident calculation, and ive now updated the taxreturn function to handle this. But id suggest we move the blast_top_hit function from taxreturn over to freyr, and run blastn from the command line in a seperate process rather than calling it through R.

Ideally we should migrate all taxreturn code that freyr depends on over to this repo for ease of updating without having to rebuild the docker container. I think the PHMM stuff is the only other code that currently requires taxreturn.

The reasoning behind the full_pident calcultion is that blast may only align a portion of the sequence leaving gaps at the end, overestimating % identity. The full_pident calculation corrects this by calculating the % identityy across full length of query sequence. The problem was that if the query was longer than the subject, the calculation underestimates the % identity, because part of the query sequence is inherently unalignable.

Create test cases

Blast results are 0 indexed

result <- tribble(
  ~pident, ~length, ~qstart, ~qend, ~qlen, ~slen, ~case, ~expected_pident,
  0.99, 100, 0, 99, 100, 100, "Query completely overlaps", 0.99,
  0.99, 90, 0, 89, 100, 100, "Part of query overlaps", 0.891,
  0.99, 90, 0, 99, 110, 100, "Query longer than subject", 0.99
)

# A tibble: 3 × 8
  pident length qstart  qend  qlen  slen case                      expected_pident
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                               <dbl>
1   0.99    100      0    99   100   100 Query completely overlaps           0.99 
2   0.99     90      0    89   100   100 Part of query overlaps              0.891
3   0.99     90      0    99   110   100 Query longer than subject           0.99

Old function

top_hit <- result %>%
  dplyr::mutate(q_align = qend - qstart + 1#, # Length of alignment between query and reference
  ) %>%
  dplyr::mutate(full_pident = (pident * length)/(length - q_align + qlen)) 

# Check result is expected
top_hit %>%
  mutate(full_pident = round(full_pident, 3)) %>% # Round to 3rd digit for comparison
  mutate(correct = expected_pident == full_pident)

# A tibble: 3 × 11
  pident length qstart  qend  qlen  slen case                    expected_pident q_align full_pident correct
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                             <dbl>   <dbl>       <dbl> <lgl>  
1   0.99    100      0    99   100   100 query completely overl…           0.99      100       0.99  TRUE   
2   0.99     90      0    89   100   100 Part of query overlaps            0.891      90       0.891 TRUE   
3   0.99     90      0    99   110   100 Query longer than subj…           0.99      100       0.891 FALSE

Correct for cases when whole or part of the query aligns (and the query is same length as subject), but fails for cases when query is longer than subject

New function

top_hit2 <- result %>%
  dplyr::mutate(q_align = qend - qstart + 1, # Length of alignment between query and reference
                q_len_adj = ifelse(qlen > slen, slen, qlen) # Handle case when query is longer than subject
  ) %>%
  dplyr::mutate(full_pident = (pident * length)/(length - q_align + q_len_adj)) 

# Check result is expected
top_hit2 %>%
  mutate(full_pident = round(full_pident, 3)) %>% # Round to 3rd digit for comparison
  mutate(correct = expected_pident == full_pident)

# A tibble: 3 × 12
  pident length qstart  qend  qlen  slen case          expected_pident q_align q_len_adj full_pident correct
   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>                   <dbl>   <dbl>     <dbl>       <dbl> <lgl>  
1   0.99    100      0    99   100   100 query comple…           0.99      100       100       0.99  TRUE   
2   0.99     90      0    89   100   100 Part of quer…           0.891      90       100       0.891 TRUE   
3   0.99     90      0    99   110   100 Query longer…           0.99      100       100       0.99  TRUE

Correct for all cases

jackscanlan added the bug Something isn't working label Aug 22, 2024

jackscanlan assigned alexpiper Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle when ASV is longer than reference database sequence (BLAST) #26

Handle when ASV is longer than reference database sequence (BLAST) #26

jackscanlan commented Aug 22, 2024 •

edited

Loading

alexpiper commented Aug 30, 2024 •

edited

Loading

Handle when ASV is longer than reference database sequence (BLAST) #26

Handle when ASV is longer than reference database sequence (BLAST) #26

Comments

jackscanlan commented Aug 22, 2024 • edited Loading

alexpiper commented Aug 30, 2024 • edited Loading

Create test cases

Old function

New function

jackscanlan commented Aug 22, 2024 •

edited

Loading

alexpiper commented Aug 30, 2024 •

edited

Loading