Consider using "sourmash search --containment" for "translate" #86

olgabot · 2020-08-03T17:10:26Z

Currently, sencha translate uses a simple check: whether 100% of the k-mers from a reading frame match the reference proteome. But a "Franken k-mer" situation can arise in which the reading frame has a 100% match while its k-mers all come from different genes. Using sourmash search --containment would only search for "consecutive" k-mers that all appear in a single gene (or maybe a family of genes?), which would be an improvement over the current method.
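To make the "Franken k-mer" problem concrete, here is a minimal pure-Python sketch. It does not call sencha or sourmash; the function names (`kmers`, `pooled_match`, `best_containment`), the toy sequences, and k=3 are all made up for illustration. It only contrasts a pooled "is every k-mer somewhere in the proteome" check with a per-gene containment score.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pooled_match(query, genes, k):
    """Fraction of query k-mers found anywhere in the reference.

    Mimics a pooled check where gene identity is ignored.
    Assumes len(query) >= k so the query has at least one k-mer.
    """
    pool = set().union(*(kmers(s, k) for s in genes.values()))
    q = kmers(query, k)
    return len(q & pool) / len(q)


def best_containment(query, genes, k):
    """Return (score, gene_name) for the single gene containing the
    largest fraction of the query's k-mers."""
    q = kmers(query, k)
    return max((len(q & kmers(s, k)) / len(q), name)
               for name, s in genes.items())


# Toy reference "proteome" (hypothetical sequences).
genes = {"geneA": "MKTAYIAK", "geneB": "AKQRQISF"}

# A chimeric query: its first k-mers come from geneA, its last from geneB.
query = "MKTAYIAKQRQ"

pooled = pooled_match(query, genes, 3)
score, name = best_containment(query, genes, 3)
print(pooled, score, name)
```

With the pooled check the chimeric query looks like a perfect match, while no single gene contains all of its k-mers, which is the distinction a containment search against per-gene signatures would expose.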

Thanks to @bluegenes for the idea!

olgabot · 2020-08-03T17:49:53Z

Maybe use sourmash lca search --containment by hacking the Least Common Ancestor (LCA) code to use gene/protein families instead of individual genes, so that the matches could be on a per-gene family level.
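The per-family idea can be sketched the same way: pool the k-mers of all genes in a family, then score containment against each family. This is an illustration only and not sourmash's LCA code; the `gene_to_family` mapping, the function names, and the sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def family_containment(query, genes, gene_to_family, k):
    """Map each family name to the fraction of query k-mers that occur
    in any gene of that family. Assumes len(query) >= k."""
    family_kmers = defaultdict(set)
    for name, seq in genes.items():
        family_kmers[gene_to_family[name]] |= kmers(seq, k)
    q = kmers(query, k)
    return {fam: len(q & ks) / len(q) for fam, ks in family_kmers.items()}


# Hypothetical genes and family assignments.
genes = {"geneA1": "MKTAYIAK", "geneA2": "AKQRQISF", "geneC": "GGSLLPWW"}
gene_to_family = {"geneA1": "fam1", "geneA2": "fam1", "geneC": "fam2"}

# The query's k-mers are split across two members of fam1.
result = family_containment("MKTAYIAKQRQ", genes, gene_to_family, 3)
print(result)
```

Here a query split across two members of the same family is fully contained at the family level, even though neither gene alone contains it. Whether that is desirable depends on how strictly "single gene" should be enforced.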
