Consider switching host removal to graph based reference #297
Comments
Hi James et al. :) My suggestion would be that you take the HPRC CHM13 minigraph-cactus pangenome graph and map your reads against it using […]. Hopefully @jeizenga can elaborate more. Otherwise I will bug him in the US personally :P One question from my side: is there already a data set for benchmarking, or how would you evaluate this, @d4straub?
As far as I know, there are up-to-date indexes at that link. If you want to find multimapping reads, you can use the […]
I think we would need to experiment first, so it'll be a process! But these are some very useful first pointers, both for taxprofiler and also mag etc. - thanks both!
One thing I really like about taxprofiler is that it supports long reads, too. Is there anything comparable to giraffe for long reads?
The […]
To complete the short-read mapper list: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad320/7160913?login=false. However, the tool can only start from VCF and not from GFA, so you would not be able to use the HPRC graphs. It may be worth monitoring how the tool develops. There is already an open issue: thomas-buechler-ulm/gedmap#1.
Very interesting discussion! I think we do need to evaluate this, but this seems more than just an afternoon of work.
I haven't researched this, but that is an important question. I stumbled across publications that evaluated decontamination, but as usual it's either synthetic datasets or the ground truth isn't known, as far as I recall.
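For the synthetic-dataset route, one possible way to score a host-removal run is sketched below. It assumes every simulated read carries a known origin label (host or not) and that the host-removal step can report which read names it discarded. The function name and both inputs are hypothetical, not something taxprofiler provides.

```python
def score_host_removal(truth, removed):
    """Score host removal against a labelled synthetic dataset.

    truth:   dict mapping read name -> True if the read is host-derived
             (hypothetical ground-truth labels from a read simulator).
    removed: set of read names discarded by the host-removal step.
    """
    tp = fn = fp = tn = 0
    for name, is_host in truth.items():
        if is_host and name in removed:
            tp += 1  # host read correctly removed
        elif is_host:
            fn += 1  # host read left in the library
        elif name in removed:
            fp += 1  # non-host read wrongly removed
        else:
            tn += 1  # non-host read correctly kept
    return {
        # fraction of host reads that were removed
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # fraction of non-host reads that were kept
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "host_reads_remaining": fn,
    }
```

The count of host reads remaining is reported separately because, for the privacy concern raised in this issue, even a small absolute number of leftover host reads matters more than the rate alone.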
To help you get started, you can check out @jeizenga's MemPanG23 pangenome read mapping tutorials at https://pangenome.github.io/MemPanG23/#_practical_course_central_time_zone.
Description of feature
As originally raised by @d4straub on Slack, a big issue in metagenomics/microbiome research is insufficient removal of host sequences from libraries, with many public data uploads containing individual-identifiable sequences.
In my opinion (shared with others), one of the biggest causes of this is suboptimal reference genomes, which do not capture the full diversity of host genomes.
One solution to this is to map against reference graphs that can contain SNPs from more than one individual or population.
@subwaystation, as a pangenome expert, suggests using a pre-computed human reference genome in combination with vg giraffe to do the mapping.
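As a rough sketch of what the downstream host-removal step could look like once reads have been mapped to the graph: the code below assumes the mapper's alignments are available as GAF text (tab-separated, read name in column 1, graph path in column 6), that unaligned reads are written with `*` in the path column, and that read names in the GAF match the FASTQ IDs exactly. Both function names are hypothetical, and none of this is existing taxprofiler behaviour.

```python
def mapped_read_names(gaf_lines):
    """Collect names of reads that aligned to the host graph.

    Assumes GAF records: column 1 is the read name, column 6 is the
    graph path, and '*' in column 6 marks an unaligned read.
    """
    names = set()
    for line in gaf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name, path = fields[0], fields[5]
        if path != "*":
            names.add(name)
    return names


def drop_host_reads(fastq_lines, host_names):
    """Yield the lines of 4-line FASTQ records whose ID is not in host_names."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:
            # The read ID is the header up to the first whitespace, without '@'.
            read_id = record[0][1:].split()[0]
            if read_id not in host_names:
                yield from record
            record = []
```

Both functions accept any iterable of lines, so open file handles can be passed directly. For paired-end data, the assumption about matching read names would need checking against the mapper's actual output before relying on this.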