Consider switching host removal to graph based reference #297

Open
jfy133 opened this issue May 22, 2023 · 10 comments
Labels
enhancement Improvement for existing functionality

Comments

@jfy133
Member

jfy133 commented May 22, 2023

Description of feature

As originally raised by @d4straub on Slack, a big issue in metagenomics/microbiome research is insufficient removal of host sequences from libraries, with many public data uploads containing individually identifiable sequences.

In my opinion (shared with others), one of the biggest causes of this is suboptimal reference genomes, which do not capture the whole diversity of host genomes.

One solution to this is to map against reference graphs that can contain SNPs from more than one individual or population.

@subwaystation, as a pangenome expert, suggests using a pre-computed human pangenome reference graph in combination with vg giraffe to do the mapping.

@jfy133 jfy133 added the enhancement Improvement for existing functionality label May 22, 2023
@jfy133
Member Author

jfy133 commented May 22, 2023

@subwaystation

subwaystation commented May 22, 2023

Hi James et al. :)

My suggestion would be that you take the HPRC CHM13 minigraph-cactus pangenome graph and map your reads against it using vg giraffe. As far as I know there are pre-built indices available at https://github.com/human-pangenomics/hpp_pangenome_resources#minigraphcactus. However, they may not be compatible with the current vg version. I couldn't find any documentation on which index version fits which vg version. Ideally, vg giraffe would be able to report reads with low mapping quality, as well as reads that multimap, because we would have to drop these, too. But I lack experience here.
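As a minimal sketch of the "drop these, too" idea: if the mapper output is written as GAF, a conservative host filter could treat every read with any alignment as host-like, whatever its MAPQ, and retain only reads with no alignment at all. The sketch assumes that an unaligned read carries `*` in the path column (column 6) and that both mates of a pair share one read name; the function name and the toy records are hypothetical and should be checked against real output.

```python
def split_host_reads(gaf_lines):
    """Partition read names into (host_like, retained) from GAF text lines.

    Assumptions (not confirmed in this thread): unaligned reads have '*'
    in the path column (column 6), and mates of a pair share one name.
    Any read with at least one alignment, at any MAPQ, counts as host-like.
    """
    host_like, seen = set(), set()
    for line in gaf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header-style lines
        fields = line.split("\t")
        name, path = fields[0], fields[5]
        seen.add(name)
        if path != "*":
            host_like.add(name)  # aligned somewhere in the graph: drop it
    return host_like, seen - host_like


# Toy GAF records (hypothetical): r1 maps well, r2 maps with MAPQ 0, r3 is unaligned.
toy_gaf = [
    "r1\t150\t0\t150\t+\t>1>2\t300\t10\t160\t150\t150\t60",
    "r2\t150\t0\t150\t+\t>5\t200\t0\t150\t140\t150\t0",
    "r3\t150\t*\t*\t*\t*\t*\t*\t*\t*\t*\t255",
]
host_like, retained = split_host_reads(toy_gaf)
print(sorted(host_like), sorted(retained))
```

Dropping on any alignment is the strictest choice for privacy; it will also discard some microbial reads that happen to align to the human graph, which is one of the things a benchmark would need to quantify.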

Hopefully @jeizenga can elaborate more. Otherwise I will bug him in the US personally :P
Note that Jordan will also give a tutorial on read mapping with vg giraffe in the US, which might be more up-to-date than your link @jfy133. Will keep you posted.

One question from my side: Is there already a data set for benchmarking, or how would you evaluate this @d4straub?

@jeizenga

jeizenga commented May 22, 2023

As far as I know, there are up-to-date indexes at that link. If you want to find multimapping reads, you can use the -M argument in vg giraffe. If you want low MAPQ reads, you can pipe the results through vg filter -q {max mapq} -U -.
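To make the two suggestions above concrete, here is a rough sketch that only assembles the command lines and does not run vg. The index and FASTQ file names are hypothetical placeholders, the values for `-M` and the MAPQ cutoff are arbitrary examples, and the exact index files expected may differ between vg versions (as noted earlier in the thread), so check `vg giraffe --help` for the installed version.

```python
import shlex


def giraffe_cmd(gbz, minimizer, dist, fastq1, fastq2=None,
                max_multimaps=None, threads=4):
    """Build a 'vg giraffe' argument list (all file names are placeholders)."""
    cmd = ["vg", "giraffe", "-Z", gbz, "-m", minimizer, "-d", dist, "-f", fastq1]
    if fastq2 is not None:
        cmd += ["-f", fastq2]  # second mate for paired-end input
    if max_multimaps is not None:
        cmd += ["-M", str(max_multimaps)]  # report multimapping alignments
    cmd += ["-t", str(threads)]
    return cmd


def low_mapq_filter_cmd(max_mapq):
    """Build the filter from the comment above: vg filter -q {max mapq} -U -"""
    return ["vg", "filter", "-q", str(max_mapq), "-U", "-"]


def pipeline(map_cmd, filter_cmd, out_gam):
    """Render both steps as one shell pipeline string."""
    return (shlex.join(map_cmd) + " | " + shlex.join(filter_cmd)
            + " > " + shlex.quote(out_gam))


mapping = giraffe_cmd("hprc.gbz", "hprc.min", "hprc.dist",
                      "reads_1.fq.gz", "reads_2.fq.gz", max_multimaps=10)
print(pipeline(mapping, low_mapq_filter_cmd(30), "low_mapq.gam"))
```

Building argument lists rather than a single string keeps quoting safe if the same lists are later passed to `subprocess.run`.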

@subwaystation

Ah, that's good news, thanks @jeizenga!

@jfy133 @d4straub Do you now know a way forward? Happy to discuss this again in person.

@jfy133
Member Author

jfy133 commented May 22, 2023

I think we would need to experiment first, so it'll be a process! But these are some very useful first pointers, both for taxprofiler and also for mag etc. - thanks both!

@Midnighter
Collaborator

One thing I really like about taxprofiler is that it supports long reads, too. Is there anything comparable to giraffe for long reads?

@jeizenga

The vg giraffe developers are working on long read mapping, but it's not mature or stable yet. There's also GraphAligner, which is pretty good for noisy long reads (ONT <= R9, PacBio CLR) but not as great for accurate long reads (ONT >= R10, PacBio HiFi). You could also look at minigraph, but it's only appropriate for graphs that have primarily long nodes, i.e. ones that don't include point variation. There are also some experimental long read features with vg mpmap --nt-type DNA --read-length long.
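The options in this comment can be summarised as a small lookup. This is only a sketch of the advice as stated here, not a benchmark result: the read-type keys and the function name are hypothetical, and the wording in the returned strings paraphrases the comment.

```python
NOISY = {"ont_r9", "pacbio_clr"}      # ONT <= R9, PacBio CLR
ACCURATE = {"ont_r10", "pacbio_hifi"}  # ONT >= R10, PacBio HiFi


def long_read_options(read_type, graph_has_point_variation):
    """List the long-read graph mappers mentioned in the comment above."""
    if read_type not in NOISY | ACCURATE:
        raise ValueError(f"unknown read type: {read_type}")
    options = []
    if read_type in NOISY:
        options.append("GraphAligner (described as pretty good for noisy long reads)")
    else:
        options.append("GraphAligner (described as not as great for accurate long reads)")
    if not graph_has_point_variation:
        # minigraph is only appropriate for graphs with primarily long nodes
        options.append("minigraph")
    options.append("vg mpmap --nt-type DNA --read-length long (experimental)")
    return options


print(long_read_options("ont_r9", graph_has_point_variation=True))
```

vg giraffe is left out of the lookup on purpose, since its long-read mode was described as not yet mature or stable at the time of this comment.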

@subwaystation

subwaystation commented May 23, 2023

To complete the short-read mapper list: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad320/7160913?login=false. However, the tool can only start from VCF and not from GFA, so you would not be able to use the HPRC graphs. It may be worth monitoring how the tool develops. There is already an open issue: thomas-buechler-ulm/gedmap#1.

@d4straub

Very interesting discussion! I think we do need to evaluate this, but this seems more than just an afternoon of work.

Quoting @subwaystation: "Is there already a data set for benchmarking, or how would you evaluate this"

I haven't researched this, but that is an important question. I stumbled across publications that evaluated decontamination, but as usual it's either synthetic datasets or the ground truth isn't known, as far as I recall.
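For the synthetic-dataset case, where the origin of every read is known, an evaluation could be sketched roughly as below. The function name, metric names, and toy read IDs are hypothetical; the inputs are assumed to be the set of all read IDs, the IDs truly simulated from the host, and the IDs a given method flagged as host.

```python
def host_removal_metrics(all_reads, true_host, flagged_host):
    """Score a host-removal run against a known ground truth."""
    all_reads = set(all_reads)
    true_host = set(true_host) & all_reads
    flagged = set(flagged_host) & all_reads
    tp = len(true_host & flagged)   # host reads correctly removed
    fn = len(true_host - flagged)   # host reads left in the library
    fp = len(flagged - true_host)   # non-host reads wrongly removed
    tn = len(all_reads) - tp - fn - fp
    return {
        "host_sensitivity": tp / (tp + fn) if tp + fn else None,
        "non_host_retained": tn / (tn + fp) if tn + fp else None,
        "residual_host_reads": fn,
    }


# Toy example: 4 simulated host reads, 6 simulated microbial reads.
all_ids = ["h1", "h2", "h3", "h4", "m1", "m2", "m3", "m4", "m5", "m6"]
metrics = host_removal_metrics(all_ids,
                               true_host=["h1", "h2", "h3", "h4"],
                               flagged_host=["h1", "h2", "h3", "m1"])
print(metrics)
```

Reporting the residual host read count alongside the rates matters here, because for privacy even a small number of remaining host reads can be the relevant outcome. Comparing a linear reference against a graph reference on the same simulated library would then be a matter of computing these metrics for each.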

@subwaystation

To help you get started, you can check out @jeizenga's MemPanG23 pangenome read mapping tutorials at https://pangenome.github.io/MemPanG23/#_practical_course_central_time_zone.


5 participants