Consider switching host removal to graph based reference #297

jfy133 · 2023-05-22T09:47:04Z

Description of feature

As originally raised by @d4straub on Slack, a big issue in metagenomics/microbiome research is insufficient removal of host sequences from libraries, with many public data uploads containing individual-identifiable sequences.

In my opinion (shared with others), one of the biggest causes of this is the use of single linear reference genomes, which do not capture the full genetic diversity of the host species.

One solution to this is to map against reference graphs, which can contain variants (e.g. SNPs) from more than one individual or population.

@subwaystation, as a pangenome expert, suggests using a pre-computed human pangenome reference graph in combination with vg giraffe to do the mapping.

jfy133 · 2023-05-22T12:05:59Z

https://github.com/vgteam/vg/wiki/Mapping-short-reads-with-Giraffe#mapping-with-vg-giraffe

subwaystation · 2023-05-22T12:16:11Z

Hi James et al. :)

My suggestion would be that you take the HPRC CHM13 minigraph-cactus pangenome graph and map your reads against it using vg giraffe. As far as I know, there are pre-built indices available at https://github.com/human-pangenomics/hpp_pangenome_resources#minigraphcactus. However, they may not be compatible with the current vg version. I couldn't find any documentation on which index version fits which vg version. Ideally, vg giraffe would be able to report reads with low mapping quality, or reads that multimap, because we would have to drop these, too. But I lack experience here.

Hopefully @jeizenga can elaborate more. Else I will bug him in the US personally :P
Note that Jordan will also give a tutorial on read mapping with vg giraffe in the US, which might be more up-to-date than your link @jfy133. Will keep you posted.

One question from my side: Is there already a data set for benchmarking, or how would you evaluate this @d4straub?

jeizenga · 2023-05-22T16:30:32Z

As far as I know, there are up-to-date indexes at that link. If you want to find multimapping reads, you can use the -M argument in vg giraffe. If you want low MAPQ reads, you can pipe the results through vg filter -q {max mapq} -U -.
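The two commands above might be chained roughly as follows. This is only a minimal sketch that assembles the command line as a string and runs nothing. The index and read file names (graph.gbz, graph.min, graph.dist, reads.fq) and the values 2 and 30 are placeholders, not taken from the HPRC resources, and the giraffe input flags (-Z, -m, -d, -f) are assumed from the Giraffe wiki linked earlier, so check them against your installed vg version.

```python
import shlex


def giraffe_cmd(gbz, minimizer, dist, fastq, max_multimaps=None):
    """Build a `vg giraffe` argument list (nothing is executed)."""
    cmd = ["vg", "giraffe", "-Z", gbz, "-m", minimizer, "-d", dist, "-f", fastq]
    if max_multimaps is not None:
        # -M: report up to this many mappings per read, to expose multimappers
        cmd += ["-M", str(max_multimaps)]
    return cmd


def low_mapq_filter_cmd(mapq_threshold):
    """Build the `vg filter` call from the comment above.

    -q sets the MAPQ threshold, -U complements the filter so that reads
    below the threshold are the ones kept, and "-" reads GAM from stdin.
    """
    return ["vg", "filter", "-q", str(mapq_threshold), "-U", "-"]


def low_mapq_pipeline(gbz, minimizer, dist, fastq, mapq_threshold, max_multimaps=None):
    """Join both steps into a single shell pipeline string."""
    steps = (
        giraffe_cmd(gbz, minimizer, dist, fastq, max_multimaps),
        low_mapq_filter_cmd(mapq_threshold),
    )
    return " | ".join(shlex.join(step) for step in steps)


if __name__ == "__main__":
    # Placeholder file names and thresholds, for illustration only.
    print(low_mapq_pipeline("graph.gbz", "graph.min", "graph.dist", "reads.fq", 30, 2))
```

Building the command as an argument list first keeps quoting safe if file names contain spaces, and makes it easy to drop -M when multimapping reads are not of interest.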

subwaystation · 2023-05-22T16:47:26Z

Ah, that's good news, thanks @jeizenga!

@jfy133 @d4straub Do you now know a way forward? Happy to discuss this again in person.

jfy133 · 2023-05-22T18:10:05Z

I think we would need to experiment first, so it'll be a process! But these are some very useful first pointers, both for taxprofiler and also for mag etc. - thanks both!

Midnighter · 2023-05-22T20:13:07Z

One thing I really like about taxprofiler is that it supports long reads, too. Is there anything comparable to giraffe for long reads?

jeizenga · 2023-05-22T20:21:57Z

The vg giraffe developers are working on long read mapping, but it's not mature or stable yet. There's also GraphAligner, which is pretty good for noisy long reads (ONT <= R9, PacBio CLR) but not as great for accurate long reads (ONT >= R10, PacBio HiFi). You could also look at minigraph, but it's only appropriate for graphs that have primarily long nodes, i.e. ones that don't include point variation. There are also some experimental long read features with vg mpmap --nt-type DNA --read-length long.

subwaystation · 2023-05-23T20:44:06Z

To complete the short-read mapper list: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad320/7160913?login=false. However, the tool can only start from VCF and not from GFA, so you would not be able to use the HPRC graphs. It may be worth monitoring how the tool develops. There is already an open issue: thomas-buechler-ulm/gedmap#1.

d4straub · 2023-06-13T07:33:53Z

Very interesting discussion! I think we do need to evaluate this, but this seems more than just an afternoon of work.

> Is there already a data set for benchmarking, or how would you evaluate this

I haven't researched this, but that is an important question. I stumbled across publications that evaluated de-contamination, but as usual it's either synthetic datasets or the ground truth isn't known, as far as I recall.
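For the synthetic case, where the origin of every read is known, scoring could look something like this. It is a minimal sketch under my own assumptions and not an established benchmark: the function name, the metric names and the read IDs are all hypothetical.

```python
def score_host_removal(all_reads, true_host, removed):
    """Score a host-removal run against a known ground truth.

    all_reads: set of all read IDs in the synthetic library
    true_host: set of read IDs that were simulated from the host
    removed:   set of read IDs the pipeline discarded as host
    """
    true_non_host = all_reads - true_host
    host_removed = true_host & removed
    non_host_removed = true_non_host & removed

    return {
        # Fraction of host reads that were correctly removed.
        "host_sensitivity": len(host_removed) / len(true_host) if true_host else None,
        # Host reads that would still be present in the uploaded data.
        "residual_host_reads": len(true_host - removed),
        # Fraction of non-host (e.g. microbial) reads lost by mistake.
        "non_host_loss": len(non_host_removed) / len(true_non_host) if true_non_host else None,
    }


if __name__ == "__main__":
    reads = {"r1", "r2", "r3", "r4", "r5", "r6"}
    host = {"r1", "r2", "r3", "r4"}
    discarded = {"r1", "r2", "r3", "r5"}
    print(score_host_removal(reads, host, discarded))
```

Reporting residual host reads as an absolute count, next to the two fractions, reflects that even a handful of leftover host reads can be identifying.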

subwaystation · 2023-06-13T08:00:07Z

To help you get started, you can check out @jeizenga's MemPanG23 pangenome read mapping tutorials at https://pangenome.github.io/MemPanG23/#_practical_course_central_time_zone.

jfy133 added the enhancement Improvement for existing functionality label May 22, 2023
