
Why is STEP4 taking longer to run after stricter QC and host read removal? #901

Open
SamBrutySci opened this issue Nov 4, 2024 · 7 comments

Comments

@SamBrutySci

Hi,

I ran the pipeline a little while ago, using fairly light QC on my reads and without removing host reads. I have a few different sample types, which I'm running as separate coassemblies. Each sample type (6 metagenomes per sample type) took about 10 days to run the full pipeline.

Now, after switching to a more stringent QC approach and filtering out host reads before running SqueezeMeta (which reduces reads by about 30-40%), STEP4 is taking far longer. So far 6 out of 11 sample types have finished the full pipeline within 20 days, and some are still running (stuck on STEP4 for ~17-21 days).

I was wondering if you had any suggestions as to why more stringent QC and host read removal would increase the STEP4 run time so much?

In case it has any impact: these samples were started at the same time as the one in issue #893, where multiple starts/stops were the likely cause. However, I have tried re-running these samples in a completely new project/directory and the hanging on STEP4 persists.

Thanks so much for all your help with my issues!

@SamBrutySci
Author

diamond.nr.log for one sample that finished in a reasonable time and one stuck on STEP4:

Sample that finished -- diamond.nr.log
Sample stuck on STEP4 for ~18 days -- diamond.nr.log

@fpusan
Collaborator

fpusan commented Nov 5, 2024

Hard to tell, but the second one took four times longer to load the query sequences, so maybe it just has more ORFs?

What's the number of contigs/ORFs before and after QC/host removal for the same samples? You can find those FASTA files in `project/results/01.*.fasta` and `project/results/03.*.faa`.
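A quick way to get those counts is to count FASTA headers (a sketch; `project` here is a placeholder for your actual project directory):

```shell
# Count FASTA headers (one '>' line per sequence) in the assembly (01.*)
# and ORF (03.*) result files. "project" is a placeholder path.
for f in project/results/01.*.fasta project/results/03.*.faa; do
    [ -e "$f" ] || continue   # skip patterns that matched nothing
    printf '%s: %s sequences\n' "$f" "$(grep -c '^>' "$f")"
done
```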

@jtamames
Owner

jtamames commented Nov 5, 2024

A more stringent QC can improve the assembly by removing low-quality sequences that would otherwise hinder it: hence more contigs, more ORFs, and a longer processing time.
Nevertheless, 18 days is too much. What kind of computer are you using?
Best,
J

@SamBrutySci
Author

This is all running on an HPC with 350 GB RAM and 32 cores, which is exactly the same allocation as before. They're pretty big samples, around 12.5 gigabases per sample (6 samples per coassembly).

For the sample still running:

Before strict QC
01.Cameor.fasta contains 6745493 sequences
03.Cameor.faa contains 8117038 sequences

After strict QC
01.Cameor.fasta contains 3634824 sequences
03.Cameor.faa contains 5170450 sequences

For the one that finished:

Before strict QC
01.0015.fasta contains 7662580 sequences
03.0015.faa contains 8008910 sequences

After strict QC
01.0015.fasta contains 1903809 sequences
03.0015.faa contains 2592522 sequences

@fpusan
Collaborator

fpusan commented Nov 5, 2024

So you do indeed have fewer sequences now, even though it's taking much longer.
Any chance your HPC filesystem is more strained now than it was before? If there is latency accessing the database, DIAMOND can take significantly longer to run. An easy way to tell whether you are I/O-limited is to look at CPU usage: when it is not loading data, DIAMOND will use all the CPUs you throw at it at 100%, but if it starts having to wait on database reads, the average CPU usage will drop.

@SamBrutySci
Author

I've been keeping an eye on CPU usage and it's pretty stable at 100% throughout the day. I've contacted the HPC admins to see if they have any ideas/solutions!

Do you think I could speed up the runtime by throwing more CPUs at it? Not exactly an elegant fix, but I want the jobs done! Would I have to restart from STEP1 if I change the number of CPUs?

In the meantime, do you have any other ideas/things to check? No worries if not; we can wait and see what the admins come back with on my end.

@fpusan
Collaborator

fpusan commented Nov 7, 2024

It's hard to tell, and it seems strange that it takes longer with a smaller dataset. Maybe the DIAMOND developers will have some insight.
If you have 350 GB of RAM you can probably increase the DIAMOND block size a lot (the `-b` parameter when calling SqueezeMeta). Increasing it will speed things up at the cost of memory usage. By default we calculate it based on the available RAM, but we cap it at 15 IIRC, because we've had issues with DIAMOND running out of memory before.
