Why is STEP4 taking longer to run after stricter QC and host read removal? #901
Comments
diamond.nr.log for one sample that has finished in a reasonable time and one stuck in STEP4. Sample that finished -- diamond.nr.log
Hard to tell, but the second one took 4 times longer to load the query sequences, so maybe it just has more ORFs? What's the number of contigs/ORFs before and after QC / host removal for the same samples? You can find those fasta files in
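A quick way to get those counts is a small script like the one below. It is only a sketch and not part of SqueezeMeta: the function name `fasta_stats` is made up for illustration, and the files to check are whatever paths you pass on the command line. It counts header lines (`>`) as records and sums the sequence lengths, so it works for both contig and ORF fasta files.

```python
import gzip
import sys


def fasta_stats(path):
    """Return (number of records, total residues) for a FASTA file.

    Plain-text and gzip-compressed (.gz) files are both accepted.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    n_records = 0
    n_residues = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Each header line starts a new sequence record.
                n_records += 1
            else:
                # Sequence lines may be wrapped, so sum them all.
                n_residues += len(line.strip())
    return n_records, n_residues


if __name__ == "__main__":
    # Usage: python fasta_stats.py before_qc.fasta after_qc.fasta
    for fasta_path in sys.argv[1:]:
        records, residues = fasta_stats(fasta_path)
        print(f"{fasta_path}\t{records} sequences\t{residues} residues")
```

Running it on the contig and ORF files from the old and new projects side by side should show whether the stuck sample really has more query sequences than before.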
More stringent QC can improve the assembly by removing low-quality sequences that would otherwise hinder it. Hence more contigs, more ORFs, and a longer processing time.
This is all running on an HPC with 350 GB RAM and 32 cores, which is exactly the same allocation as previously. They're pretty big samples, around 12.5 gigabases per sample (6 samples per coassembly).
For the sample still running:
Before strict QC
After strict QC
For the one that finished:
Before strict QC
After strict QC
So you indeed have fewer sequences now, even though it's taking much longer.
I've been keeping an eye on CPU usage and it's pretty stable at 100% throughout the day. I've contacted the HPC admins to see if they have any ideas/solutions! Do you think I can speed the runtime up by chucking more CPUs at it? Not exactly an elegant fix, but I want the jobs done! Would I have to restart from STEP1 if I change the number of CPUs? In the meantime, do you guys have any other ideas/things to check? No worries if not, we can wait and see what the admins come back with on my end.
It's hard to tell, and it seems strange that it takes longer with a smaller dataset. Maybe the DIAMOND developer will have some insight.
Hi,
I ran the pipeline a little while ago, using pretty light QC on my reads and without removing host reads. I have a few different sample types which I'm running in coassembly separately. Each sample type (6 metagenomes per sample type) took about 10 days to run the full pipeline.
Now, after choosing to use a more stringent QC approach and filtering out host reads before using SqueezeMeta (which reduces reads by about 30-40%), STEP4 is taking far longer to run. So far 6 out of 11 sample types have finished within 20 days (full pipeline) and some are still running (on STEP4 for ~17-21 days).
I was wondering if you had any suggestions as to why more stringent QC and host read removal would increase the STEP4 run time so much?
In case it has any impact, these samples were started at the same time as the one in issue #893, where the multiple starts and stops were likely the issue. However, I have tried re-running these samples in a completely new project/directory and the hanging on STEP4 persists.
Thanks so much for all your help with my issues!