Issue with Algorithm Progress Bar Stuck at jcmq.run - Assistance Requested #52

dnc-github · 2024-11-11T02:21:15Z

Hello,

I'm experiencing an issue with the algorithm you've developed. When I reach the jcmq.run step, the progress bar stalls, and the process has not advanced in 3 days. I'm reaching out to see if you have any suggestions on how to resolve this, or if there's additional information you need from me to diagnose the problem.

Issue Description:
The progress bar stops updating when the algorithm executes the jcmq.run function. This has happened consistently across multiple attempts.

What I've Tried:
Restarting the algorithm.
Checking the system for any resource constraints (CPU, memory).
Ensuring that the input data is formatted correctly.

Additional Questions:
Could you advise on any common reasons for this behavior?
Are there any specific logs or system information that would be helpful for you to review?
I've attached screenshots of the error. If there's anything else you need from me, please let me know.

Thank you for your help.

Best,
Danni

frankligy · 2024-11-12T08:30:38Z

Hi @dnc-github,

Sorry for the inconvenience. First of all, I just want to confirm a couple of things:

[1] Are you using netMHCpan or MHCflurry for the binding prediction?
[2] How is the memory usage for the job? I know you said you checked the memory, but based on this thread (https://stackoverflow.com/questions/44534288/multiple-fork-calls-cause-blockingioerror), running out of memory seems to be one of the reasons for this exact error. What is the total RAM you set for the job, and how many cores did you specify?

As for the log, if you could paste the whole stdout (the screenshots are good, but they show only part of it), that would help me get the bigger picture.

Let me know,
Frank

dnc-github · 2024-11-13T08:10:49Z

Hi Frank,

Thank you for reaching out.

I'm currently using netMHCpan for the binding predictions.
I haven't specified a fixed memory limit. I typically let the job run based on the available memory on the machine at the time, without setting any specific constraints.
We have 30 RNA-seq samples. Could you give some guidance on the approximate memory that would typically be sufficient? Any rough estimate would be very helpful.
I have attached the full error log at the bottom to give you a complete picture.

Thanks for your patience, and please let me know if there's anything else I should check.

Best regards,
Danni

Please see the error log below:
1%| | 2/212 [00:12<21:34, 6.16s/it]2024-11-12 17:54:07.096532: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:07.172063: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:07.275638: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
1%|▏ | 3/212 [00:12<15:01, 4.31s/it]2024-11-12 17:54:07.343846: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:07.358881: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2%|▏ | 4/212 [00:13<11:36, 3.35s/it]2024-11-12 17:54:08.027306: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2%|▏ | 4/212 [00:13<10:59, 3.17s/it]2024-11-12 17:54:08.178822: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:08.178909: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:08.721788: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:08.928115: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
1%| | 2/212 [00:14<26:00, 7.43s/it]2024-11-12 17:54:09.090625: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
1%|▏ | 3/212 [00:15<17:50, 5.12s/it]2024-11-12 17:54:11.112648: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2024-11-12 17:54:11.541753: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
1%|▏ | 3/212 [00:20<24:33, 7.05s/it]2024-11-12 17:54:15.606423: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
2%|▏ | 4/212 [00:26<24:16, 7.00s/it]2024-11-12 17:54:21.832821: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
3%|▎ | 6/212 [00:31<15:16, 4.45s/it]2024-11-12 17:54:27.044754: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
3%|▎ | 6/212 [00:45<22:37, 6.59s/it]2024-11-12 17:54:40.504924: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
74%|███████▎ | 156/212 [16:05<03:58, 4.26s/it]2024-11-12 18:10:00.901939: F tensorflow/core/platform/default/env.cc:72] Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.

frankligy · 2024-11-14T03:14:40Z

Hi Danni,

So far my guess is still memory usage. When you run:

jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=30,add_control=add_control,outdir='result')

The cores parameter specifies the number of CPUs you'd like to use. For instance, if I set cores=30, I have to make sure my computer or compute node has 30 cores. Assuming you are using an institutional HPC, this is usually set with a shell directive (the example below is for the SLURM job scheduler):

#!/bin/bash
#SBATCH --partition=cpu_medium
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=30
#SBATCH --time=1-00:00:00
#SBATCH --mem=150Gb
#SBATCH --job-name="run_snaf"

The --cpus-per-task directive makes sure the job is allocated 30 CPUs.
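
As a quick sanity check, the sketch below compares a requested core count with what the running Python process can actually use. It is not part of SNAF: the helper names usable_cpus and safe_cores are made up for illustration, and it assumes that on Linux the CPU affinity mask reflects the scheduler's allocation, which is common under SLURM but depends on how the cluster is configured.

```python
import os


def usable_cpus():
    """Return how many CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: size of the CPU affinity mask of the current process.
        return len(os.sched_getaffinity(0))
    # Fallback for other platforms: total CPUs on the machine.
    return os.cpu_count() or 1


def safe_cores(requested, available=None):
    """Clamp a requested worker count to the CPUs that are available."""
    if available is None:
        available = usable_cpus()
    return max(1, min(requested, available))


if __name__ == "__main__":
    # SLURM sets this variable when --cpus-per-task is given; it is absent otherwise.
    print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK", "not set"))
    print("usable CPUs:", usable_cpus())
    print("cores to use if 30 are requested:", safe_cores(30))
```

If the reported number is lower than the value passed as cores, the job is oversubscribed, and either the allocation or the cores argument should be adjusted so the two match.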

Regarding memory: with Python multiprocessing, each worker process uses its own memory, so the total usage adds up across cores. I did a small test before (#27 (comment)) in which I concluded:

So I guess, in conclusion, by using netMHCpan and multiple cores, the memory usage should not exceed 25G, and from my experience for over 20K Neojunctions (after filter, before filter will be like 150K junctions), 200GB should be sufficient.

Again, the memory can be set with a shell directive as well. In SLURM, as shown above, it is controlled by --mem.
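
One more thing that may be worth checking alongside memory: the 11 in "Check failed: ret == 0 (11 vs. 0)" is the error code returned by pthread_create(). On Linux, errno 11 is EAGAIN, which pthread_create() returns when there are insufficient resources to create another thread or when a system limit on the number of threads has been reached. The sketch below (plain standard library, not SNAF code; the function names are illustrative) prints what errno 11 means on the current machine and the per-user process/thread limit.

```python
import errno
import os
import resource


def describe_limit(value):
    """Render an rlimit value, which may be RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def thread_limit_report():
    """Collect the meaning of errno 11 and the RLIMIT_NPROC soft/hard limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        # Symbolic name and message for errno 11 on this platform.
        "errno_11_name": errno.errorcode.get(11, "unknown"),
        "errno_11_text": os.strerror(11),
        # Limit on processes (and, on Linux, threads) for the current user.
        "nproc_soft": describe_limit(soft),
        "nproc_hard": describe_limit(hard),
    }


if __name__ == "__main__":
    for key, value in thread_limit_report().items():
        print(f"{key}: {value}")
```

A low soft limit combined with many worker processes, each starting its own threads, could produce this failure even when memory is sufficient, so it helps to run this in the same environment as the job (for example inside the SLURM script).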

Last but not least, sorry, I meant to ask for stdout rather than stderr. What you shared is stderr, which is helpful, but the stdout would be informative for figuring out which particular step may be the culprit.
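
If it helps, here is a minimal sketch for collecting the two streams in separate files so that each can be shared in full. It uses only the standard library; run_with_split_logs is a hypothetical helper name, not something provided by SNAF.

```python
import subprocess
import sys


def run_with_split_logs(cmd, stdout_path, stderr_path):
    """Run cmd, writing stdout and stderr to separate files.

    Returns the exit code of the command.
    """
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        completed = subprocess.run(cmd, stdout=out, stderr=err)
    return completed.returncode
```

For example, with a hypothetical driver script named run_snaf.py, calling run_with_split_logs([sys.executable, "run_snaf.py"], "snaf.stdout.log", "snaf.stderr.log") would leave the printed step messages in the first file and the error messages in the second. Progress bars drawn with tqdm go to stderr by default, which would explain why they show up next to the TensorFlow errors in the log above.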

Best,
Frank

shengxindaniu · 2025-01-21T03:33:16Z

Hi Frank,
I also encountered the same issue.
Best regards,
Shu

frankligy · 2025-01-21T23:00:54Z

Hi @shengxindaniu,

Although I cannot tell for sure what the issue is here, it seems that you are running in an interactive Python console launched on your HPC/computer, and that you specified 5 CPUs for this job. Does the node on which you are running the job actually have 5 CPUs available, or was only 1 CPU requested?

I have a dummy test input and code here (https://github.com/frankligy/SNAF/tree/main/test). If you execute the lines of the Python code on your end, does it run successfully?

Also could you paste your code before JunctionCountMatrixQuery line? Just curious.

Thank you,
Frank
