Individual pipeline failure not propagated to the leading job #119

Open

zhuchcn opened this issue Jun 29, 2023 · 9 comments

@zhuchcn
Member

zhuchcn commented Jun 29, 2023

Describe the issue

@smahesh12 submitted 3 samples in a multi-sample run; two of them failed because scratch was detached, but the stdout from the leading job says all three samples were successful. I think this is also because of the job limiter, so individual pipeline failures were not propagated to the leading job. Maybe we can just raise the error here, after this line? Maybe just add exit 1?

echo "Process in ${work_dir} failed with non-zero exit code."

@yashpatel6
Collaborator

yashpatel6 commented Jun 29, 2023

The main log file should have indicated that the samples failed, with that message on the line you've highlighted; was that not the case?

@yashpatel6
Collaborator

I did try having exit 1 at that point with errorStrategy set to ignore, but that results in the message indicating which sample/work_dir failed not being printed; instead, Nextflow only reports that the status check process failed. That makes it difficult to identify which specific sample/work_dir failed, so I left the reporting as a message printed to the main log file without letting the status check process fail.
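(For context, the ignore setting mentioned here is the standard Nextflow errorStrategy directive, roughly along these lines in the config; the exact selector in the pipeline may differ:)

// nextflow.config sketch; the selector name may differ in the actual pipeline
process {
    withName: 'check_process_status' {
        errorStrategy = 'ignore'   // failed tasks are reported but the run continues
    }
}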

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

Oh, I do see the log message. But the final status is success, and so is the status in the trace txt file, which is also a little confusing.

executor >  local (10)
[58/1fa394] process > create_CSV_metapipeline_DNA... [100%] 3 of 3 ✔
[00/27fc23] process > create_config_metapipeline_DNA [100%] 1 of 1 ✔
[34/55b72f] process > call_metapipeline_DNA (3)      [100%] 3 of 3 ✔
[7b/395dd3] process > check_process_status (3)       [100%] 3 of 3 ✔
Process in /hot/project/process/MissingPeptides/MISP-000132-MissingPeptidesPanCanP1/CPTAC_LUAD/leading_work_dir/34/55b72f435df118276d7b50c2824553 failed with non-zero exit code.


Completed at: 27-Jun-2023 01:54:06
Duration    : 3d 7h 36m 56s
CPU hours   : 122.5
Succeeded   : 10

If we print the message but also exit with 1, will the printed message be hidden?

@yashpatel6
Collaborator

If we print the message but also exit with 1, will the printed message be hidden?

Yeah, this was the issue I observed: if the process exits 1, then the echo statements end up not getting printed to the log file. We could potentially add another process to handle reporting the failure through exit 1, so we'd have both the message printed by the current process and the failure indicated explicitly by the other process.
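A rough sketch of that extra-process idea; the process name, input wiring, and script below are hypothetical, not existing pipeline code:

// Hypothetical DSL2 process: it receives the collected list of failed
// work directories and fails itself, so the overall run status reflects
// the failure while check_process_status keeps printing the messages.
process report_failed_samples {
    input:
    val failed_dirs

    script:
    """
    if [ -n "${failed_dirs.join(' ')}" ]; then
        echo "Failed work directories: ${failed_dirs.join(' ')}"
        exit 1
    fi
    """
}

With something along these lines, the per-sample messages stay in the main log from the current process, and the run-level failure comes from the new process.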

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

I'm wondering whether it's worth fixing, but setting errorStrategy to ignore seems problematic because all subsequent samples will continue.

So if we echo the message and exit with 1, we should still be able to find the message from the .command.log file? I'm thinking if that's actually easier.

@yashpatel6
Collaborator

I'm wondering whether it's worth fixing, but setting errorStrategy to ignore seems problematic because all subsequent samples will continue.

This is the behavior we would want though: even if some samples fail, the rest of them should continue. The issue comes from the interaction between having the strategy set to ignore, exiting 1, and also printing to the main log file.

So if we echo the message and exit with 1, we should still be able to find the message from the .command.log file? I'm thinking if that's actually easier.

This is true, the message would still be in the .command.log file. Generally, I find it easier for end users to identify which samples failed through the main log file rather than having to go through multiple files to track down the logs for failed samples, which is why I'd lean towards adding an extra process while keeping the messages with the direct paths to failed sample work dirs in the main log.

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

Yeah, I'm fine with different error strategy handling, but for hardware issues like this, subsequent samples will likely be sent to the same node and fail. Of course, we could just document this and let users know it can happen if we decide to go this way.

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

How about we put job monitoring into the same process that submits the job? The code chunk will be huge, but at least it should error out in the same place.
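A rough sketch of that combined submit-and-monitor idea; the process name, command line, and flags are illustrative only, not the actual pipeline code:

// Hypothetical combined submit-and-monitor process
process submit_and_monitor_metapipeline {
    cpus 1   // this CPU stays occupied until the sample's pipeline finishes

    input:
    tuple val(sample_id), path(input_csv)

    script:
    """
    set +e
    nextflow run metapipeline-DNA.nf --input ${input_csv}
    exit_code=\$?
    set -e
    if [ \${exit_code} -ne 0 ]; then
        echo "Sample ${sample_id} failed with exit code \${exit_code}."
        exit \${exit_code}
    fi
    """
}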

@yashpatel6
Collaborator

How about we put job monitoring into the same process that submits the job? The code chunk will be huge, but at least it should error out in the same place.

We could do that, but the issue there becomes a limitation on how many submission processes can run. If we use an F2 node, we would end up with only 2 submissions at a time, since the monitoring would keep that process, and the CPU allocated to it, occupied until the job completes or fails and the status gets reported.

Yeah, I'm fine with different error strategy handling, but for hardware issues like this, subsequent samples will likely be sent to the same node and fail. Of course, we could just document this and let users know it can happen if we decide to go this way.

That is a possibility, but from the metapipeline's perspective, that's not something it can deal with without extensive checks/functions put in place. Conceptually, it also shouldn't be up to the metapipeline to detect every possible external failure cause and handle each one individually.

From a more general view, we explicitly set the metapipeline up so that even if some samples fail, the rest continue. The causes of those failures can vary (hardware issue, a sample too large so it fails, etc.), but when a user runs many samples with the metapipeline, it's preferable to report occasional failures while letting the other samples complete rather than cutting off all samples when just one fails.
