Individual pipeline failure not propagated to the leading job #119

Open

zhuchcn opened this issue Jun 29, 2023 · 9 comments

@zhuchcn
Member

zhuchcn commented Jun 29, 2023

Describe the issue

@smahesh12 submitted 3 samples in a multi-sample run; two of them failed because scratch was detached, but the stdout from the leading job says all three samples were successful. I think this is also because of the job limiter, so individual pipeline failures were not propagated to the leading job. Maybe we can just raise the error here, after this line? Maybe just add exit 1?

echo "Process in ${work_dir} failed with non-zero exit code."

@yashpatel6
Collaborator

yashpatel6 commented Jun 29, 2023

The main log file should have indicated that the samples failed, with that message on the line you've highlighted; was that not the case?

@yashpatel6
Collaborator

I did try having exit 1 at that point with errorStrategy set to ignore, but that results in the message indicating which sample/work_dir failed not being printed; instead, Nextflow only reports that the status check process failed. That makes it difficult to identify which specific sample/work_dir failed, so I left the reporting as a message printed to the main log file without letting the status check process fail.
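(For context, the ignore setting mentioned here is the standard Nextflow errorStrategy directive, roughly along these lines in the config; the exact selector in the pipeline may differ:)

// nextflow.config sketch; the selector name may differ in the actual pipeline
process {
    withName: 'check_process_status' {
        errorStrategy = 'ignore'   // failed tasks are reported but the run continues
    }
}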

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

Oh, I do see the log message. But the final status is success, and so is the status in the trace txt file, which is also a little confusing.

executor >  local (10)
[58/1fa394] process > create_CSV_metapipeline_DNA... [100%] 3 of 3 ✔
[00/27fc23] process > create_config_metapipeline_DNA [100%] 1 of 1 ✔
[34/55b72f] process > call_metapipeline_DNA (3)      [100%] 3 of 3 ✔
[7b/395dd3] process > check_process_status (3)       [100%] 3 of 3 ✔
Process in /hot/project/process/MissingPeptides/MISP-000132-MissingPeptidesPanCanP1/CPTAC_LUAD/leading_work_dir/34/55b72f435df118276d7b50c2824553 failed with non-zero exit code.


Completed at: 27-Jun-2023 01:54:06
Duration    : 3d 7h 36m 56s
CPU hours   : 122.5
Succeeded   : 10

If we print the message but also exit with 1, will the printed message be hidden?

@yashpatel6
Collaborator

If we print the message but also exit with 1, will the printed message be hidden?

Yeah, this was the issue I observed: if the process exits 1, then the echo statements end up not getting printed to the log file. We could potentially add another process to handle reporting the failure through exit 1, so we'd have both the message printed by the current process and the failure indicated explicitly by the other process.
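A rough sketch of that extra-process idea; the process name, input wiring, and script below are hypothetical, not existing pipeline code:

// Hypothetical DSL2 process: it receives the collected list of failed
// work directories and fails itself, so the overall run status reflects
// the failure while check_process_status keeps printing the messages.
process report_failed_samples {
    input:
    val failed_dirs

    script:
    """
    if [ -n "${failed_dirs.join(' ')}" ]; then
        echo "Failed work directories: ${failed_dirs.join(' ')}"
        exit 1
    fi
    """
}

With something along these lines, the per-sample messages stay in the main log from the current process, and the run-level failure comes from the new process.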

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

I'm wondering whether it's worth fixing, but setting errorStrategy to ignore seems problematic because all subsequent samples will continue.

So if we echo the message and exit with 1, we should still be able to find the message from the .command.log file? I'm thinking if that's actually easier.

@yashpatel6
Collaborator

I'm wondering whether it's worth fixing, but setting errorStrategy to ignore seems problematic because all subsequent samples will continue.

This is the behavior we would want though: even if some samples fail, the rest of them should continue. The issue comes from the interaction between having the strategy set to ignore, exiting 1, and also printing to the main log file.

So if we echo the message and exit with 1, we should still be able to find the message from the .command.log file? I'm thinking if that's actually easier.

This is true, the message would still be in the .command.log file. Generally, I find it easier for end users to identify which samples failed through the main log file rather than having to go through multiple files to track down the logs for failed samples, which is why I'd lean towards adding an extra process while keeping the messages with the direct paths to failed sample work dirs in the main log.

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

Yeah, I'm fine with different error strategy handling, but for hardware issues like this, subsequent samples will likely be sent to the same node and fail. Of course, we could just document this and let users know it can happen if we decide to go this way.

@zhuchcn
Member Author

zhuchcn commented Jun 30, 2023

How about we put job monitoring into the same process that submits the job? The code chunk will be huge, but at least it should error out in the same place.
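A rough sketch of that combined submit-and-monitor idea; the process name, command line, and flags are illustrative only, not the actual pipeline code:

// Hypothetical combined submit-and-monitor process
process submit_and_monitor_metapipeline {
    cpus 1   // this CPU stays occupied until the sample's pipeline finishes

    input:
    tuple val(sample_id), path(input_csv)

    script:
    """
    set +e
    nextflow run metapipeline-DNA.nf --input ${input_csv}
    exit_code=\$?
    set -e
    if [ \${exit_code} -ne 0 ]; then
        echo "Sample ${sample_id} failed with exit code \${exit_code}."
        exit \${exit_code}
    fi
    """
}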

@yashpatel6
Collaborator

How about we put job monitoring into the same process that submits the job? The code chunk will be huge, but at least it should error out in the same place.

We could do that, but the issue there becomes a limitation on how many submission processes can run. If we use an F2 node, we would end up with only 2 submissions at a time, since the monitoring would keep that process, and the CPU allocated to it, occupied until the job completes or fails and the status gets reported.

Yeah, I'm fine with different error strategy handling, but for hardware issues like this, subsequent samples will likely be sent to the same node and fail. Of course, we could just document this and let users know it can happen if we decide to go this way.

That is a possibility, but from the metapipeline's perspective, that's not something it can deal with without extensive checks/functions put in place. Conceptually, it also shouldn't be up to the metapipeline to detect every possible external failure cause and handle each one individually.

From a more general view, we explicitly set the metapipeline up so that even if some samples fail, the rest continue. The causes of those failures can vary (hardware issue, a sample too large so it fails, etc.), but when a user runs many samples with the metapipeline, it's preferable to report occasional failures while letting the other samples complete rather than cutting off all samples when just one fails.
