Individual pipeline failure not propagated to the leading job #119
Comments
The main log file should have indicated that the samples failed, with that message on the line you've highlighted. Was that not the case?
I did try having
Oh, I do see the log message. But the final status is success, and so is the status in the trace txt file, which is also a little confusing.
If we print the message but also set the exit code to 1, will the printed message be hidden?
Yeah, this was the issue I observed: if the process exits 1, the echo statements end up not getting printed to the main log file. We could potentially add another process to handle reporting the failure through exit 1, so that the message is printed by the current process and the failure is indicated explicitly by the other process.
I'm wondering whether it's worth fixing, but it seems to be a problem with setting errorStrategy to ignore, because all subsequent samples will continue. So if we echo the message and exit with 1, we should still be able to find the message in the .command.log file? I'm wondering if that's actually easier.
That is the behavior we would want, though: even if some samples fail, the rest of them should continue. The issue comes from the interaction between having the strategy set to ignore, exiting 1, and also printing to the main log file.
This is true; the message would still be in the .command.log file. Generally, I find it's easier for end users to identify which samples failed through the main log file rather than having to dig through multiple files to find the logs for failed samples, which is why I'd lean towards having an extra process while keeping the messages, with the direct paths to failed sample work dirs, in the main log.
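A rough sketch of the extra-process idea discussed above, with hypothetical process and channel names (this is not the metapipeline's actual code): the reporting process only echoes the failure message and the work-dir path, so it exits 0 and the message reaches the main log, while a second process exits 1 under errorStrategy 'ignore' so the failure is still recorded without stopping the remaining samples.

```nextflow
// Hypothetical sketch of the two-process reporting idea; all names are illustrative.
process report_failed_sample {
    input:
    tuple val(sample_id), val(work_dir)

    output:
    tuple val(sample_id), val(work_dir)

    script:
    """
    echo "Sample ${sample_id} failed; work dir: ${work_dir}"
    """
}

process mark_failed_sample {
    // 'ignore' lets the remaining samples continue even though this task fails
    errorStrategy 'ignore'

    input:
    tuple val(sample_id), val(work_dir)

    script:
    """
    exit 1
    """
}

workflow report_failures {
    take:
    failed_samples   // channel of [sample_id, work_dir] tuples

    main:
    mark_failed_sample(report_failed_sample(failed_samples))
}
```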
Yeah, I'm fine with different error strategy handling, but for hardware issues like this, subsequent samples will likely be sent to the same node and fail. Of course, we could just document it and let users know this can happen if we decide to go this way.
How about we put job monitoring into the same process that submits the job? The code chunk will be huge, but at least it should error out at the same place.
We could do that, but the issue there becomes a limit on how many submission processes can run. If we use an F2 node, we would end up with only 2 submissions at the same time, since the monitoring would keep that process, and the CPU allocated to it, occupied until the job completes or fails and the status gets reported.
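For comparison, a sketch of folding monitoring into the submission process as suggested above (the scheduler commands, script name, and process name here are assumptions, not the metapipeline's actual code); as noted, the drawback is that each polling loop holds its CPU allocation until the submitted job finishes.

```nextflow
process submit_and_monitor {
    // Hypothetical combined submit-and-monitor process; SLURM-style commands assumed
    input:
    tuple val(sample_id), path(pipeline_config)

    script:
    """
    job_id=\$(sbatch --parsable run_pipeline.sh ${pipeline_config})

    # Block until the job leaves the queue
    while squeue -j "\$job_id" | grep -q "\$job_id"; do
        sleep 60
    done

    # Fail this process, and therefore report it, if the job did not complete
    state=\$(sacct -j "\$job_id" --format=State --noheader | head -n 1 | xargs)
    if [ "\$state" != "COMPLETED" ]; then
        echo "Sample ${sample_id}: job \$job_id ended in state \$state"
        exit 1
    fi
    """
}
```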
That is a possibility, but from the metapipeline's perspective, it's not something the metapipeline can deal with without extensive checks/functions put in place. Conceptually, it also shouldn't really be up to the metapipeline to detect every possible external failure cause and handle each one individually. From a more general view, we explicitly set the metapipeline up so that even if some samples fail, the rest continue. The causes of those failures can vary (a hardware issue, a sample so large that it fails, etc.), but if a user runs many samples with the metapipeline, it makes more sense to ignore occasional failures and report them while letting the other samples complete, rather than cutting off all samples when just one fails.
Describe the issue
@smahesh12 submitted 3 multi-sample runs; two of them failed because scratch was detached, but the stdout from the leading job says all three samples were successful. I think this is also because of the job limiter, so individual pipeline failures were not propagated to the leading job. Maybe we can just raise the error here, after this line? Maybe just add exit 1?

metapipeline-DNA/main.nf, line 207 (commit 308abdf)
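As a purely illustrative sketch of that suggestion (the surrounding code at the referenced line isn't shown here, so the process name, variables, and status check are assumptions): after the per-sample status is checked and the message is echoed, the script block would also exit non-zero so the leading job reflects the failure.

```nextflow
// Illustrative only: a minimal stand-in for the status check around the
// referenced line, with the proposed exit 1 added. Names are assumptions.
process check_pipeline_status {
    input:
    tuple val(sample_id), val(status), val(work_dir)

    script:
    """
    echo "Sample ${sample_id} finished with status ${status}; work dir: ${work_dir}"
    if [ "${status}" != "SUCCESS" ]; then
        exit 1   # proposed addition: make the failure visible to the leading job
    fi
    """
}
```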