BF: job_templates: Call tar with --ignore-failed-read #451

Closed

Contributor

@kyleam kyleam commented Aug 16, 2019

After a command completes, it writes to "status.$subjob". If, after
completing its command, a subjob sees that the status files for all
the other subjobs are in, it claims responsibility for the
post-processing step. For the datalad-run orchestrators,
post-processing includes calling find to get a list of newly added
files and then calling tar with these files as input.
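
The hand-off described above can be sketched roughly as follows. This is only an illustration, not the actual job template: the helper names (mark_done, all_done), the metadata directory, and the sequential loop standing in for concurrent subjobs are all assumptions, and the sketch makes no attempt to handle two subjobs finishing at the same instant.

```shell
#!/bin/sh
# Hypothetical layout: each subjob writes status.<index> into a shared
# metadata directory once its command has exited.

# Record the exit status of a finished subjob.
#   $1: metadata directory, $2: subjob index, $3: exit status of the command
mark_done () {
    printf '%s\n' "$3" >"$1/status.$2"
}

# Succeed only if a status file exists for every subjob 0..(n-1).
#   $1: metadata directory, $2: total number of subjobs
all_done () {
    i=0
    while [ "$i" -lt "$2" ]; do
        [ -f "$1/status.$i" ] || return 1
        i=$((i + 1))
    done
    return 0
}

# Demo: three subjobs finish one after another; whichever one sees all
# status files in place claims the post-processing step.
meta_dir=$(mktemp -d)
nsubjobs=3
claimed_by=""
for subjob in 0 1 2; do
    mark_done "$meta_dir" "$subjob" 0
    if all_done "$meta_dir" "$nsubjobs"; then
        claimed_by=$subjob
        echo "subjob $subjob claims post-processing"
    fi
done
rm -rf "$meta_dir"
```

In this sequential demo only the last subjob to finish sees the full set of status files, so it is the one that goes on to the find and tar steps.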

Given that the above procedure waits until each command exits, the
hope is that all the output files are created and any temporary files
will have been cleaned up. But we're running into cases [*] where
intermediate files are apparently present for the find call but gone
by the time tar is called. This leads to tar exiting with a
non-zero status and the post-processing being aborted.

Until someone has a better idea of how to deal with this, instruct
tar to exit with zero even if an expected file isn't present. This
allows post-processing to succeed and the incident will still show up
in the captured stderr.

[*] #438 (comment)
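
To make the failure mode and the proposed workaround concrete, here is a small reproduction sketch. It assumes GNU tar and a POSIX find; the directory layout, the file names, and the use of -T to pass the file list are made up for the demo and are not taken from the job template.

```shell
#!/bin/sh
# Assumes GNU tar. All paths below are hypothetical.
work=$(mktemp -d)
mkdir "$work/out"
echo result >"$work/out/result.txt"
echo scratch >"$work/out/scratch.tmp"

# Step 1: collect the list of files, as the post-processing step does with find.
(cd "$work" && find out -type f) >"$work/filelist"

# Simulate a temporary file disappearing between the find and tar calls.
rm "$work/out/scratch.tmp"

# Step 2a: plain tar reports the missing file and exits with a non-zero status.
tar -C "$work" -cf "$work/strict.tar" -T "$work/filelist" 2>"$work/strict.err"
rc_strict=$?

# Step 2b: with --ignore-failed-read the missing file is reported as a
# warning on stderr and tar still exits with zero.
tar --ignore-failed-read -C "$work" -cf "$work/lenient.tar" \
    -T "$work/filelist" 2>"$work/lenient.err"
rc_lenient=$?

echo "strict exit: $rc_strict, lenient exit: $rc_lenient"
echo "captured stderr from the lenient run:"
cat "$work/lenient.err"
```

The surviving file still ends up in the archive, and the warning about the vanished one remains available in the captured stderr.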

@yarikoptic
Member

I kept thinking about it and so far treat it (failure) as a "feature". We aren't catching the complete shutdown of the underlying process or some filesystem effect.
In the particular failure that triggered this patch, I should have added an explicit (temporary) work directory for mriqc anyway and put it in .gitignore, since those files aren't really meant to be committed. Alternatively, I should have kept them in a subdataset (although nipype would probably freak out whenever it saw symlinks upon rerun).
I am still picking at this to see if maybe we could make use of fuser or something like that, which would tell us that there are still processes interested in the path.
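
A rough sketch of what such a check could look like, assuming a Linux host with fuser (from psmisc) on the PATH. The function name path_in_use and the demo directory are hypothetical, and a check like this only sees processes on the same host, so it would not detect filesystem-level effects.

```shell
#!/bin/sh
# path_in_use DIR
#   returns 0 if some process has a file under DIR open,
#           1 if no file under DIR is in use,
#           2 if fuser is not available.
path_in_use () {
    command -v fuser >/dev/null 2>&1 || return 2
    # fuser exits with zero when the given file is being accessed by
    # at least one process; print the names of such files.
    busy=$(find "$1" -type f -exec sh -c '
        for f in "$@"; do
            fuser "$f" >/dev/null 2>&1 && printf "%s\n" "$f"
        done' sh {} +)
    [ -n "$busy" ]
}

# Demo: hold a file open on descriptor 3, check, then close it and check again.
demo_dir=$(mktemp -d)
exec 3>"$demo_dir/output.log"
path_in_use "$demo_dir"
rc_busy=$?
exec 3>&-
path_in_use "$demo_dir"
rc_idle=$?
echo "while open: $rc_busy, after close: $rc_idle"
rm -rf "$demo_dir"
```

Post-processing could, for example, wait and retry while such a check reports the output directory as busy, rather than proceeding straight to find and tar.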

@kyleam
Contributor Author

kyleam commented Aug 23, 2019

> I kept thinking about it and so far treat it (failure) as a "feature".

I'm not really seeing it. Sure, it's useful for us to know there's something funky going on. But until we have a clear understanding of the issue and how to fix it, it seems unnecessary to make the post-processing completely fail because a file (very likely an uninteresting and temporary one) was removed between the find and tar calls.

But ok, let's hold off on this, and revisit it if we run into it in other scenarios.

@kyleam
Contributor Author

kyleam commented Oct 10, 2019

I'll close this for now. We can resurrect it if desired at a later point.
