Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial sketch for the mriqc/fmriprep singularity based workflow #438

Draft
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jul 18, 2019

An initial local attempt was slowed down by WTF of ReproNim/containers#23

Preprint which inspired the fmriprep workflow here is http://dx.doi.org/10.1101/694364

Edit: an example of prototypical use of reproman for mriqc is now formalized in README: https://github.com/ReproNim/reproman#step-2-create-analysis-datalad-dataset-and-run-computation-on-aws-hpc2 . I will keep this PR open until we furnish a full workflow which would also include fmriprep (thus handling secret license for matlab) and then some linear modeling

@codecov
Copy link

codecov bot commented Jul 18, 2019

Codecov Report

Attention: Patch coverage is 92.65537% with 13 lines in your changes missing coverage. Please review.

Project coverage is 89.66%. Comparing base (b6528a1) to head (52c19fc).
Report is 231 commits behind head on master.

Files with missing lines Patch % Lines
reproman/support/exceptions.py 45.45% 6 Missing ⚠️
reproman/support/jobs/orchestrators.py 88.88% 5 Missing ⚠️
reproman/interface/jobs.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #438      +/-   ##
==========================================
+ Coverage   89.47%   89.66%   +0.18%     
==========================================
  Files         148      148              
  Lines       12290    12389      +99     
==========================================
+ Hits        10997    11109     +112     
+ Misses       1293     1280      -13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


# Sample run without any parallelization, and doing both levels (participant and group)
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
--jp container=containers/bids-mriqc data/bids data/mriqc participant,group
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the original, datalad command would be

datalad containers-run -n containers/bids-mriqc data/bids data/mriqc participant,group

TODO : add inputs/outputs specification

# - datalad-container

RM_RESOURCE=smaug
RM_SUB=condor
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so for local execution it could be

RM_RESOURCE=localshell
RM_SUB=local

reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
--bp 'thing=thing-*' \
--input '{p[thing]}' \
sh -c 'cat {p[thing]} {p[thing]} >doubled-{p[thing]}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest push to run-subjobs (ac14277) checked out, try

reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
  --jp container=containers/bids-mriqc \
  --bp 'pl=02,13' \
  --input data/bids \
  data/bids data/mriqc participant --participant_label '{p[pl]}'

I was able to get that [*] to successfully run via condor on smaug. As you've already experienced, the management of existing datasets is a bit rough, so you may want to use a fresh dataset.

[*] Or more specifically, this script:

script
#!/bin/sh
set -eu

cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids

mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF

Freesurfer
----------

Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.

EOF
datalad save -m "DOC: licenses/ directory stub" licenses/

datalad create -d . data/mriqc

reproman run --resource sm --follow \
         --sub condor --orc datalad-pair-run \
         --jp container=containers/bids-mriqc --bp 'pl=02,13' \
         -i data/bids \
         data/bids data/mriqc participant --participant_label '{p[pl]}'

Unfortunately initial run has failed with

	2019-08-15 14:32:13,311 [INFO   ] Waiting on job 1848: running
	2019-08-15 14:32:23,478 [INFO   ] Fetching results for 20190815-142046-33ea
	2019-08-15 14:35:51,720 [INFO   ] Creating run commit in /home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7
	2019-08-15 14:36:06,509 [INFO   ] Unregistered job 20190815-142046-33ea
	+ reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group
	+ reproman run --follow -r smaug --sub condor --orc datalad-pair-run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group
	2019-08-15 14:36:10,588 [INFO   ] No root directory supplied for smaug; using "/home/yoh/.reproman/run-root"
	[INFO   ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7/data/mriqc> to smaug
	ECDSA host key for IP address "129.170.233.9" not in list of known hosts.
	[INFO   ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7> to smaug
	[ERROR  ] failed to push to smaug: master -> smaug/master [rejected] (non-fast-forward); pushed: ["d145d97..97de059"] [publish(/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7)]
	2019-08-15 14:36:59,238 [ERROR  ] "datalad publish" failed. Try running "datalad update -s smaug --merge --recursive" first [orchestrators.py:prepare_remote:792] (OrchestratorError)
	CONTAINERS_REPO=~/proj/repronim/containers INPUT_DATASET_REPO=    70.57s user 22.71s system 9% cpu 16:50.41 total

and stderr.1 on remote end showed that tar failed to find some output file:

    $> tail -n 3 stderr.1
    tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/_0x9713a172faade86794f9c56a3080a44e_unfinished.json: Cannot stat: No such file or directory
    tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/error.svg: Cannot stat: No such file or directory
    tar: Exiting with failure status due to previous errors
@yarikoptic
Copy link
Member Author

@kyleam I have reran the script and got the same failure due to tar -- could you please confirm that you get the same?

@kyleam
Copy link
Contributor

kyleam commented Aug 16, 2019 via email

@kyleam
Copy link
Contributor

kyleam commented Aug 16, 2019

could you please confirm that you get the same?

Sure, I'll give it a try this afternoon.

I've triggered it, though I don't yet have an explanation of what's going on.

kyleam added a commit to kyleam/niceman that referenced this pull request Aug 16, 2019
After a command completes, it writes to "status.$subjob".  If, after
completing its command, a subjob sees that the status files for all
the other subjobs are in, it claims responsibility for the
post-processing step.  For the datalad-run orchestrators,
post-processing includes calling `find` to get a list of newly added
files and then calling `tar` with these files as input.

Given that the above procedure waits until each command exits, the
hope is that all the output files are created and any temporary files
will have been cleaned up.  But we're hitting into cases [*] where
apparently intermediate files are present for the `find` call but gone
by the time `tar` is called.  This leads to `tar` exiting with a
non-zero status and the post-processing being aborted.

Until someone has a better idea of how to deal with this, instruct
`tar` to exit with zero even if an expected file isn't present.  This
allows post-processing to succeed and the incident will still show up
in the captured stderr.

[*] ReproNim#438 (comment)
@kyleam
Copy link
Contributor

kyleam commented Aug 16, 2019

I've triggered it, though I don't yet have an explanation of what's going on.

Hmm, with several attempts, I was able to trigger the failure only once. Looking at the successful runs, the togethome file does not include the files that tar is complaining about in the failed runs. The only explanation I have for that is that these are temporary files that, based on the timing of things, might end getting removed between the find ... >togethome call and the tar call. I've submitted gh-451 as a workaround.

script

For completeness: In all the above tries, I was using this script, which is a stripped-down version of 5b95ded:docs/usecases/bids-fmriprep-workflow-NP.sh.

set -eu

cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids

mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF

Freesurfer
----------

Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.

EOF
datalad save -m "DOC: licenses/ directory stub" licenses/

datalad create -d . -c text2git data/mriqc

reproman run --resource sm --follow \
         --sub condor --orc datalad-pair-run \
         --jp container=containers/bids-mriqc --bp 'pl=02,13' \
         -i data/bids -o data/mriqc \
         '{inputs}' '{outputs}' participant --participant_label '{p[pl]}'

@yarikoptic
Copy link
Member Author

The only explanation I have for that is that these are temporary files that, based on the timing of things, might end getting removed between the find ... >togethome call and the tar call.

indeed probably with NFS etc we could see even more of such usecases. But I am still wondering what is exactly happening here besides may be condor returning "complete" job status before it (and all kids) actually finished, and either we wouldn't miss some results if we rush into collecting/tarring them up. May be adding some fuser call to check if any process is still holding on that output path or alike. I will try to look into it when I get a moment.

@kyleam
Copy link
Contributor

kyleam commented Aug 20, 2019

indeed probably with NFS etc we could see even more of such usecases. But I am still wondering what is exactly happening here

I am still wondering too :)

besides may be condor returning "complete" job status before it (and all kids) actually finished

This status isn't coming from condor. Its creation is chained after the run of the command:

/bin/sh -c "$cmd" && \
echo "succeeded" >"$metadir/status.$subjob" || \
(echo "failed: $?" >"$metadir/status.$subjob";
mkdir -p "$metadir/failed" && touch "$metadir/failed/$subjob")

@yarikoptic
Copy link
Member Author

Ran current state (v0.2.1-43-ga81e457) on smaug as

FS_LICENSE=~/.freesurfer-license RUNNER=datalad ./bids-fmriprep-workflow-NP.sh bids-fmriprep-workflow-NP/out4

and it finished nicely (although there was a puke from fmriprep's atexit about "OSError: handle is closed" which I guess was just ignored), and everything looked kosher BUT "data/fmriprep//fmriprep/dataset_description.json" was added to git-annex not git! It is strange. The whole "data/fmriprep" was initiated with -c txt2git and it seemed to work -- .tsv etc were committed to git but all jsons to annex. It reminded me about https://git-annex.branchable.com/bugs/manages_to_incorrectly_add_to_annex_instead_of_git_based_on___34__mimetype__34___-_we_cannot_figure_it_out_why/ which Joey "done, the magic database changing behavior is not a bug in git-annex". I naively had wrong impression that this issue was actually fixed somehow properly.
Well -- on libmagic side -k was indeed fixed around then, so it is possible to obtain multiple mimetypes, and the 2nd one for .json is text:

$> file --mime-type -Lk 1.json
1.json: application/json\012- text/plain

@yarikoptic
Copy link
Member Author

oh hoh -- that is a good old datalad/datalad#3361 which we didn't really address although git annex provided a way already :-/

Copy link
Member Author

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment that it was a workaround

@@ -222,6 +222,7 @@ containers/scripts/freeze_versions --save-dataset=^ \
poldracklab-ds003-example=0.0.3 \
bids-mriqc=0.15.0 \
bids-fmriprep=1.4.1
( cd containers/ ; git clean -dfx ; )
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is a temporary workaround to remove .datalad/config-e created by above freeze_versions script. Proper solution sounds fix that script to stay compatible with osx.

kyleam added a commit that referenced this pull request May 7, 2020
kyleam pushed a commit to kyleam/niceman that referenced this pull request May 25, 2020
When running the usecase example ReproNimgh-438, the working tree is coming up
as unexpectedly dirty, including the change of status files from empty
to "succeed", apparently indicating that the content of the file is
not flushed at the time of the 'datalad add/save' call.  To prevent
saving before the content is flushed, don't count a status file unless
it has the content.

Modified-by: Kyle Meyer <[email protected]>
  Add commit message body, and dropped `find` call, as suggested by
  @yarikoptic.
yarikoptic and others added 2 commits May 25, 2020 19:16
* origin/master:
  BF: runscript - only consider statuses for success/failure
  Fix runscript regexp to work on Mac OS

Conflicts:
	reproman/support/jobs/job_templates/runscript/base.template.sh - took master version since it incorporated Chris' fixup for OSX
Copy link
Member Author

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

chaselgrove and others added 5 commits May 27, 2020 10:27
Mac sed doesn't handle tabs as linux does, so we start by converting the
TSV file to CSV, then process from there.
For some reason otherwise it immediately fails on smaug if I do not do that

...
    Installing setuptools, pkg_resources, pip, wheel...done.
    > source venv3/bin/activate
    >> deactivate nondestructive
    >> unset -f pydoc
    >> '[' -z '' ']'
    >> '[' -z '' ']'
    >> '[' -n /bin/bash ']'
    >> hash -r
    >> '[' -z '' ']'
    >> unset VIRTUAL_ENV
    >> '[' '!' nondestructive = nondestructive ']'
    >> VIRTUAL_ENV=/home/yoh/.tmp/rm-TIAZarc/venv3
    >> export VIRTUAL_ENV
    >> _OLD_VIRTUAL_PATH=/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/
    usr/local/sbin
    >> PATH=/home/yoh/.tmp/rm-TIAZarc/venv3/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
    >> export PATH
    >> '[' -z '' ']'
    >> '[' -z '' ']'
    venv3/bin/activate: line 57: PS1: unbound variable
    >>> echo Finished for setup=pip under PWD=/home/yoh/.tmp/rm-TIAZarc
    Finished for setup=pip under PWD=/home/yoh/.tmp/rm-TIAZarc
locally had 0.12.6 and it might be interfering
…) into doc-usecases

* origin/master: (21 commits)
  MNT: setup.py: Add optional DataLad dependency
  ENH: Raise a dedicated JobError if any subjob fails
  ENH: run: Specify constraints for --sub and --orc
  ENH: Submitter.follow - a smart --follow sleeping/reporting
  ENH: runcsript - print "Waiting" once and pool each second
  ENH: runscript - remove no longer necessary and anyways ineffective sleep 1
  ENH: skip with a warning a job which fails to render due to a missing key
  BF: orchestrators: Call create-sibling unconditionally
  ENH: orchestrators: Provide access to datalad-publish results
  ENH: orchestrators(datalad-pair): Do a more targeted update
  ENH: orchestrators: Make head_at() helper abort on submodule changes
  BF: orchestrators: Fetch with --recurse-submodules=no
  BF: orchestrators: Initialize submodules
  ENH: orchestrators: Meld handling of local and SSH sessions
  RF: test_orchestrators: Use DataLad's repo.call_git()
  MNT: orchestrators: Drop warning about an unsupported DataLad version
  RF: orchestrators: Adjust datalad run imports
  RF: orchestrators: Prune now unneeded SSH URL compatibility kludge
  RF: run: Use 'datalad save' instead of 'datalad add'
  RF: orchestrators: Check whether datalad >=0.13 is installed locally
  ...
…ly (not as {inputs})

That should allow to upload all needed licenses to the submitting host
This was referenced May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants