This repository has been archived by the owner on Oct 10, 2019. It is now read-only.
forked from prelz/BLAH
Feature/slurm no shared filesystem #59
Open
PerilousApricot wants to merge 3 commits into opensciencegrid:v1_18_bosco from PerilousApricot:feature/slurm-no-shared-filesystem
Check for the presence of a stageout hook. If found, execute it. The hook has to be executed at this point in the script because it falls between when the process completes and when the signal traps fire on exit. The job-specific plugins don't have a way to inject code into that particular interval.
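A minimal sketch of the ordering described above, not the code from this PR. The names `blah_stageout_hook`, `run_stageout_hook_if_present`, and `run_job_then_stageout` are hypothetical; the only point is that the hook runs after the payload returns and before the script exits, which is when an `EXIT` trap would fire.

```shell
#!/bin/sh
# Hypothetical helper names; BLAH's real function names may differ.

# Run the hook only if a job-specific plugin has defined it.
run_stageout_hook_if_present() {
    # "command -v" succeeds for shell functions as well as executables.
    if command -v blah_stageout_hook >/dev/null 2>&1; then
        blah_stageout_hook "$@"
    fi
}

# Run the payload, then the hook, then hand back the payload's status.
run_job_then_stageout() {
    job_status=0
    "$@" || job_status=$?
    # Stage out after the payload finishes but before the script exits,
    # so the files are still present when any EXIT trap cleans up.
    run_stageout_hook_if_present "$job_status"
    return "$job_status"
}
```

A wrapper would call something like `run_job_then_stageout ./payload arg1` as its last step and exit with the returned status. In this sketch the hook's own exit status is ignored; whether a failed stageout should fail the job is a design choice left open here.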
This is (obviously) in serious need of a cleanup. Hardcoded values need to be removed and made configurable, the authentication should change from IP-based to shared-secret, etc. I currently have it set to start automatically when the condor-ce service starts, but it probably needs to be made conditional via some sort of ExecPre or similar. Anyway, this is what's working locally.
Instead of assuming all filesystems are shared, assume no filesystems are shared and move files via curl. The input files are moved inline, and a hook function transfers the output files once the job completes (the hook is called in blah_common_submit_function.sh). This also needs to be configurable instead of hardcoded and could generally use a cleanup, but it is what we're running at Vanderbilt.
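Roughly, the curl-based transfer could look like the sketch below. This is an illustration under assumptions, not the script this PR generates: the URL, the variables `TRANSFER_URL`, `CURL`, and `OUTPUT_FILES`, the function names, and the assumption that the file server accepts HTTP PUT uploads are all hypothetical.

```shell
#!/bin/sh
# Hypothetical endpoint; the PR notes the real values are currently hardcoded.
TRANSFER_URL="${TRANSFER_URL:-http://ce.example.org:8080/sandbox}"
# Overridable so the transfer command can be swapped out or stubbed.
CURL="${CURL:-curl}"

# Download one input file from the CE into the current directory.
fetch_input() {
    # --fail makes curl return non-zero on HTTP error responses.
    "$CURL" --fail --silent --show-error -o "$1" "$TRANSFER_URL/$1"
}

# Upload one output file back to the CE (-T is curl's upload-file option).
push_output() {
    "$CURL" --fail --silent --show-error -T "$1" "$TRANSFER_URL/$1"
}

# Stage-out hook: upload every listed output file that actually exists.
# OUTPUT_FILES is a space-separated list, so names with spaces are not handled.
blah_stageout_hook() {
    hook_status=0
    for f in $OUTPUT_FILES; do
        if [ -e "$f" ]; then
            push_output "$f" || hook_status=1
        fi
    done
    return "$hook_status"
}
```

In a generated submit script, the `fetch_input` calls would sit inline ahead of the payload, and `blah_stageout_hook` would be left for the common wrapper to call once the payload has finished.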
This is an output submission script generated by this code:
Ah, I thought this was somehow going to be using built-in slurm file transfer. This is a little scarier so I'll review when I have more time.
Yeah, unfortunately SLURM doesn't have its own inbuilt file transfer mechanism :/
These are the changes needed to support configurations where the CEs and worker nodes don't share a filesystem. There are two parts:
Unfortunately, SLURM doesn't have native functionality to move files between the submission node and the execution node (even though it's a spiritual successor to PBS, which does), so there are two choices, each with its own downsides:
I went with 1 because it seemed less difficult than trying to sort out the authentication hassle for 2.
There's still work that needs to be done to generalize it (see the commit messages), but it's what we've been using at Vanderbilt for a few months now.