Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create batch_queue_flux as experimental module #3978

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

dblitt
Copy link
Contributor

@dblitt dblitt commented Nov 12, 2024

Proposed Changes

This PR creates a batch_queue module for the Flux Framework. batch_queue_flux is currently marked as experimental. The submit, wait, and remove functionality are implemented, but there is no support for staging out output files back to the submitter because it seems flux assumes a shared filesystem for this purpose. As a result, vine_factory works using this backend, but more complicated workflows in Makeflow will not work, as they assume output files will be staged back.

I also fixed some out of date documentation in batch_queue.h

Merge Checklist

The following items must be completed before PRs can be merge.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update Update the manual to reflect user-visible changes.
  • Type Labels Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM Mark your PR as ready to merge.

@dblitt dblitt requested a review from tphung3 November 12, 2024 20:56
@tphung3
Copy link
Contributor

tphung3 commented Nov 13, 2024

Ok, this is a good template for further work. I think you know Flux enough by now to implement batch_queue_flux_submit and batch_queue_flux_remove. Don't worry about supporting all features, and instead try to submit basic Linux commands to flux and remove them using above APIs. To test your work you can write a custom test script calling batch_queue_flux_submit and batch_queue_flux_remove.

@dblitt dblitt changed the title Create batch_queue_flux stub Create batch_queue_flux as experimental module Dec 2, 2024
@dblitt
Copy link
Contributor Author

dblitt commented Dec 2, 2024

@tphung3 @dthain I've pushed the implementation for {submit,wait,remove}

batch_job/src/batch_queue_flux.c Outdated Show resolved Hide resolved
batch_job/src/batch_queue_flux.c Show resolved Hide resolved
}
}

// Flux does not support staging files out of the worker environment, so warn for each file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch queue Condor also doesn't stage out files. I wonder why Flux doesn't work with Makeflow then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this requires further investigation. I think it has to do with the location of the working directory the worker node is placed into

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_queue_condor actually does stage out files. In HTCondor, all files created in the sandbox are sent back to the submitter, it isn't necessary to name them individually.

batch_job/src/batch_queue_flux.c Outdated Show resolved Hide resolved
batch_job/src/batch_queue_flux.c Show resolved Hide resolved
batch_job/src/batch_queue_flux.c Show resolved Hide resolved
batch_job/src/batch_queue_flux.c Show resolved Hide resolved
batch_job/src/batch_queue_flux.c Outdated Show resolved Hide resolved
@dthain
Copy link
Member

dthain commented Dec 5, 2024

Per our previous discussion, it is ok to assume that the flux executor makes use of a shared filesystem, and it is not necessary to transfer files back and forth. (That is also the general assumption of the "cluster" module)

@tphung3
Copy link
Contributor

tphung3 commented Dec 5, 2024

I talked with Jim Garlick who's in the Flux dev team, and he told me that they don't have a use case for stage-out yet.

@dthain
Copy link
Member

dthain commented Dec 13, 2024

This is looking pretty good overall, can you please add a suitable paragraph to the makeflow manual describing the module?

@dthain
Copy link
Member

dthain commented Dec 19, 2024

@tphung3 any other concerns?

@tphung3
Copy link
Contributor

tphung3 commented Dec 19, 2024

No, LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants