-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EMCEE Batch Submit CLI Tool #394
Open
TimothyWillard
wants to merge
78
commits into
dev
Choose a base branch
from
GH-365/emcee-submit
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Functionality to render jinja2 templates to make creating batch files easier. Also opens up the door to future use of templates like creating config files. Associated docs+tests included.
Added `gempyor.logging` module with `_get_logging_level`, `get_script_logger`, and `ClickHandler` along with corresponding documentation and unit tests.
Added private helper `gempyor.utils._shutil_which` which has a `check` argument that behaves very similar in spirit to the same argument of `subprocess.run`. Corresponding docs/tests included.
Pytest hooks into the root logger, which breaks the use case here where we only want a single handler to emit a log record. Create a special carve out for pytest and put it in `get_script_logger` to make testing logging results of any function easy.
Internal utility to submit a batch file to slurm via `sbatch` along with detailed logging and corresponding unit tests and docs.
Added an internal utility function to get the sha commit corresponding to the head of a given git repository. For use in creating manifest files.
Added `gempyor.manifest.write_manifest` for writing batch manifest JSON files containing metadata about the run. Also has extended functionality to store more metadata then the current required.
The `manifest` module will be too small to have as its own, moved into `batch` and will start consolidating other batch functionality into as well.
Moved the `_slurm` module into the new `batch` module to similar reasons as `manifest`. There is not enough slurm functionality to justify an individual module and slurm interactions will only occur in the context of working with batch jobs.
Added a dataclass to represent batch job sizes, including light validation.
Added a class method to `gempyor.batch.JobSize` class that is slightly more convenient to use from a CLI script. Provides a good home for future code realted to GH-360. Corresponding tests and docs.
In anticipation of GH-358, minimize future work.
Added internal utility to convert a dict of options into a list appropriate for `subprocess.run` and similar.
Added `gempyor.batch._sbatch_template` that is a thin wrapper around `_sbatch` that takes advantage of templating. There are docs, but no unit tests. Still determining how this interface should be best setup.
Draft solution to GH-342 with the ability to extend to other types of generic config/info in the future. Add `gempyor.info` module with exported function `get_cluster_info` which pulls and parses an information yaml into a pydantic model.
Very draft, needs to be better connected to a CLI to fully implement.
Added specification for path modifications to the cluster info (namely to add AWS cli on rockfish to path). Specification includes the path to add, prepend vs append, and a flag to error if path is missing.
Not fleshed out at all, but at least contains the main sections that will need to be filled in. Now includes dynamic modules and $PATH modifications that mirror the objects returned from `get_cluster_info`.
Just rebased `dev` into `GH-365/emcee-submit` which now contains black formatting for python.
Added a helper to log the inputs given to a CLI tool when the verbosity is set to debug. Mostly for user/dev debugging. Also added corresponding docs and unit tests.
Internal utility to generate a human readable unique batch job name along with corresponding docs and tests.
Added internal `gempyor.batch._resolve_batch_system` utility to make sense of batch system CLI options.
Very preliminary work on the `flepimop batch` CLI tool including some light testing for the not implemented errors thrown.
* Changed to generic import of `click` since the `gempyor.batch` module relies on it heavily. * Add manifest json to `flepimop batch`. * Add subpops to the job size calculation of `flepimop batch`. * Leftover TODO note about resume/continuation locations.
Added a representation of batch job time limits, similar to `JobSize`, along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily create instances from user provided inputs. Also added comparison methods for ease of unit testing.
Started work constructing a call to `_sbatch_template` in the `flepimop batch` CLI. Incorporated `JobTimeLimit` class and added additional comments into the `_click_batch` function.
Added the `DurationParamType` for specifying custom durations in a user friendly way for CLI interfaces along with corresponding unit tests. Also incorporated the new custom param type into the `flepimop batch` CLI.
* Changed default from `timedelta` to `str`. Looks like the `default` value is passed to the `convert` method of `type` if instance of `click.ParamType`. * Began work incorporating cluster info into batch submission with slurm.
* Final config file is now parsed out to a file, either user specified or temp. * Further work on the batch script, mostly in translating variables given to `_click_batch` to the template.
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 14, 2024 19:54
6491367
to
821210d
Compare
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 25, 2024 18:24
2ecab37
to
45747e5
Compare
Also added unit tests, but still need to add documentation.
* Fleshed out documentation for the `gempyor.info/shared_cli` modules. * Added the ability to optionally return the converted memory as an int instead of as a float to `MemoryParamType`. * Removed pre-instantiated custom click param types.
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 25, 2024 18:27
45747e5
to
0fa1f9b
Compare
Bug in `MemoryParamType.convert` was failing if a unit was not given with a value. Similar to `DurationParamType` it assumes that numbers without a unit are given in the default unit of the type.
Moved from the slurm specific code path in `_submit_scenario_job`, should allow for using the `--local` flag on the cluster.
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 25, 2024 21:36
d36aca0
to
4f936db
Compare
Made the `ValueError` from `_infer_cluster_from_fqdn` optional, but default to `True` for current behavior.
* Export `$FLEPI_PATH` in the GH action before running tests to run the unit tests for `get_cluster_info`. * Attempt to fix Jinja2 lader issue by falling back to a file system loader if a package loader fails.
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 26, 2024 21:40
02d550e
to
b76bb3b
Compare
To deal with a `DeprecationWarning` that python keeps producing.
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 26, 2024 22:48
445624f
to
6f51a3d
Compare
TimothyWillard
force-pushed
the
GH-365/emcee-submit
branch
from
November 26, 2024 23:00
f74cdd5
to
0149bc7
Compare
While this PR is ready for review I think it's unlikely there will be sufficient time to get this into the upcoming release. If folks have the bandwidth to review that's great, but definitely prioritize the other PRs noted in slack first. |
Thanks Tim - I intended to test this functionality properly when I'm done with RSV too and that might take a while to cover all bases. Will prioritise the others first. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
batch
Relating to batch processing.
enhancement
Request for improvement or addition of new feature(s).
gempyor
Concerns the Python core.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Describe your changes.
This pull request adds a
flepimop submit
subcommand for submitting batch jobs. Still a work in progress, so it only works for EMCEE inference with slurm. A quick example of how using this would look is:Although I will add more documentation/usage guides after some initial refinement.
Running todo list of what's remaining:
gempyor.utils._git_checkout
gempyor.batch._click_submit
, including thinking about initial attempts at integration like testing.gempyor.batch._local/_local_template/_sbatch/_sbatch_template
and add further unit testing (especially for the_local/sbatch_template
functions`).gempyor.batch._submit_scenario_job
.inference_*.j2
templates. Only has to concern a few details since the individual components are well tested._infer_cluster_from_fqdn
togempyor.info
to "guess" the cluster fromsocket.getfqdn()
if not given an explicit one.gempyor.shared_cli.DurationParamType/MemoryParamType
.Does this pull request make any user interface changes? If so please describe.
The user interface changes are the addition of a
flepimop submit
subcommand. This is new and causes no other interfaces changes.These changes have not been documented yet, this is a draft PR for some initial feedback but will add appropriate documentation after some refinement.
What does your pull request address? Tag relevant issues.
This pull request addresses:
gempyor
#310