Add SLURM support for multi-node tests #65

Draft: roclark wants to merge 2 commits into main from slurm-support

Conversation

roclark
Member

@roclark roclark commented Apr 5, 2021

To make it easier to run on large clusters, Bobber should be able to run on SLURM clusters with Pyxis and Enroot installed. This would remove the need for mpirun and for SSH keys/daemons inside the containers, so tests can be run without copying images between nodes or synchronizing SSH keys.
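As a rough illustration of the idea, and not the code in this PR, the sketch below assembles an `srun` command line that starts a test inside a container through Pyxis. `--container-image` and `--container-mounts` are Pyxis options for `srun`; the function name `build_srun_command`, the image name, the mount path, and the test script path are all hypothetical placeholders.

```python
import shlex


def build_srun_command(image, nodes, tasks_per_node, test_command, mounts=None):
    """Return an srun argument list that runs test_command in a Pyxis container.

    Hypothetical helper for illustration only; it is not part of Bobber.
    """
    cmd = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
        # Pyxis pulls and unpacks the image on each node via Enroot,
        # so the image does not need to be copied between nodes by hand.
        f"--container-image={image}",
    ]
    if mounts:
        # Pyxis expects a comma-separated list of SRC:DST mounts.
        cmd.append("--container-mounts=" + ",".join(mounts))
    cmd.extend(test_command)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders, not real Bobber settings.
    command = build_srun_command(
        image="example/bobber-image:latest",
        nodes=2,
        tasks_per_node=8,
        test_command=["/tests/run_test.sh"],
        mounts=["/mnt/storage:/mnt/storage"],
    )
    print(shlex.join(command))
```

Because SLURM starts the tasks on every node itself, nothing in this flow relies on an SSH daemon or shared keys inside the container.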

Closes #1

Signed-Off-By: Robert Clark [email protected]

@roclark roclark added the enhancement and slurm labels Apr 5, 2021
@roclark roclark requested review from joehandzik and fredvx April 5, 2021 18:52
@roclark roclark self-assigned this Apr 5, 2021
Member Author

roclark commented Apr 5, 2021

This is currently a draft based on the ongoing discussion in #1. At this point, the NCCL tests should be fully functional using the Python wheel. As I see it, the following items are still required:

  • Add DALI tests
  • Add FIO tests
  • Add mdtest
  • Document the installation and usage
  • Update the troubleshooting guide with steps to fix common issues
  • Get a public image up on NGC

@roclark roclark mentioned this pull request Apr 5, 2021
@roclark roclark force-pushed the slurm-support branch 2 times, most recently from b41f382 to 4866c75 Compare April 5, 2021 19:14
Review comment on bobber/lib/system/slurm.py (outdated, resolved)
@roclark roclark force-pushed the slurm-support branch 10 times, most recently from e1cfc68 to e1125ff Compare April 8, 2021 16:45
To make it easier to run on large clusters, Bobber should be able to run
on SLURM clusters with Pyxis and Enroot installed. This would remove the
need for mpirun and for SSH keys/daemons inside the containers, so tests
can be run without copying images between nodes or synchronizing SSH
keys.

Signed-Off-By: Robert Clark <[email protected]>
When running with Slurm, Bobber can still be used even if Docker is not
installed on the head node where the jobs are launched. In this case,
Docker should be ignored unless a command directly needs the Docker
runtime.
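
One way to express this behavior, shown only as a sketch and not as the change made in this commit, is to check for Docker lazily and only for commands that need it. The function name `require_docker_if_needed`, the command names, and the set of Docker-dependent commands are hypothetical; `shutil.which` is the standard-library lookup for an executable on `PATH`.

```python
import shutil


def require_docker_if_needed(command, docker_commands, which=shutil.which):
    """Return True if command needs Docker and Docker is present.

    Returns False when the command does not need Docker at all, so a head
    node without Docker can still launch Slurm jobs. Raises RuntimeError
    only when a Docker-dependent command is run without Docker installed.
    Hypothetical helper for illustration only.
    """
    if command not in docker_commands:
        # Docker is irrelevant for this command; do not even look for it.
        return False
    if which("docker") is None:
        raise RuntimeError(
            f"'{command}' requires the Docker runtime, but 'docker' was not found on PATH"
        )
    return True


if __name__ == "__main__":
    # Placeholder command names, assumed for this example.
    needs_docker = {"build", "export"}
    print(require_docker_if_needed("run-slurm", needs_docker))
```

Passing `which` as a parameter keeps the check easy to exercise on machines with or without Docker.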

Signed-Off-By: Robert Clark <[email protected]>
Labels
enhancement New feature or request slurm Any items related to running tests with SLURM
Development

Successfully merging this pull request may close these issues.

Update mechanism for synchronizing SSH keys
2 participants