Add CUDA support to software_layer #212

Closed
ocaisa wanted to merge 35 commits

@ocaisa ocaisa commented Dec 16, 2022

This is part of a logical split of #172 to make it more manageable.

Requires #228

value = 'add_property("arch","gpu")'
cuda_version = 0
for dep in iter(ec_dict["dependencies"]):
# Make CUDA a build dependency only (rpathing saves us from link errors)

This approach saves us from having to load CUDA explicitly to run a CUDA-dependent package. It also lets us write an Lmod hook that blocks loading of the CUDA module unless certain criteria are met (i.e., that the symlinks are unbroken).
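To illustrate the idea of demoting CUDA to a build-only dependency, here is a minimal sketch. It assumes a simplified model in which the easyconfig is a plain dict and each dependency is a `(name, version)` tuple; real EasyBuild easyconfig objects represent dependencies differently, and `make_cuda_build_only` is a hypothetical helper, not the code in this PR.

```python
def make_cuda_build_only(ec_dict):
    """Move any CUDA entry from runtime to build-only dependencies.

    Assumption: ec_dict is a plain dict whose dependency lists hold
    (name, version) tuples. This is a simplified stand-in for a real
    EasyBuild easyconfig.
    """
    runtime = []
    build_only = list(ec_dict.get("builddependencies", []))
    for dep in ec_dict.get("dependencies", []):
        if dep[0] == "CUDA":
            # Needed at build time only; RPATH linking covers run time.
            build_only.append(dep)
        else:
            runtime.append(dep)
    ec_dict["dependencies"] = runtime
    ec_dict["builddependencies"] = build_only
    return ec_dict
```

With this in place, the generated module would no longer load CUDA at run time, which is what makes a guarding Lmod hook on the CUDA module practical.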

):
ec.log.info("[parse hook] Injecting gpu as Lmod arch property and envvar with CUDA version")
key = "modluafooter"
value = 'add_property("arch","gpu")'

This property allows us to guard the loading of any GPU package via an Lmod hook (which can be overridden): unless the compat libraries are installed, you can't load GPU modules.
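As a rough sketch of how the footer injection could be kept idempotent, the snippet below appends the property line and an environment variable to `modluafooter` only if they are not already present. The easyconfig is again modelled as a plain dict, and the variable name `EXAMPLE_CUDA_VERSION` is a placeholder assumption, not the name used in this PR.

```python
GPU_PROPERTY = 'add_property("arch","gpu")'


def add_gpu_footer(ec_dict, cuda_version, envvar="EXAMPLE_CUDA_VERSION"):
    """Append the GPU property and a CUDA version envvar to the Lua footer.

    Assumptions: ec_dict is a plain dict, and `envvar` is a hypothetical
    variable name chosen for illustration only.
    """
    existing = ec_dict.get("modluafooter", "")
    lines = [line for line in existing.splitlines() if line.strip()]
    wanted = (GPU_PROPERTY, 'setenv("%s","%s")' % (envvar, cuda_version))
    for line in wanted:
        if line not in lines:  # skip lines that are already in the footer
            lines.append(line)
    ec_dict["modluafooter"] = "\n".join(lines)
    return ec_dict
```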

@ocaisa ocaisa mentioned this pull request Dec 16, 2022
5 tasks
eb_hooks.py Outdated
target = source.replace("versions", "host_injections")
os.remove(source)
# Using os.symlink requires the existence of the target directory, so we use os.system
os.system("ln %s %s" % (target, source))

We should check that these calls succeed.
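One possible way to get that checking, sketched here rather than taken from the PR: `os.remove` and `os.symlink` both raise `OSError` on failure, so using them directly surfaces errors that an unchecked `os.system` call would silently ignore. Note that `os.symlink` accepts a target that does not exist yet (the link is simply dangling); it is the directory that will contain the link that must exist. `replace_with_symlink` is a hypothetical helper name.

```python
import os


def replace_with_symlink(source, target):
    """Replace the file at `source` with a symlink pointing to `target`.

    Raises OSError if removal or link creation fails, so the caller
    cannot miss a failure.
    """
    if os.path.lexists(source):
        os.remove(source)
    # The link is created at `source`; `target` may be a dangling path.
    os.symlink(target, source)
    # Verify the result instead of assuming success.
    if not os.path.islink(source) or os.readlink(source) != target:
        raise OSError("failed to create symlink %s -> %s" % (source, target))
```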

@ocaisa ocaisa marked this pull request as draft December 16, 2022 14:45

ocaisa commented Dec 16, 2022

This is working, but it requires an additional script to unbreak the symlinks in the CUDA installation (which is being extracted from #172).
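For context, a minimal sketch of how broken symlinks in an installation tree could be detected. This is an illustration only, not the script being extracted from #172, and `find_broken_symlinks` is a hypothetical name.

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of symlinks under `root` whose target is missing."""
    broken = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

An empty result for the CUDA installation directory would be one way to express the "symlinks are unbroken" criterion.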

@ocaisa ocaisa marked this pull request as ready for review December 19, 2022 22:44

ocaisa commented Dec 19, 2022

This requires easybuilders/easybuild-framework#4119; alternatively, install the compat libraries so that the rpath check passes.

else
# The install is pretty fat, you need lots of space for download/unpack/install (~3*5GB), need to do a space check before we proceed
avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}')
if (( ${avail_space} < 16000000 )); then

We should allow this to be overridden.
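A sketch of what an overridable threshold could look like, written in Python for illustration (the script itself is shell). The environment variable `EXAMPLE_CUDA_MIN_SPACE_KB` is an assumed name, not something defined in this PR; the default mirrors the 16000000 (1K blocks) used in the script.

```python
import os
import shutil

DEFAULT_MIN_KB = 16000000  # same threshold as the shell check, in 1K blocks


def required_space_kb(environ=None):
    """Return the space threshold, honouring a hypothetical override envvar."""
    environ = os.environ if environ is None else environ
    raw = environ.get("EXAMPLE_CUDA_MIN_SPACE_KB")  # assumed variable name
    if raw is None:
        return DEFAULT_MIN_KB
    value = int(raw)  # raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("space threshold must be non-negative")
    return value


def has_enough_space(path, environ=None):
    """Compare the space available at `path` with the threshold."""
    avail_kb = shutil.disk_usage(path).free // 1024
    return avail_kb >= required_space_kb(environ)
```

Setting the override to 0 would effectively disable the check, which may be useful for sites that know their storage layout better than the script does.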

# always be in versions instead of host_injections and have symlinks pointing
# to host_injections for everything we're not allowed to ship
# (existence of easybuild subdir implies a successful install)
if [ -d ${cuda_install_dir}/software/CUDA/${install_cuda_version}/easybuild ]; then
@ocaisa ocaisa Dec 19, 2022


The parent dir also needs to be writable.

This will allow us to log where creating directory structures under `host_injections` is breaking down.
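A possible shape for that check, as a sketch: walk up from the requested path to the nearest directory that already exists and report whether it is writable, so a log message can say exactly where directory creation would fail. Both function names are hypothetical.

```python
import os


def nearest_existing_dir(path):
    """Return the closest ancestor of `path` (or `path` itself) that is a directory."""
    path = os.path.abspath(path)
    while not os.path.isdir(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def can_create(path):
    """Return (existing_ancestor, writable) for a directory we want to create."""
    existing = nearest_existing_dir(path)
    # Creating entries needs both write and search permission on the parent.
    return existing, os.access(existing, os.W_OK | os.X_OK)
```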
echo "CUDA software found! No need to install CUDA again, proceed with testing."
else
# The install is pretty fat, you need lots of space for download/unpack/install (~3*5GB), need to do a space check before we proceed
avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}')

This assumes that ${cuda_install_dir} exists, which it may not.


In general, this space check is a bit clumsy: it assumes everything happens in one location, but there are actually three (source, build, install), each of which takes ~5GB.
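One way the check could account for all three locations, sketched under assumptions: each location needs roughly 5GB (taken here as 5000000 1K blocks), a path that does not exist yet is resolved to its nearest existing ancestor, and requirements are summed per filesystem so that locations sharing a device are counted together. The function names and the `locations` mapping are hypothetical.

```python
import os
import shutil


def nearest_existing_dir(path):
    """Return the closest ancestor of `path` (or `path` itself) that is a directory."""
    path = os.path.abspath(path)
    while not os.path.isdir(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return path


def check_space(locations, per_location_kb=5000000):
    """Map each device id to (needed_kb, avail_kb, ok).

    `locations` maps a label (e.g. "source", "build", "install") to a path.
    """
    needed = {}
    probe = {}
    for path in locations.values():
        existing = nearest_existing_dir(path)
        dev = os.stat(existing).st_dev  # identifies the filesystem
        needed[dev] = needed.get(dev, 0) + per_location_kb
        probe[dev] = existing
    report = {}
    for dev, kb in needed.items():
        avail_kb = shutil.disk_usage(probe[dev]).free // 1024
        report[dev] = (kb, avail_kb, avail_kb >= kb)
    return report
```

If all three paths live on one filesystem, this reduces to a single 15GB requirement, close to the existing 16000000 threshold; if they are spread out, each filesystem only needs its own share.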


boegel commented Jun 5, 2023

@ocaisa Conflicts to fix.

Shall we re-target this to the (new) 2023.04 branch?

TopRichard pushed a commit to TopRichard/bot-software-layer1 that referenced this pull request Nov 3, 2023
…ld/4.8.2

{2023.06}[system] EasyBuild V4.8.2

ocaisa commented Dec 21, 2023

GPU support implemented with #434

@ocaisa ocaisa closed this Dec 21, 2023