
Add CUDA support #172

Closed · wants to merge 50 commits
Changes from 38 commits
065efd1
Add scripts to support CUDA
huebner-m Apr 14, 2022
caa43bf
Fix dump location of check whether CUDA module is installed
huebner-m May 12, 2022
d7212a0
Remove setup script, use shipped init script to set env vars etc. ins…
huebner-m May 12, 2022
48f4455
Check return values and path existence in CUDA tests
huebner-m May 13, 2022
c50daa2
Check return value of eb install, improve source of other scripts
huebner-m May 13, 2022
d4e85cc
Use mktemp to create temporary directory to install compat libs
huebner-m May 13, 2022
590e042
Fix echo
huebner-m May 13, 2022
7b9bb49
Replace explicit dir names with variables, check symlink destination
huebner-m May 16, 2022
01844c6
Install CUDA in modified version of EESSI_SOFTWARE_PATH
huebner-m May 16, 2022
0e8861f
If CUDA install dir exists, add it to EESSI_MODULE_PATH
huebner-m May 16, 2022
7d6af69
Use env var to check for GPU support and add this to module path
huebner-m May 16, 2022
2cc5ce9
Move (conditional) installation of cuda compat libs to external script
huebner-m May 18, 2022
d53e80e
Consistently use EESSI_SITE_MODULEPATH to set up GPU support for Lmod
huebner-m May 18, 2022
850c20e
Rename script to add (NVIDIA) GPU support, add dummy script for AMD GPUs
huebner-m May 19, 2022
9b2e72f
Add shebang
huebner-m May 19, 2022
5f82658
Add option to disable checks, enables installation on nodes w/o GPUs
huebner-m May 19, 2022
16e87af
Allow using an environment variable to skip GPU checks
huebner-m May 19, 2022
cf65a37
Update list of CUDA enabled toolchains
huebner-m May 19, 2022
7319db2
Tell users to use the updated path to enable CUDA support
huebner-m May 19, 2022
6537725
Add protection against warning if CUDA is not installed on host
huebner-m May 20, 2022
2ba47e4
Add README for GPU support
huebner-m Jun 2, 2022
ac268b1
Iterate over compat libs until we find something that works
Jun 3, 2022
bb5301b
Don't use source when we don't need to
Jun 3, 2022
75ce850
Merge pull request #1 from ocaisa/add_gpu_support
huebner-m Jun 8, 2022
17b7662
Small adjustments to make things work on Debian10, remove debug state…
huebner-m Jun 8, 2022
03b01f1
Make installed CUDA version configurable via env var with a default
huebner-m Jun 8, 2022
dadb170
Use generic latest symlink when sourcing init/bash instead specific v…
huebner-m Jun 8, 2022
0f5884f
Implement suggested changes (don't source when not needed, update REA…
huebner-m Jun 13, 2022
5f2c1f6
Add exit code and more detailed message when installing without GPUs
huebner-m Jun 17, 2022
efe5f88
Merge branch 'main' of github.com:EESSI/software-layer into add_gpu_s…
huebner-m Jul 19, 2022
ab95873
Update error message when nvidia-smi returns an error code
huebner-m Jul 19, 2022
63fded6
Convert OS version for Ubuntu systems when getting CUDA compat libs
huebner-m Jul 25, 2022
d3cadb5
Use rpm files for all OSes and p7zip to unpack them
huebner-m Jul 27, 2022
81e4135
Rename driver_version to driver_major_version
huebner-m Jul 27, 2022
ec9dd69
Prepare shipping CUDA module file with EESSI
huebner-m Jul 29, 2022
0c1004d
Remove loading of CUDA specific module locations
huebner-m Sep 7, 2022
7a9827b
Check for full CUDA software path (incl. version) when loading the mo…
huebner-m Sep 7, 2022
6e86649
Refine install of p7zip, keep it until software layer provides it
huebner-m Sep 7, 2022
e38391b
Prepend lmod module path only if the dir actually exists
huebner-m Sep 13, 2022
b2a4865
Make printout of CUDA installation more accurate
huebner-m Sep 13, 2022
fb73d12
Only install CUDA module in tmpdir if it's already shipped in EESSI
huebner-m Sep 13, 2022
8c8a227
Load correct module env as long as p7zip is not part of software layer
huebner-m Sep 13, 2022
f90dd66
Load CUDA version specified for installation when testing
huebner-m Sep 13, 2022
a24e09c
Load GCCcore module when building test executable for CUDA
huebner-m Sep 13, 2022
3d0ebad
Add EasyBuild configuration for p7zip installation
huebner-m Sep 13, 2022
fe1843f
Ship whitelisted CUDA libs and rework scripts accordingly
huebner-m Sep 27, 2022
70e5dec
If EULA file exists, CUDA is inst. in host_injections + some fixes
huebner-m Sep 29, 2022
1075d0b
Add check if CUDA compat lib version is sufficient for module
huebner-m Sep 29, 2022
d65fe30
Pass CUDA version from eb hook to compat lib script + fix test dir rm
huebner-m Sep 30, 2022
2ac2671
Update documentation and merge both Lmod load hooks
huebner-m Oct 20, 2022
39 changes: 38 additions & 1 deletion eb_hooks.py
@@ -7,7 +7,7 @@
from easybuild.tools.systemtools import AARCH64, POWER, get_cpu_architecture

EESSI_RPATH_OVERRIDE_ATTR = 'orig_rpath_override_dirs'

CUDA_ENABLED_TOOLCHAINS = ["fosscuda", "gcccuda", "gimpic", "giolfc", "gmklc", "golfc", "gomklc", "gompic", "goolfc", "iccifortcuda", "iimklc", "iimpic", "intelcuda", "iomklc", "iompic", "nvompic", "nvpsmpic"]

def get_eessi_envvar(eessi_envvar):
"""Get an EESSI environment variable from the environment"""
@@ -41,13 +41,32 @@ def get_rpath_override_dirs(software_name):

return rpath_injection_dirs

def inject_gpu_property(ec):
ec_dict = ec.asdict()
# Check if CUDA is in the dependencies, if so add the GPU Lmod tag
if (
"CUDA" in [dep[0] for dep in iter(ec_dict["dependencies"])]
or ec_dict["toolchain"]["name"] in CUDA_ENABLED_TOOLCHAINS
):
key = "modluafooter"
value = 'add_property("arch","gpu")'
Member commented: I think gpu is a recognised property in Lmod so a good choice for now. Once we add AMD support it will get more complicated.

Contributor Author (huebner-m) replied: We can add a new property by extending the property table propT. To do so, we could add a file init/lmodrc.lua with a new property. This file can be loaded using the env var $LMOD_RC. Unfortunately, we do not seem to be able to add entries to arch but rather have to add a new property (or find a way to extend arch that I'm missing).
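A hypothetical sketch of such an init/lmodrc.lua, written as a shell heredoc (the displayT entry, color, and doc string below are assumptions for illustration, not part of this PR):

```shell
# sketch only: create an lmodrc.lua that extends Lmod's property table propT;
# the "gpu" displayT entry, its color, and its doc string are assumptions
cat > lmodrc.lua << 'EOF'
propT = {
    arch = {
        validated = false,
        name = "arch",
        sType = "short",
        displayT = {
            ["gpu"] = { color = "red", doc = "GPU-enabled software", },
        },
    },
}
EOF
# Lmod would pick this file up via the LMOD_RC environment variable, e.g.:
#   export LMOD_RC=$PWD/lmodrc.lua
echo "wrote lmodrc.lua"
```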

if key in ec_dict:
if not value in ec_dict[key]:
ec[key] = "\n".join([ec_dict[key], value])
else:
ec[key] = value
ec.log.info("[parse hook] Injecting gpu as Lmod arch property")

return ec

def parse_hook(ec, *args, **kwargs):
"""Main parse hook: trigger custom functions based on software name."""

# determine path to Prefix installation in compat layer via $EPREFIX
eprefix = get_eessi_envvar('EPREFIX')

ec = inject_gpu_property(ec)

if ec.name in PARSE_HOOKS:
PARSE_HOOKS[ec.name](ec, eprefix)

@@ -103,6 +122,24 @@ def cgal_toolchainopts_precise(ec, eprefix):
raise EasyBuildError("CGAL-specific hook triggered for non-CGAL easyconfig?!")


def pre_fetch_hook(self, *args, **kwargs):
"""Modify install path for CUDA software."""
if self.name == 'CUDA':
self.installdir = self.installdir.replace('versions', 'host_injections')


def pre_module_hook(self, *args, **kwargs):
"""Modify install path for CUDA software."""
if self.name == 'CUDA':
self.installdir = self.installdir.replace('versions', 'host_injections')


def pre_sanitycheck_hook(self, *args, **kwargs):
"""Modify install path for CUDA software."""
if self.name == 'CUDA':
self.installdir = self.installdir.replace('versions', 'host_injections')


def fontconfig_add_fonts(ec, eprefix):
"""Inject --with-add-fonts configure option for fontconfig."""
if ec.name == 'fontconfig':
10 changes: 10 additions & 0 deletions gpu_support/README.md
@@ -0,0 +1,10 @@
# How to add GPU support
The collection of scripts in this directory enables you to add GPU support to your setup.
Note that currently this means that CUDA support can be added for Nvidia GPUs. AMD GPUs are not yet supported (feel free to contribute that though!).
To enable the usage of CUDA in your setup, simply run the following script:
```
./add_nvidia_gpu_support.sh
```
## Prerequisites and tips
* You need write permissions to `/cvmfs/pilot.eessi-hpc.org/host_injections` (which by default is a symlink to `/opt/eessi` but can be configured in your CVMFS config file to point somewhere else). If you would like to make a system-wide installation you should change this in your configuration to point somewhere on a shared filesystem.
* If you want to install CUDA on a node without GPUs (e.g. on a login node where you want to be able to compile your CUDA-enabled code), you should `export INSTALL_WO_GPU=true` in order to skip checks and tests that can only succeed if you have access to a GPU. This approach is not recommended as there is a chance the CUDA compatibility library installed is not compatible with the existing CUDA driver on GPU nodes (and this will not be detected).
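A minimal sketch of the first prerequisite check, verifying write access to the `host_injections` target before running the script (the `check_writable` helper is hypothetical, not part of these scripts):

```shell
# hypothetical helper: check write access to an install target
check_writable() {
    if [ -w "$1" ]; then
        echo "writable: $1"
        return 0
    fi
    echo "not writable: $1 (adjust permissions or your CVMFS config)" >&2
    return 1
}

# demonstrated on a throwaway directory; in practice you would pass
# /cvmfs/pilot.eessi-hpc.org/host_injections (or wherever it points)
demo_dir=$(mktemp -d)
check_writable "$demo_dir"
rmdir "$demo_dir"
```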
14 changes: 14 additions & 0 deletions gpu_support/add_amd_gpu_support.sh
@@ -0,0 +1,14 @@
#!/bin/bash

cat << EOF
This is not implemented yet :(

If you would like to contribute this support there are a few things you will
need to consider:
- We will need to change the Lmod property added to GPU software so we can
distinguish AMD and Nvidia GPUs
- Support should be implemented in user space, if this is not possible (e.g.,
requires a driver update) you need to tell the user what to do
- Support needs to be _verified_ and a trigger put in place (like the existence
of a particular path) so we can tell Lmod to display the associated modules
EOF
204 changes: 204 additions & 0 deletions gpu_support/add_nvidia_gpu_support.sh
@@ -0,0 +1,204 @@
#!/bin/bash

# Drop into the prefix shell or pipe this script into a Prefix shell with
# $EPREFIX/startprefix <<< /path/to/this_script.sh

install_cuda_version="${INSTALL_CUDA_VERSION:=11.3.1}"
install_p7zip_version="${INSTALL_P7ZIP_VERSION:=17.04-GCCcore-10.3.0}"

# If you want to install CUDA support on login nodes (typically without GPUs),
# set this variable to true. This will skip all GPU-dependent checks
install_wo_gpu=false
[ "$INSTALL_WO_GPU" = true ] && install_wo_gpu=true

# verify existence of nvidia-smi or this is a waste of time
# Check if nvidia-smi exists and can be executed without error
if [[ "${install_wo_gpu}" != "true" ]]; then
if command -v nvidia-smi > /dev/null 2>&1; then
nvidia-smi > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "nvidia-smi was found but returned error code, exiting now..." >&2
echo "If you do not have a GPU on this device but wish to force the installation,"
echo "please set the environment variable INSTALL_WO_GPU=true"
exit 1
fi
echo "nvidia-smi found, continue setup."
else
echo "nvidia-smi not found, exiting now..." >&2
echo "If you do not have a GPU on this device but wish to force the installation,"
echo "please set the environment variable INSTALL_WO_GPU=true"
exit 1
fi
else
echo "You requested to install CUDA without GPUs present."
echo "This means that all GPU-dependent tests/checks will be skipped!"
fi

EESSI_SILENT=1 source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

##############################################################################################
# Check that the CUDA driver version is adequate
# (
# needs to be r450 or r470 which are LTS, other production branches are acceptable but not
# recommended, below r450 is not compatible [with an exception we will not explore,see
# https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers]
# )
# only check first number in case of multiple GPUs
if [[ "${install_wo_gpu}" != "true" ]]; then
driver_major_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | tail -n1)
driver_major_version="${driver_major_version%%.*}"
# Now check driver_version for compatibility
# Check driver is at least LTS driver R450, see https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers
if (( $driver_major_version < 450 )); then
echo "Your NVIDIA driver version is too old, please update first.."
exit 1
fi
fi

###############################################################################################
###############################################################################################
# Install CUDA
# TODO: Can we do a trimmed install?
# if modules dir exists, load it for usage within Lmod
cuda_install_dir="${EESSI_SOFTWARE_PATH/versions/host_injections}"
mkdir -p ${cuda_install_dir}
# only install CUDA if specified version is not found
if [ -d ${cuda_install_dir}/software/CUDA/${install_cuda_version} ]; then
echo "CUDA module found! No need to install CUDA again, proceeding with tests"
Member commented: This is not quite accurate, you've tested for the software here, not the module

else
# - as an installation location just use $EESSI_SOFTWARE_PATH but replacing `versions` with `host_injections`
# (CUDA is a binary installation so no need to worry too much about this)
# TODO: The install is pretty fat, you need lots of space for download/unpack/install (~3*5GB), need to do a space check before we proceed
avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}')
if (( ${avail_space} < 16000000 )); then
echo "Need more disk space to install CUDA, exiting now..."
exit 1
fi
# install cuda in host_injections
module load EasyBuild
# we need the --rebuild option, since the module file is shipped with EESSI
tmpdir=$(mktemp -d)
eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
Member commented: Right now this makes testing difficult as the actual CUDA module in EESSI is not available. You might be better off here checking if the CUDA module exists and if so prefixing this command with EASYBUILD_INSTALLPATH_MODULES=${tmpdir}

Member suggested change:
# we need the --rebuild option, since the module file is shipped with EESSI
tmpdir=$(mktemp -d)
eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
# we need the --rebuild option and a random dir for the module if the module file is shipped with EESSI
if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then
tmpdir=$(mktemp -d)
extra_args="--rebuild --installpath-modules=${tmpdir}"
fi
eb ${extra_args} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb

ret=$?
if [ $ret -ne 0 ]; then
echo "CUDA installation failed, please check EasyBuild logs..."
exit 1
fi
rm -rf ${tmpdir}
fi

# Install p7zip, this will be used to install the CUDA compat libraries from rpm.
# The rpm and deb files contain the same libraries, so we just stick to the rpm version.
# If p7zip is missing from the software layer (for whatever reason), we need to install it.
# This has to happen in host_injections, so we check first if it is already installed there.
module use ${cuda_install_dir}/modules/all/
Member commented: Maybe do this conditionally (i.e., only if this directory exists)

module avail 2>&1 | grep -i p7zip &> /dev/null
if [[ $? -eq 0 ]]; then
echo "p7zip module found! No need to install p7zip again, proceeding with installation of compat libraries"
else
# install p7zip in host_injections
module load EasyBuild
eb --robot --installpath=${cuda_install_dir}/ p7zip-${install_p7zip_version}.eb
ret=$?
if [ $ret -ne 0 ]; then
echo "p7zip installation failed, please check EasyBuild logs..."
exit 1
fi
fi

# Check if the CUDA compat libraries are installed and compatible with the target CUDA version
# if not find the latest version of the compatibility libraries and install them

# get URL to latest CUDA compat libs, exit if URL is invalid
cuda_compat_urls="$($(dirname "$BASH_SOURCE")/get_cuda_compatlibs.sh)"
ret=$?
if [ $ret -ne 0 ]; then
echo $cuda_compat_urls
exit 1
fi

# loop over the compat library versions until we get one that works for us
keep_driver_check=1
# Do a maximum of five attempts
for value in {1..5}
do
latest_cuda_compat_url=$(echo $cuda_compat_urls | cut -d " " -f1)
# Chomp that value out of the list
cuda_compat_urls=$(echo $cuda_compat_urls | cut -d " " -f2-)
latest_driver_version="${latest_cuda_compat_url%-*}"
latest_driver_version="${latest_driver_version##*-}"
# URLs differ for different OSes; check if we already have a number, if not remove string part that is not needed
if [[ ! $latest_driver_version =~ ^[0-9]+$ ]]; then
latest_driver_version="${latest_driver_version##*_}"
fi

install_compat_libs=false
host_injections_dir="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia"
# libcuda.so points to actual cuda compat lib with driver version in its name
# if this file exists, cuda compat libs are installed and we can compare the version
if [ -e $host_injections_dir/latest/compat/libcuda.so ]; then
eessi_driver_version=$( realpath $host_injections_dir/latest/compat/libcuda.so)
eessi_driver_version="${eessi_driver_version##*so.}"
else
eessi_driver_version=0
fi

if [ $keep_driver_check -eq 1 ]
then
# only keep the driver check for the latest version
keep_driver_check=0
else
eessi_driver_version=0
fi

if [ ${latest_driver_version//./} -gt ${eessi_driver_version//./} ]; then
install_compat_libs=true
else
echo "CUDA compat libs are up-to-date, skip installation."
fi

if [ "${install_compat_libs}" == true ]; then
bash $(dirname "$BASH_SOURCE")/install_cuda_compatlibs.sh $latest_cuda_compat_url
fi

if [[ "${install_wo_gpu}" != "true" ]]; then
bash $(dirname "$BASH_SOURCE")/test_cuda.sh
if [ $? -eq 0 ]
then
exit 0
else
echo
echo "It looks like your driver is not recent enough to work with that release of CUDA, consider updating!"
echo "I'll try an older release to see if that will work..."
echo
fi
else
echo "Requested to install CUDA without GPUs present, so we skip final tests."
echo "Instead we test if module load CUDA works as expected..."
if [ -d ${cuda_install_dir}/modules/all ]; then
module use ${cuda_install_dir}/modules/all/
else
echo "Cannot load CUDA, modules path does not exist, exiting now..."
exit 1
fi
module load CUDA
ret=$?
if [ $ret -ne 0 ]; then
echo "Could not load CUDA even though modules path exists..."
exit 1
else
echo "Successfully loaded CUDA, you are good to go! :)"
echo " - To build CUDA enabled modules use ${EESSI_SOFTWARE_PATH/versions/host_injections} as your EasyBuild prefix"
echo " - To use these modules:"
echo " module use ${EESSI_SOFTWARE_PATH/versions/host_injections}/modules/all/"
echo " - Please keep in mind that we just installed the latest CUDA compat libs."
echo " Since we have no GPU to test with, we cannot guarantee that it will work with the installed CUDA drivers on your GPU node(s)."
exit 0
fi
break
fi
done

echo "Tried to install 5 different generations of compat libraries and none worked,"
echo "this usually means your driver is very out of date!"
exit 1
21 changes: 21 additions & 0 deletions gpu_support/get_cuda_compatlibs.sh
@@ -0,0 +1,21 @@
#!/bin/bash

current_dir=$(dirname $(realpath $0))

# Get arch type from EESSI environment
if [[ -z "${EESSI_CPU_FAMILY}" ]]; then
# set up basic environment variables, EasyBuild and Lmod
EESSI_SILENT=1 source /cvmfs/pilot.eessi-hpc.org/latest/init/bash
fi
eessi_cpu_family="${EESSI_CPU_FAMILY:-x86_64}"

# build URL for CUDA libraries
# take rpm file for compat libs from rhel8 folder, deb and rpm files contain the same libraries
cuda_url="https://developer.download.nvidia.com/compute/cuda/repos/rhel8/"${eessi_cpu_family}"/"
# get all versions in descending order
files=$(curl -s "${cuda_url}" | grep 'cuda-compat' | sed 's/<\/\?[^>]\+>//g' | xargs -n1 | /cvmfs/pilot.eessi-hpc.org/latest/compat/linux/${eessi_cpu_family}/bin/sort -r --version-sort )
if [[ -z "${files// }" ]]; then
echo "Could not find any compat lib files under" ${cuda_url}
exit 1
fi
for file in $files; do echo "${cuda_url}$file"; done
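For context, a self-contained sketch of how `add_nvidia_gpu_support.sh` consumes this output: it takes the first URL, drops it from the list, and extracts the driver version embedded in the file name (the sample URLs below are made up):

```shell
# made-up sample of the URL list this script prints (newest first)
cuda_compat_urls="https://example.com/cuda-compat-520.61.05-1.x86_64.rpm https://example.com/cuda-compat-515.86.01-1.x86_64.rpm"

# take the first URL, then drop it from the list
latest_url=$(echo $cuda_compat_urls | cut -d " " -f1)
cuda_compat_urls=$(echo $cuda_compat_urls | cut -d " " -f2-)

# strip the trailing "-1.x86_64.rpm", then everything up to the last "-",
# leaving just the driver version from the file name
latest_driver_version="${latest_url%-*}"
latest_driver_version="${latest_driver_version##*-}"
echo "$latest_driver_version"
```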