
Add CUDA support #172

Closed · wants to merge 50 commits
Changes from 17 commits
065efd1
Add scripts to support CUDA
huebner-m Apr 14, 2022
caa43bf
Fix dump location of check whether CUDA module is installed
huebner-m May 12, 2022
d7212a0
Remove setup script, use shipped init script to set env vars etc. ins…
huebner-m May 12, 2022
48f4455
Check return values and path existence in CUDA tests
huebner-m May 13, 2022
c50daa2
Check return value of eb install, improve source of other scripts
huebner-m May 13, 2022
d4e85cc
Use mktemp to create temporary directory to install compat libs
huebner-m May 13, 2022
590e042
Fix echo
huebner-m May 13, 2022
7b9bb49
Replace explicit dir names with variables, check symlink destination
huebner-m May 16, 2022
01844c6
Install CUDA in modified version of EESSI_SOFTWARE_PATH
huebner-m May 16, 2022
0e8861f
If CUDA install dir exists, add it to EESSI_MODULE_PATH
huebner-m May 16, 2022
7d6af69
Use env var to check for GPU support and add this to module path
huebner-m May 16, 2022
2cc5ce9
Move (conditional) installation of cuda compat libs to external script
huebner-m May 18, 2022
d53e80e
Consistently use EESSI_SITE_MODULEPATH to set up GPU support for Lmod
huebner-m May 18, 2022
850c20e
Rename script to add (NVIDIA) GPU support, add dummy script for AMD GPUs
huebner-m May 19, 2022
9b2e72f
Add shebang
huebner-m May 19, 2022
5f82658
Add option to disable checks, enables installation on nodes w/o GPUs
huebner-m May 19, 2022
16e87af
Allow using an environment variable to skip GPU checks
huebner-m May 19, 2022
cf65a37
Update list of CUDA enabled toolchains
huebner-m May 19, 2022
7319db2
Tell users to use the updated path to enable CUDA support
huebner-m May 19, 2022
6537725
Add protection against warning if CUDA is not installed on host
huebner-m May 20, 2022
2ba47e4
Add README for GPU support
huebner-m Jun 2, 2022
ac268b1
Iterate over compat libs until we find something that works
Jun 3, 2022
bb5301b
Don't use source when we don't need to
Jun 3, 2022
75ce850
Merge pull request #1 from ocaisa/add_gpu_support
huebner-m Jun 8, 2022
17b7662
Small adjustments to make things work on Debian10, remove debug state…
huebner-m Jun 8, 2022
03b01f1
Make installed CUDA version configurable via env var with a default
huebner-m Jun 8, 2022
dadb170
Use generic latest symlink when sourcing init/bash instead specific v…
huebner-m Jun 8, 2022
0f5884f
Implement suggested changes (don't source when not needed, update REA…
huebner-m Jun 13, 2022
5f2c1f6
Add exit code and more detailed message when installing without GPUs
huebner-m Jun 17, 2022
efe5f88
Merge branch 'main' of github.com:EESSI/software-layer into add_gpu_s…
huebner-m Jul 19, 2022
ab95873
Update error message when nvidia-smi returns an error code
huebner-m Jul 19, 2022
63fded6
Convert OS version for Ubuntu systems when getting CUDA compat libs
huebner-m Jul 25, 2022
d3cadb5
Use rpm files for all OSes and p7zip to unpack them
huebner-m Jul 27, 2022
81e4135
Rename driver_version to driver_major_version
huebner-m Jul 27, 2022
ec9dd69
Prepare shipping CUDA module file with EESSI
huebner-m Jul 29, 2022
0c1004d
Remove loading of CUDA specific module locations
huebner-m Sep 7, 2022
7a9827b
Check for full CUDA software path (incl. version) when loading the mo…
huebner-m Sep 7, 2022
6e86649
Refine install of p7zip, keep it until software layer provides it
huebner-m Sep 7, 2022
e38391b
Prepend lmod module path only if the dir actually exists
huebner-m Sep 13, 2022
b2a4865
Make printout of CUDA installation more accurate
huebner-m Sep 13, 2022
fb73d12
Only install CUDA module in tmpdir if it's already shipped in EESSI
huebner-m Sep 13, 2022
8c8a227
Load correct module env as long as p7zip is not part of software layer
huebner-m Sep 13, 2022
f90dd66
Load CUDA version specified to for installation when testing
huebner-m Sep 13, 2022
a24e09c
Load GCCcore module when building test executable for CUDA
huebner-m Sep 13, 2022
3d0ebad
Add EasyBuild configuration for p7zip installation
huebner-m Sep 13, 2022
fe1843f
Ship whitelisted CUDA libs and rework scripts accordingly
huebner-m Sep 27, 2022
70e5dec
If EULA file exists, CUDA is inst. in host_injections + some fixes
huebner-m Sep 29, 2022
1075d0b
Add check if CUDA compat lib version is sufficient for module
huebner-m Sep 29, 2022
d65fe30
Pass CUDA version from eb hook to compat lib script + fix test dir rm
huebner-m Sep 30, 2022
2ac2671
Update documentation and merge both Lmod load hooks
huebner-m Oct 20, 2022
4 changes: 4 additions & 0 deletions EESSI-pilot-install-software.sh
@@ -108,6 +108,10 @@ module --force purge
# ignore current $MODULEPATH entirely
module unuse $MODULEPATH
module use $EASYBUILD_INSTALLPATH/modules/all
if [ ! -z "${EESSI_SITE_MODULEPATH}" ]; then
Review comment (Member): This will no longer be necessary once we ship the CUDA module with EESSI (with the actual installation being under host_injections).

echo_green "Add ${EESSI_SITE_MODULEPATH} to \$MODULEPATH for GPU support!"
module use ${EESSI_SITE_MODULEPATH}
fi
if [[ -z ${MODULEPATH} ]]; then
fatal_error "Failed to set up \$MODULEPATH?!"
else
21 changes: 20 additions & 1 deletion eb_hooks.py
@@ -7,7 +7,7 @@
from easybuild.tools.systemtools import AARCH64, POWER, get_cpu_architecture

EESSI_RPATH_OVERRIDE_ATTR = 'orig_rpath_override_dirs'

CUDA_ENABLED_TOOLCHAINS = ["pmvmklc", "gmvmklc", "gmvapich2c", "pmvapich2c"]

def get_eessi_envvar(eessi_envvar):
"""Get an EESSI environment variable from the environment"""
@@ -41,13 +41,32 @@ def get_rpath_override_dirs(software_name):

return rpath_injection_dirs

def inject_gpu_property(ec):
ec_dict = ec.asdict()
# Check if CUDA is in the dependencies, if so add the GPU Lmod tag
if (
"CUDA" in [dep[0] for dep in iter(ec_dict["dependencies"])]
or ec_dict["toolchain"]["name"] in CUDA_ENABLED_TOOLCHAINS
):
key = "modluafooter"
value = 'add_property("arch","gpu")'
Review comment (Member): I think gpu is a recognised property in Lmod so a good choice for now. Once we add AMD support it will get more complicated.

Reply (Contributor, Author): We can add a new property by extending the property table propT. To do so, we could add a file init/lmodrc.lua with a new property. This file can be loaded using the env var $LMOD_RC. Unfortunately, we do not seem to be able to add entries to arch but rather have to add a new property (or find a way to extend arch that I'm missing).
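The reply above sketches how a custom property could be registered. A minimal, hypothetical illustration in shell: the propT layout follows the structure of Lmod's shipped lmodrc.lua, but the gpu_vendor property name and its values are invented for this sketch and are not part of the PR.

```shell
# Write a hypothetical init/lmodrc.lua that extends Lmod's property table,
# then point $LMOD_RC at it so Lmod picks it up on startup.
lmodrc_dir=$(mktemp -d)
cat > "${lmodrc_dir}/lmodrc.lua" << 'EOF'
propT = {
   gpu_vendor = {
      validT = { nvidia = 1, amd = 1 },
      displayT = {
         nvidia = { short = "(gpu:nvidia)", long = "Built for NVIDIA GPUs" },
         amd    = { short = "(gpu:amd)",    long = "Built for AMD GPUs"    },
      },
   },
}
EOF
export LMOD_RC="${lmodrc_dir}/lmodrc.lua"
# An easyconfig hook could then inject: add_property("gpu_vendor", "nvidia")
grep -q 'gpu_vendor' "$LMOD_RC" && echo "property table ready"
```

Whether arch itself can be extended is left open in the thread; this sketch sidesteps the question by defining a separate property.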

if key in ec_dict:
if not value in ec_dict[key]:
ec[key] = "\n".join([ec_dict[key], value])
else:
ec[key] = value
ec.log.info("[parse hook] Injecting gpu as Lmod arch property")

return ec

def parse_hook(ec, *args, **kwargs):
"""Main parse hook: trigger custom functions based on software name."""

# determine path to Prefix installation in compat layer via $EPREFIX
eprefix = get_eessi_envvar('EPREFIX')

ec = inject_gpu_property(ec)

if ec.name in PARSE_HOOKS:
PARSE_HOOKS[ec.name](ec, eprefix)

14 changes: 14 additions & 0 deletions gpu_support/add_amd_gpu_support.sh
@@ -0,0 +1,14 @@
#!/bin/bash

cat << EOF
This is not implemented yet :(

If you would like to contribute this support there are a few things you will
need to consider:
- We will need to change the Lmod property added to GPU software so we can
distinguish AMD and Nvidia GPUs
- Support should be implemented in user space, if this is not possible (e.g.,
requires a driver update) you need to tell the user what to do
- Support needs to be _verified_ and a trigger put in place (like the existence
of a particular path) so we can tell Lmod to display the associated modules
EOF
190 changes: 190 additions & 0 deletions gpu_support/add_nvidia_gpu_support.sh
@@ -0,0 +1,190 @@
#!/bin/bash

# Drop into the prefix shell or pipe this script into a Prefix shell with
# $EPREFIX/startprefix <<< /path/to/this_script.sh

# If you want to install CUDA support on login nodes (typically without GPUs),
# set this variable to true. This will skip all GPU-dependent checks
install_wo_gpu=false
[ "$INSTALL_WO_GPU" = true ] && install_wo_gpu=true

# verify existence of nvidia-smi or this is a waste of time
# Check if nvidia-smi exists and can be executed without error
if [[ "${install_wo_gpu}" != "true" ]]; then
if command -v nvidia-smi > /dev/null 2>&1; then
nvidia-smi > /dev/null 2>&1
if [ $? -ne 0 ]; then
echo "nvidia-smi was found but returned error code, exiting now..." >&2
exit 1
fi
echo "nvidia-smi found, continue setup."
else
echo "nvidia-smi not found, exiting now..." >&2
echo "If you do not have a GPU on this device but wish to force the installation,"
echo "please set the environment variable INSTALL_WO_GPU=true"
exit 1
fi
else
echo "You requested to install CUDA without GPUs present."
echo "This means that all GPU-dependent tests/checks will be skipped!"
fi

# set up basic environment variables, EasyBuild and Lmod
EESSI_SILENT=1 source /cvmfs/pilot.eessi-hpc.org/versions/2021.12/init/bash

current_dir=$(dirname $(realpath $0))

# Get arch type from EESSI environment
eessi_cpu_family="${EESSI_CPU_FAMILY:-x86_64}"

# Get OS family
# TODO: needs more thorough testing
os_family=$(uname | tr '[:upper:]' '[:lower:]')

# Get OS version
# TODO: needs more thorough testing, taken from https://unix.stackexchange.com/a/6348
if [ -f /etc/os-release ]; then
# freedesktop.org and systemd
. /etc/os-release
os=$NAME
ver=$VERSION_ID
if [[ "$os" == *"Rocky"* ]]; then
os="rhel"
fi
if [[ "$os" == *"Debian"* ]]; then
os="debian"
fi
elif type lsb_release >/dev/null 2>&1; then
# linuxbase.org
os=$(lsb_release -si)
ver=$(lsb_release -sr)
elif [ -f /etc/lsb-release ]; then
# For some versions of Debian/Ubuntu without lsb_release command
. /etc/lsb-release
os=$DISTRIB_ID
ver=$DISTRIB_RELEASE
elif [ -f /etc/debian_version ]; then
# Older Debian/Ubuntu/etc.
os=Debian
ver=$(cat /etc/debian_version)
else
# Fall back to uname, e.g. "Linux <version>", also works for BSD, etc.
os=$(uname -s)
ver=$(uname -r)
fi
# Convert OS version to major versions, e.g. rhel8.5 -> rhel8
# TODO: needs testing for e.g. Ubuntu 20.04
ver=${ver%.*}
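As a quick aside on the `${ver%.*}` expansion used above (and the Ubuntu TODO): the shortest-suffix deletion drops only the final version component, which matters for multi-part versions.

```shell
# ${ver%.*} strips the shortest suffix starting at the last dot,
# i.e. it drops only the final version component.
ver="8.5";   echo "${ver%.*}"    # rhel8.5 -> 8
ver="20.04"; echo "${ver%.*}"    # Ubuntu -> 20 (the repo URL needs "2004", handled separately)
ver="10.12"; echo "${ver%.*}"    # Debian -> 10
# For versions with more components only the last part is removed:
ver="11.6.2"; echo "${ver%.*}"   # -> 11.6; use ${ver%%.*} to keep just "11"
```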

##############################################################################################
# Check that the CUDA driver version is adequate
# (
# needs to be r450 or r470 which are LTS, other production branches are acceptable but not
# recommended, below r450 is not compatible [with an exception we will not explore,see
# https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers]
# )
# only check first number in case of multiple GPUs
if [[ "${install_wo_gpu}" != "true" ]]; then
driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | tail -n1)
driver_version="${driver_version%%.*}"
Review comment (Member): This should be driver_major_version.

# Now check driver_version for compatibility
# Check driver is at least LTS driver R450, see https://docs.nvidia.com/datacenter/tesla/drivers/#cuda-drivers
if (( $driver_version < 450 )); then
echo "Your NVIDIA driver version is too old, please update first.."
exit 1
fi
fi


# Check if the CUDA compat libraries are installed and compatible with the target CUDA version
# if not find the latest version of the compatibility libraries and install them

# get URL to latest CUDA compat libs, exit if URL is invalid
latest_cuda_compat_url="$($(dirname "$BASH_SOURCE")/get_latest_cuda_compatlibs.sh ${os} ${ver} ${eessi_cpu_family})"
ret=$?
if [ $ret -ne 0 ]; then
echo $latest_cuda_compat_url
exit 1
fi
latest_driver_version="${latest_cuda_compat_url%-*}"
latest_driver_version="${latest_driver_version##*_}"
Review comment (Member), suggested change:
-latest_driver_version="${latest_driver_version##*_}"
+latest_driver_version="${latest_driver_version##*-}"
I needed this change for this to work for me.

Reply (Contributor, Author): This has been addressed in your PR (75ce850) with a small modification in commit 17b7662.
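With a hypothetical compat-lib URL (the file name here is illustrative, modeled on NVIDIA's repo naming), the two expansions from the thread behave like this:

```shell
# Illustrative URL; real file names follow NVIDIA's compat-package layout.
url="https://example.com/cuda-compat-11-7-515.65.01-1.x86_64.rpm"
v="${url%-*}"     # drop the suffix after the LAST dash: "-1.x86_64.rpm"
echo "$v"         # https://example.com/cuda-compat-11-7-515.65.01
v="${v##*-}"      # keep what follows the last remaining dash
echo "$v"         # 515.65.01
# The original ${v##*_} matched nothing here (no underscore left after the
# first step), which is why the suggestion swaps "_" for "-".
```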


install_compat_libs=false
host_injections_dir="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia"
# libcuda.so points to actual cuda compat lib with driver version in its name
# if this file exists, cuda compat libs are installed and we can compare the version
if [ -e $host_injections_dir/latest/compat/libcuda.so ]; then
eessi_driver_version=$( realpath $host_injections_dir/latest/compat/libcuda.so)
eessi_driver_version="${eessi_driver_version##*so.}"
else
eessi_driver_version=0
fi

if [ ${latest_driver_version//./} -gt ${eessi_driver_version//./} ]; then
install_compat_libs=true
else
echo "CUDA compat libs are up-to-date, skip installation."
fi
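The `-gt` comparison above deletes the dots to get plain integers. A small sketch of where that works and where it can mislead, with invented version strings:

```shell
# ${var//./} deletes every dot, turning a dotted version into an integer
a="515.65.01"; b="510.47.03"
[ "${a//./}" -gt "${b//./}" ] && echo "515.65.01 is newer"   # 5156501 > 5104703
# Caveat: this assumes both versions have the same number of components and
# equal-width fields; "515.7" would become 5157 and compare as older than b.
```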

if [ "${install_compat_libs}" == true ]; then
source $(dirname "$BASH_SOURCE")/install_cuda_compatlibs.sh $latest_cuda_compat_url
fi

###############################################################################################
###############################################################################################
# Install CUDA
# TODO: Can we do a trimmed install?
# if modules dir exists, load it for usage within Lmod
cuda_install_dir="${EESSI_SOFTWARE_PATH/versions/host_injections}"
mkdir -p ${cuda_install_dir}
if [ -d ${cuda_install_dir}/modules/all ]; then
module use ${cuda_install_dir}/modules/all
fi
# only install CUDA if specified version is not found
install_cuda_version="11.3.1"
module avail 2>&1 | grep -i CUDA/${install_cuda_version} &> /dev/null
if [[ $? -eq 0 ]]; then
echo "CUDA module found! No need to install CUDA again, proceeding with tests"
else
# - as an installation location just use $EESSI_SOFTWARE_PATH but replacing `versions` with `host_injections`
# (CUDA is a binary installation so no need to worry too much about this)
# TODO: The install is pretty fat, you need lots of space for download/unpack/install (~3*5GB), need to do a space check before we proceed
avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}')
if (( ${avail_space} < 16000000 )); then
echo "Need more disk space to install CUDA, exiting now..."
exit 1
fi
# install cuda in host_injections
module load EasyBuild
eb --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
ret=$?
if [ $ret -ne 0 ]; then
echo "CUDA installation failed, please check EasyBuild logs..."
exit 1
fi
fi

cd $current_dir
if [[ "${install_wo_gpu}" != "true" ]]; then
source $(dirname "$BASH_SOURCE")/test_cuda
else
echo "Requested to install CUDA without GPUs present, so we skip final tests."
echo "Instead we test if module load CUDA works as expected..."
if [ -d ${cuda_install_dir}/modules/all ]; then
module use ${cuda_install_dir}/modules/all/
else
echo "Cannot load CUDA, modules path does not exist, exiting now..."
exit 1
fi
module load CUDA
ret=$?
if [ $ret -ne 0 ]; then
echo "Could not load CUDA even though modules path exists..."
exit 1
else
echo "Successfully loaded CUDA, you are good to go! :)"
echo " - To build CUDA enabled modules use /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/ as your EasyBuild prefix"
echo " - To use these modules:"
echo " module use /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/modules/all/"
fi
fi
21 changes: 21 additions & 0 deletions gpu_support/get_latest_cuda_compatlibs.sh
@@ -0,0 +1,21 @@
#!/bin/bash

os=$1
ver=$2
eessi_cpu_family=$3

# build URL for CUDA libraries
cuda_url="https://developer.download.nvidia.com/compute/cuda/repos/"${os}${ver}"/"${eessi_cpu_family}"/"
# get latest version, files are sorted by date
# TODO: probably better to explicitly check version numbers than trusting that it is sorted
latest_file=$(curl -s "${cuda_url}" | grep 'cuda-compat' | tail -1)
if [[ -z "${latest_file// }" ]]; then
echo "Could not find any compat lib files under" ${cuda_url}
exit 1
fi
# extract actual file name from html snippet
file=$(echo $latest_file | sed 's/<\/\?[^>]\+>//g')
# build final URL for wget
cuda_url="${cuda_url}$file"
# simply echo the URL, result will be used by add_gpu_support.sh
echo $cuda_url
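To see what the `sed` expression does, here is the same tag-stripping applied to an illustrative directory-index snippet (the markup and file name are invented; the pattern uses GNU sed syntax, as in the script):

```shell
# A directory-index line of the kind the NVIDIA repo listing returns
latest_file="<a href='cuda-compat-11-7-515.65.01-1.x86_64.rpm'>cuda-compat-11-7-515.65.01-1.x86_64.rpm</a>"
# Remove every opening and closing HTML tag, leaving only the text content
file=$(echo $latest_file | sed 's/<\/\?[^>]\+>//g')
echo "$file"   # cuda-compat-11-7-515.65.01-1.x86_64.rpm
```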
73 changes: 73 additions & 0 deletions gpu_support/install_cuda_compatlibs.sh
@@ -0,0 +1,73 @@
#!/bin/bash

libs_url=$1

current_dir=$(dirname $(realpath $0))

# Create a general space for our NVIDIA compat drivers
if [ -w /cvmfs/pilot.eessi-hpc.org/host_injections ]; then
mkdir -p ${host_injections_dir}
cd ${host_injections_dir}
else
echo "Cannot write to eessi host_injections space, exiting now..." >&2
exit 1
fi

# Check if we have any version installed by checking for the existence of /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest

driver_cuda_version=$(nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
eessi_cuda_version=$(LD_LIBRARY_PATH=${host_injections_dir}/latest/compat/:$LD_LIBRARY_PATH nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}' | sed s/\\.//)
if [ "$driver_cuda_version" -gt "$eessi_cuda_version" ]; then echo "You need to update your CUDA compatibility libraries"; fi

# Check if our target CUDA is satisfied by what is installed already
# TODO: Find required CUDA version and see if we need an update

# If not, grab the latest compat library RPM or deb
# download and unpack in temporary directory, easier cleanup after installation
tmpdir=$(mktemp -d)
cd $tmpdir
compat_file=${libs_url##*/}
wget ${libs_url}

# Unpack it
# (the requirements here are OS dependent, can we get around that?)
# (for rpms looks like we can use https://gitweb.gentoo.org/repo/proj/prefix.git/tree/eclass/rpm.eclass?id=d7fc8cf65c536224bace1d22c0cd85a526490a1e)
# (deb files can be unpacked with ar and tar)
file_extension=${compat_file##*.}
if [[ ${file_extension} == "rpm" ]]; then
rpm2cpio ${compat_file} | cpio -idmv
elif [[ ${file_extension} == "deb" ]]; then
ar x ${compat_file}
tar xf data.tar.*
else
echo "File extension of cuda compat lib not supported, exiting now..." >&2
exit 1
fi
cd $host_injections_dir
# TODO: This would prevent error messages if folder already exists, but could be problematic if only some files are missing in destination dir
mv -n ${tmpdir}/usr/local/cuda-* .
rm -r ${tmpdir}

# Add a symlink that points to the latest version
latest_cuda_dir=$(find . -maxdepth 1 -type d | grep -i cuda | sort | tail -n1)
ln -sf ${latest_cuda_dir} latest

if [ ! -e latest ] ; then
echo "Symlink to latest cuda compat lib version is broken, exiting now..."
exit 1
fi

# Create the space to host the libraries
host_injection_libs_dir=/cvmfs/pilot.eessi-hpc.org/host_injections/${EESSI_PILOT_VERSION}/compat/${os_family}/${eessi_cpu_family}
mkdir -p ${host_injection_libs_dir}
# Symlink in the path to the latest libraries
if [ ! -d "${host_injection_libs_dir}/lib" ]; then
ln -s ${host_injections_dir}/latest/compat ${host_injection_libs_dir}/lib
elif [ ! "${host_injection_libs_dir}/lib" -ef "${host_injections_dir}/latest/compat" ]; then
echo "CUDA compat libs symlink exists but points to the wrong location, please fix this..."
echo "${host_injection_libs_dir}/lib should point to ${host_injections_dir}/latest/compat"
exit 1
fi
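The `-ef` test above is what lets the script tell a correct symlink from a stale one: it checks whether two paths resolve to the same file. A self-contained sketch, using temporary paths invented for the demo:

```shell
# [ a -ef b ] is true when both paths refer to the same device and inode,
# so it follows symlinks -- suitable for validating an existing link target.
workdir=$(mktemp -d)
mkdir "${workdir}/compat"
ln -s "${workdir}/compat" "${workdir}/lib"
if [ "${workdir}/lib" -ef "${workdir}/compat" ]; then
  echo "symlink points at the expected target"
fi
mkdir "${workdir}/other"
ln -sfn "${workdir}/other" "${workdir}/lib"   # repoint the link elsewhere
[ "${workdir}/lib" -ef "${workdir}/compat" ] || echo "link no longer matches"
rm -r "${workdir}"
```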

# return to initial dir
cd $current_dir