Add CUDA support to software_layer #212

Closed
wants to merge 35 commits into from
Changes from 1 commit
Commits
a86c614
Add CUDA support to software_layer
ocaisa Dec 16, 2022
6c41b26
singularity install does not seem to install mksquashfs
ocaisa Dec 16, 2022
7d53b03
Trigger script test
ocaisa Dec 16, 2022
58357b9
Revert
ocaisa Dec 16, 2022
4b6654d
Use the right package name for squash-fs
ocaisa Dec 16, 2022
33ce584
Tidy up hooks
ocaisa Dec 16, 2022
f1cd893
Force creation of links
ocaisa Dec 16, 2022
06a9eaf
Install host_injections CUDA
ocaisa Dec 19, 2022
b4e80a1
Move comments to the right place
ocaisa Dec 19, 2022
2c86973
Reimplement `mkdir -p` reporting where permissions break down
ocaisa Feb 14, 2023
3909080
Merge branch 'main' into p7zip
ocaisa Feb 24, 2023
85c805c
Merge branch 'ocaisa-patch-2' into p7zip
ocaisa Feb 24, 2023
9590047
Be more agressive on catching errors
ocaisa Feb 24, 2023
1357f76
`${extra_args}` is actually multiple args not a single string
ocaisa Feb 27, 2023
8096c54
Update EESSI-pilot-install-software.sh
ocaisa Feb 27, 2023
ec31edf
Catching echo exit code instead of actual code
ocaisa Mar 1, 2023
0e99db5
Give a full path to the CUDA host injections script
ocaisa Mar 1, 2023
cd11792
Add checks for some whitelist entries for CUDA
ocaisa Mar 1, 2023
f514f81
Fix failing eb installation
ocaisa Mar 1, 2023
be326a1
Make sure we check space in the right places
ocaisa Mar 1, 2023
87c17a3
Merge branch 'main' of github.com:eessi/software-layer into p7zip
ocaisa Mar 2, 2023
103f5fa
Simply wrap `mkdir -p` for better error reporting
ocaisa Mar 3, 2023
f02e5f6
Merge branch 'p7zip' of github.com:ocaisa/software-layer into p7zip
ocaisa Mar 3, 2023
793ba29
Simply wrap `mkdir -p` for better error reporting
ocaisa Mar 3, 2023
c0a1247
Make CUDA version a variable
ocaisa Mar 3, 2023
5e82923
Use TOPDIR, be more descriptive
ocaisa Mar 3, 2023
8384b25
Add missing argument
ocaisa Mar 3, 2023
98fe2a7
Improve error messages in new bash function
ocaisa Mar 3, 2023
bbe7df2
Stick with return_code
ocaisa Mar 3, 2023
95dc245
Use realpath to be consistent with other scripts
ocaisa Mar 3, 2023
a1270f2
Wrong realpath flag
ocaisa Mar 3, 2023
aba486d
Wrong realpath flag
ocaisa Mar 3, 2023
d2d1fc3
Fix typo
ocaisa Mar 3, 2023
562e94b
Always add the rebuild option if we get to the point where we actuall…
ocaisa Mar 3, 2023
b4ae5f0
Expose CUDA_TEMP_DIR
ocaisa Mar 3, 2023
28 changes: 28 additions & 0 deletions EESSI-pilot-install-software.sh
@@ -394,6 +394,34 @@ $EB --from-pr 15885 OpenBLAS-0.3.15-GCC-10.3.0.eb --robot
$EB SciPy-bundle-2021.05-foss-2021a.eb -r --buildpath /dev/shm/$USER/easybuild_build
check_exit_code $? "${ok_msg}" "${fail_msg}"

# CUDA support

# install p7zip (to be able to unpack RPMs)
p7zip_ec="p7zip-17.04-GCCcore-10.3.0.eb"
echo ">> Installing $p7zip_ec..."
ok_msg="$p7zip_ec installed, off to a good (?) start!"
fail_msg="Failed to install $p7zip_ec, woopsie..."
$EB $p7zip_ec --robot
check_exit_code $? "${ok_msg}" "${fail_msg}"

# install CUDA (uses eb_hooks.py to only install runtime)
cuda_ec="CUDA-11.3.1.eb"
echo ">> Installing $cuda_ec..."
ok_msg="$cuda_ec installed, off to a good (?) start!"
fail_msg="Failed to install $cuda_ec, woopsie..."
$EB $cuda_ec --robot
check_exit_code $? "${ok_msg}" "${fail_msg}"

# install CUDA samples (requires EESSI support for CUDA)
# TODO Run EESSI NVIDIA GPU support script here
# (which unbreaks the symlinks from the runtime installation)
cuda_samples_ec="CUDA-Samples-11.3-GCC-10.3.0-CUDA-11.3.1.eb"
echo ">> Installing $cuda_samples_ec..."
ok_msg="$cuda_samples_ec installed, off to a good (?) start!"
fail_msg="Failed to install $cuda_samples_ec, woopsie..."
$EB $cuda_samples_ec --robot --from-pr=16914
check_exit_code $? "${ok_msg}" "${fail_msg}"

### add packages here

echo ">> Creating/updating Lmod cache..."
265 changes: 185 additions & 80 deletions eb_hooks.py
@@ -8,51 +8,56 @@
from easybuild.tools.systemtools import AARCH64, POWER, X86_64, get_cpu_architecture, get_cpu_features
from easybuild.tools.toolchain.compiler import OPTARCH_GENERIC

EESSI_RPATH_OVERRIDE_ATTR = 'orig_rpath_override_dirs'
EESSI_RPATH_OVERRIDE_ATTR = "orig_rpath_override_dirs"

CUDA_ENABLED_TOOLCHAINS = [
    "fosscuda",
    "gcccuda",
    "gimpic",
    "giolfc",
    "gmklc",
    "golfc",
    "gomklc",
    "gompic",
    "goolfc",
    "iccifortcuda",
    "iimklc",
    "iimpic",
    "intelcuda",
    "iomklc",
    "iompic",
    "nvompic",
    "nvpsmpic",
]

PARSE_HOOKS = {
    "CGAL": cgal_toolchainopts_precise,
    "fontconfig": fontconfig_add_fonts,
    "UCX": ucx_eprefix,
}

def get_eessi_envvar(eessi_envvar):
"""Get an EESSI environment variable from the environment"""

eessi_envvar_value = os.getenv(eessi_envvar)
if eessi_envvar_value is None:
raise EasyBuildError("$%s is not defined!", eessi_envvar)

return eessi_envvar_value


def get_rpath_override_dirs(software_name):
# determine path to installations in software layer via $EESSI_SOFTWARE_PATH
eessi_software_path = get_eessi_envvar('EESSI_SOFTWARE_PATH')
eessi_pilot_version = get_eessi_envvar('EESSI_PILOT_VERSION')

# construct the rpath override directory stub
rpath_injection_stub = os.path.join(
# Make sure we are looking inside the `host_injections` directory
eessi_software_path.replace(eessi_pilot_version, os.path.join('host_injections', eessi_pilot_version), 1),
# Add the subdirectory for the specific software
'rpath_overrides',
software_name,
# We can't know the version, but this allows the use of a symlink
# to facilitate version upgrades without removing files
'system',
)

# Allow for libraries in lib or lib64
rpath_injection_dirs = [os.path.join(rpath_injection_stub, x) for x in ('lib', 'lib64')]
PRE_CONFIGURE_HOOKS = {
    "libfabric": libfabric_disable_psm3_x86_64_generic,
    "MetaBAT": metabat_preconfigure,
    "WRF": wrf_preconfigure,
}

return rpath_injection_dirs
POST_PACKAGE_HOOKS = {
    "CUDA": cuda_post_package,
}


def parse_hook(ec, *args, **kwargs):
"""Main parse hook: trigger custom functions based on software name."""

# determine path to Prefix installation in compat layer via $EPREFIX
eprefix = get_eessi_envvar('EPREFIX')
eprefix = get_eessi_envvar("EPREFIX")

if ec.name in PARSE_HOOKS:
PARSE_HOOKS[ec.name](ec, eprefix)

ec = inject_gpu_property(ec)


def pre_configure_hook(self, *args, **kwargs):
"""Main pre-configure hook: trigger custom functions based on software name."""
@@ -74,80 +79,124 @@ def pre_prepare_hook(self, *args, **kwargs):

# update the relevant option (but keep the original value so we can reset it later)
if hasattr(self, EESSI_RPATH_OVERRIDE_ATTR):
raise EasyBuildError("'self' already has attribute %s! Can't use pre_prepare hook.",
EESSI_RPATH_OVERRIDE_ATTR)
raise EasyBuildError(
"'self' already has attribute %s! Can't use pre_prepare hook.", EESSI_RPATH_OVERRIDE_ATTR
)

setattr(self, EESSI_RPATH_OVERRIDE_ATTR, build_option('rpath_override_dirs'))
setattr(self, EESSI_RPATH_OVERRIDE_ATTR, build_option("rpath_override_dirs"))
if getattr(self, EESSI_RPATH_OVERRIDE_ATTR):
# self.EESSI_RPATH_OVERRIDE_ATTR is (already) a colon separated string, let's make it a list
orig_rpath_override_dirs = [getattr(self, EESSI_RPATH_OVERRIDE_ATTR)]
rpath_override_dirs = ':'.join(orig_rpath_override_dirs + mpi_rpath_override_dirs)
rpath_override_dirs = ":".join(orig_rpath_override_dirs + mpi_rpath_override_dirs)
else:
rpath_override_dirs = ':'.join(mpi_rpath_override_dirs)
update_build_option('rpath_override_dirs', rpath_override_dirs)
print_msg("Updated rpath_override_dirs (to allow overriding MPI family %s): %s",
mpi_family, rpath_override_dirs)
rpath_override_dirs = ":".join(mpi_rpath_override_dirs)
update_build_option("rpath_override_dirs", rpath_override_dirs)
print_msg(
"Updated rpath_override_dirs (to allow overriding MPI family %s): %s", mpi_family, rpath_override_dirs
)


def post_prepare_hook(self, *args, **kwargs):
"""Main post-prepare hook: trigger custom functions."""

if hasattr(self, EESSI_RPATH_OVERRIDE_ATTR):
# Reset the value of 'rpath_override_dirs' now that we are finished with it
update_build_option('rpath_override_dirs', getattr(self, EESSI_RPATH_OVERRIDE_ATTR))
update_build_option("rpath_override_dirs", getattr(self, EESSI_RPATH_OVERRIDE_ATTR))
print_msg("Resetting rpath_override_dirs to original value: %s", getattr(self, EESSI_RPATH_OVERRIDE_ATTR))
delattr(self, EESSI_RPATH_OVERRIDE_ATTR)


def pre_configure_hook(self, *args, **kwargs):
    """Main pre-configure hook: trigger custom functions based on software name."""
    if self.name in PRE_CONFIGURE_HOOKS:
        PRE_CONFIGURE_HOOKS[self.name](self, *args, **kwargs)


def post_package_hook(self, *args, **kwargs):
    """Main post-package hook: trigger custom functions based on software name."""
    if self.name in POST_PACKAGE_HOOKS:
        POST_PACKAGE_HOOKS[self.name](self, *args, **kwargs)
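
The name-keyed dispatch used by these hooks can be sketched in isolation. This is a minimal illustration, not the PR's code: all names prefixed with `demo_` or `DEMO_` are hypothetical. It also shows one point worth keeping in mind when moving the tables around: a dict literal is evaluated when the module is imported, so a table that references functions has to be bound after those functions are defined.

```python
def demo_cuda_post_package(name, log):
    """Hypothetical stand-in for a software-specific post-package function."""
    log.append("post-package step for %s" % name)


# Bound after the function it references, so the name already exists here
DEMO_POST_PACKAGE_HOOKS = {
    "CUDA": demo_cuda_post_package,
}


def demo_post_package_hook(name, log):
    """Look up the software name and call its registered function, if any."""
    if name in DEMO_POST_PACKAGE_HOOKS:
        DEMO_POST_PACKAGE_HOOKS[name](name, log)
        return True
    return False


log = []
ran_cuda = demo_post_package_hook("CUDA", log)  # registered, so it runs
ran_gcc = demo_post_package_hook("GCC", log)    # not registered, silently skipped
```

Software without an entry in the table is simply passed over, which is why the main hook can be called for every package.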


# Functions used by hooks


def get_eessi_envvar(eessi_envvar):
"""Get an EESSI environment variable from the environment"""

eessi_envvar_value = os.getenv(eessi_envvar)
if eessi_envvar_value is None:
raise EasyBuildError("$%s is not defined!", eessi_envvar)

return eessi_envvar_value


def get_rpath_override_dirs(software_name):
# determine path to installations in software layer via $EESSI_SOFTWARE_PATH
eessi_software_path = get_eessi_envvar("EESSI_SOFTWARE_PATH")
eessi_pilot_version = get_eessi_envvar("EESSI_PILOT_VERSION")

# construct the rpath override directory stub
rpath_injection_stub = os.path.join(
# Make sure we are looking inside the `host_injections` directory
eessi_software_path.replace(eessi_pilot_version, os.path.join("host_injections", eessi_pilot_version), 1),
# Add the subdirectory for the specific software
"rpath_overrides",
software_name,
# We can't know the version, but this allows the use of a symlink
# to facilitate version upgrades without removing files
"system",
)

# Allow for libraries in lib or lib64
rpath_injection_dirs = [os.path.join(rpath_injection_stub, x) for x in ("lib", "lib64")]

return rpath_injection_dirs
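
To make the path construction above concrete, here is a small sketch that takes the two values as arguments instead of reading them from the environment. The function name and the sample path are hypothetical and chosen only for illustration; they are not taken from an actual EESSI installation.

```python
import os


def demo_rpath_override_dirs(software_path, pilot_version, software_name):
    """Mirror the path construction with explicit arguments instead of env vars."""
    stub = os.path.join(
        # only the first occurrence of the version gets 'host_injections' prepended
        software_path.replace(pilot_version, os.path.join("host_injections", pilot_version), 1),
        "rpath_overrides",
        software_name,
        "system",
    )
    # one candidate directory per library directory name
    return [os.path.join(stub, libdir) for libdir in ("lib", "lib64")]


# Hypothetical values, for illustration only
dirs = demo_rpath_override_dirs(
    "/cvmfs/example.org/versions/2021.12/software/linux/x86_64/generic",
    "2021.12",
    "OpenMPI",
)
for d in dirs:
    print(d)
```

Because the replacement count is 1, a version string that happens to recur later in the path is left untouched.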


def cgal_toolchainopts_precise(ec, eprefix):
"""Enable 'precise' rather than 'strict' toolchain option for CGAL on POWER."""
if ec.name == 'CGAL':
if ec.name == "CGAL":
if get_cpu_architecture() == POWER:
# 'strict' implies '-mieee-fp', which is not supported on POWER
# see https://github.com/easybuilders/easybuild-framework/issues/2077
ec['toolchainopts']['strict'] = False
ec['toolchainopts']['precise'] = True
print_msg("Tweaked toolchainopts for %s: %s", ec.name, ec['toolchainopts'])
ec["toolchainopts"]["strict"] = False
ec["toolchainopts"]["precise"] = True
print_msg("Tweaked toolchainopts for %s: %s", ec.name, ec["toolchainopts"])
else:
raise EasyBuildError("CGAL-specific hook triggered for non-CGAL easyconfig?!")


def fontconfig_add_fonts(ec, eprefix):
"""Inject --with-add-fonts configure option for fontconfig."""
if ec.name == 'fontconfig':
if ec.name == "fontconfig":
# make fontconfig aware of fonts included with compat layer
with_add_fonts = '--with-add-fonts=%s' % os.path.join(eprefix, 'usr', 'share', 'fonts')
ec.update('configopts', with_add_fonts)
with_add_fonts = "--with-add-fonts=%s" % os.path.join(eprefix, "usr", "share", "fonts")
ec.update("configopts", with_add_fonts)
print_msg("Added '%s' configure option for %s", with_add_fonts, ec.name)
else:
raise EasyBuildError("fontconfig-specific hook triggered for non-fontconfig easyconfig?!")


def ucx_eprefix(ec, eprefix):
"""Make UCX aware of compatibility layer via additional configuration options."""
if ec.name == 'UCX':
ec.update('configopts', '--with-sysroot=%s' % eprefix)
ec.update('configopts', '--with-rdmacm=%s' % os.path.join(eprefix, 'usr'))
print_msg("Using custom configure options for %s: %s", ec.name, ec['configopts'])
if ec.name == "UCX":
ec.update("configopts", "--with-sysroot=%s" % eprefix)
ec.update("configopts", "--with-rdmacm=%s" % os.path.join(eprefix, "usr"))
print_msg("Using custom configure options for %s: %s", ec.name, ec["configopts"])
else:
raise EasyBuildError("UCX-specific hook triggered for non-UCX easyconfig?!")


def pre_configure_hook(self, *args, **kwargs):
"""Main pre-configure hook: trigger custom functions based on software name."""
if self.name in PRE_CONFIGURE_HOOKS:
PRE_CONFIGURE_HOOKS[self.name](self, *args, **kwargs)


def libfabric_disable_psm3_x86_64_generic(self, *args, **kwargs):
"""Add --disable-psm3 to libfabric configure options when building with --optarch=GENERIC on x86_64."""
if self.name == 'libfabric':
if self.name == "libfabric":
if get_cpu_architecture() == X86_64:
generic = build_option('optarch') == OPTARCH_GENERIC
no_avx = 'avx' not in get_cpu_features()
generic = build_option("optarch") == OPTARCH_GENERIC
no_avx = "avx" not in get_cpu_features()
if generic or no_avx:
self.cfg.update('configopts', '--disable-psm3')
print_msg("Using custom configure options for %s: %s", self.name, self.cfg['configopts'])
self.cfg.update("configopts", "--disable-psm3")
print_msg("Using custom configure options for %s: %s", self.name, self.cfg["configopts"])
else:
raise EasyBuildError("libfabric-specific hook triggered for non-libfabric easyconfig?!")

@@ -158,10 +207,10 @@ def metabat_preconfigure(self, *args, **kwargs):
- take into account that zlib is a filtered dependency,
and that there's no libz.a in the EESSI compat layer
"""
if self.name == 'MetaBAT':
configopts = self.cfg['configopts']
if self.name == "MetaBAT":
configopts = self.cfg["configopts"]
regex = re.compile(r"\$EBROOTZLIB/lib/libz.a")
self.cfg['configopts'] = regex.sub('$EPREFIX/usr/lib64/libz.so', configopts)
self.cfg["configopts"] = regex.sub("$EPREFIX/usr/lib64/libz.so", configopts)
else:
raise EasyBuildError("MetaBAT-specific hook triggered for non-MetaBAT easyconfig?!")

@@ -171,24 +220,80 @@ def wrf_preconfigure(self, *args, **kwargs):
Pre-configure hook for WRF:
- patch arch/configure_new.defaults so building WRF with foss toolchain works on aarch64
"""
if self.name == 'WRF':
if self.name == "WRF":
if get_cpu_architecture() == AARCH64:
pattern = "Linux x86_64 ppc64le, gfortran"
repl = "Linux x86_64 aarch64 ppc64le, gfortran"
self.cfg.update('preconfigopts', "sed -i 's/%s/%s/g' arch/configure_new.defaults && " % (pattern, repl))
print_msg("Using custom preconfigopts for %s: %s", self.name, self.cfg['preconfigopts'])
self.cfg.update("preconfigopts", "sed -i 's/%s/%s/g' arch/configure_new.defaults && " % (pattern, repl))
print_msg("Using custom preconfigopts for %s: %s", self.name, self.cfg["preconfigopts"])
else:
raise EasyBuildError("WRF-specific hook triggered for non-WRF easyconfig?!")


PARSE_HOOKS = {
'CGAL': cgal_toolchainopts_precise,
'fontconfig': fontconfig_add_fonts,
'UCX': ucx_eprefix,
}

PRE_CONFIGURE_HOOKS = {
'libfabric': libfabric_disable_psm3_x86_64_generic,
'MetaBAT': metabat_preconfigure,
'WRF': wrf_preconfigure,
}
def cuda_post_package(self, *args, **kwargs):
    """Delete CUDA files we are not allowed to ship and replace them with a symlink to a possible installation under host_injections."""
    print_msg("Replacing CUDA stuff we cannot ship with symlinks...")
    # read CUDA EULA
    eula_path = os.path.join(self.installdir, "EULA.txt")
    tmp_buffer = []
    with open(eula_path) as infile:
        copy = False
        for line in infile:
            if line.strip() == "2.6. Attachment A":
                copy = True
                continue
            elif line.strip() == "2.7. Attachment B":
                copy = False
                continue
            elif copy:
                tmp_buffer.append(line)
    # create whitelist without file extensions, they're not really needed and they only complicate things
    whitelist = []
    file_extensions = [".so", ".a", ".h", ".bc"]
    for tmp in tmp_buffer:
        for word in tmp.split():
            if any(ext in word for ext in file_extensions):
                whitelist.append(word.split(".")[0])
    whitelist = list(set(whitelist))
    # iterate over all files in the CUDA path
    for root, dirs, files in os.walk(self.installdir):
        for filename in files:
            # we only really care about real files, i.e. not symlinks
            if not os.path.islink(os.path.join(root, filename)):
                # check if the current file is part of the whitelist
                basename = filename.split(".")[0]
                if basename not in whitelist:
                    # if it is not in the whitelist, delete the file and create a symlink to host_injections
                    source = os.path.join(root, filename)
                    target = source.replace("versions", "host_injections")
                    os.remove(source)
                    # Using os.symlink requires the existence of the target directory, so we use os.system
                    # (note the -s flag: plain `ln` would attempt a hard link rather than a symlink)
                    os.system("ln -s %s %s" % (target, source))
Member Author

We should be checking that these are succeeding
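
One way to address this is sketched below, under stated assumptions: the helper names `parse_whitelist` and `replace_with_symlink`, the EULA excerpt, and the file names are all hypothetical, and the sketch uses `os.symlink` instead of shelling out. In the standard library, `os.symlink` does not require the link target to exist and raises `OSError` on failure, so a failed link can no longer pass unnoticed.

```python
import os
import tempfile


def parse_whitelist(eula_lines, extensions=(".so", ".a", ".h", ".bc")):
    """Collect extension-less names listed between the two attachment headings."""
    names = set()
    copy = False
    for line in eula_lines:
        stripped = line.strip()
        if stripped == "2.6. Attachment A":
            copy = True
        elif stripped == "2.7. Attachment B":
            copy = False
        elif copy:
            for word in stripped.split():
                if any(ext in word for ext in extensions):
                    names.add(word.split(".")[0])
    return names


def replace_with_symlink(source, target):
    """Remove source and point a symlink at target; OSError propagates on failure."""
    os.remove(source)
    os.symlink(target, source)  # a dangling target is accepted
    if not os.path.islink(source):
        raise RuntimeError("failed to create symlink %s" % source)


# Hypothetical EULA excerpt and file names, for illustration only
eula = ["2.6. Attachment A\n", "libdemo.so, demo.h\n", "2.7. Attachment B\n", "other.so\n"]
whitelist = parse_whitelist(eula)

with tempfile.TemporaryDirectory() as tmpdir:
    kept = os.path.join(tmpdir, "libdemo.so.1")
    dropped = os.path.join(tmpdir, "libprivate.so")
    for path in (kept, dropped):
        open(path, "w").close()
    for filename in sorted(os.listdir(tmpdir)):
        path = os.path.join(tmpdir, filename)
        # only regular files whose base name is not whitelisted are replaced
        if not os.path.islink(path) and filename.split(".")[0] not in whitelist:
            replace_with_symlink(path, os.path.join("/nonexistent/host_injections", filename))
    kept_is_link = os.path.islink(kept)
    dropped_is_link = os.path.islink(dropped)
    dropped_target = os.readlink(dropped)
```

Letting the exception propagate stops the hook at the first link that cannot be created, rather than leaving a partially stripped installation behind.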



def inject_gpu_property(ec):
    ec_dict = ec.asdict()
    # Check if CUDA is in the dependencies, if so add the GPU Lmod tag
    if (
        "CUDA" in [dep[0] for dep in iter(ec_dict["dependencies"])]
        or ec_dict["toolchain"]["name"] in CUDA_ENABLED_TOOLCHAINS
    ):
        ec.log.info("[parse hook] Injecting gpu as Lmod arch property and envvar with CUDA version")
        key = "modluafooter"
        value = 'add_property("arch","gpu")'
Member Author

This property allows us to protect the loading of any GPU package via an Lmod hook (which can be overridden): unless the compat libraries are installed you can't load GPU modules

        cuda_version = 0
        # iterate over a copy, since matching entries are removed from the list inside the loop
        for dep in list(ec_dict["dependencies"]):
            # Make CUDA a build dependency only (rpathing saves us from link errors)
Member Author

This approach saves us from explicitly loading CUDA to run a CUDA dependent package. This allows us to write an Lmod hook that protects loading the CUDA module unless certain criteria are met (i.e., that the symlinks are unbroken).

            if "CUDA" in dep[0]:
                cuda_version = dep[1]
                ec_dict["dependencies"].remove(dep)
                if dep not in ec_dict["builddependencies"]:
                    ec_dict["builddependencies"].append(dep)
        value = "\n".join([value, 'setenv("EESSICUDAVERSION","%s")' % cuda_version])
        if key in ec_dict:
            if value not in ec_dict[key]:
                ec[key] = "\n".join([ec_dict[key], value])
        else:
            ec[key] = value
    return ec
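
The two ideas in this function, demoting CUDA to a build dependency and appending to `modluafooter` only once, can be tried on plain Python data. This is a sketch under assumptions rather than the PR's implementation: `split_cuda_dependency` and `merge_footer` are hypothetical helpers, dependencies are modelled as simple tuples, and the name is matched exactly (`dep[0] == "CUDA"`) where the hook uses a substring test.

```python
def split_cuda_dependency(dependencies, builddependencies):
    """Return new lists with CUDA moved to build dependencies, plus its version."""
    cuda_version = None
    runtime, build = [], list(builddependencies)
    for dep in dependencies:  # the input list is never mutated while iterating
        if dep[0] == "CUDA":
            cuda_version = dep[1]
            if dep not in build:
                build.append(dep)
        else:
            runtime.append(dep)
    return runtime, build, cuda_version


def merge_footer(existing, addition):
    """Append addition to an existing footer string unless it is already present."""
    if not existing:
        return addition
    if addition in existing:
        return existing
    return "\n".join([existing, addition])


# Hypothetical dependency tuples, for illustration only
deps = [("zlib", "1.2.11"), ("CUDA", "11.3.1")]
runtime, build, version = split_cuda_dependency(deps, [])
footer = merge_footer("", 'add_property("arch","gpu")')
footer = merge_footer(footer, 'setenv("EESSICUDAVERSION","%s")' % version)
footer_again = merge_footer(footer, 'add_property("arch","gpu")')
```

Building new lists instead of removing entries in place sidesteps the skipped-element problem that arises when a list is modified during iteration.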