Merge pull request #117 from casparvl/interactive_building_for_debugging

Interactive building for debugging
EESSI · Nov 3, 2023 · cfc937e · cfc937e
2 parents 8ae4d33 + e497843
commit cfc937e
Showing 9 changed files with 364 additions and 81 deletions.
diff --git a/docs/bot.md b/docs/bot.md
@@ -6,7 +6,7 @@ Building, testing, and deploying software is done by one or more *bot instances*
 The EESSI build-test-deploy bot :robot: is implemented as a [GitHub App](https://docs.github.com/en/apps/overview)
 in the [`eessi-bot-software-layer` repository](https://github.com/EESSI/eessi-bot-software-layer).
 
-It operates in the context of [pull requests](software_layer/adding_software.md#software_layer_pull_request) to
+It operates in the context of [pull requests](contributing_sw/opening_pr.md#software_layer_pull_request) to
 the [`compatibility-layer` repository](https://github.com/EESSI/compatibility-layer) or the
 [`software-layer` repository](https://github.com/EESSI/software-layer),
 and follows the instructions supplied by humans,
@@ -61,7 +61,7 @@ to trigger building of software, and to deploy software installations in to the
 ## Building { #building }
 
 To instruct the bot :robot: to build software, one or more `build` instructions
-should be issued by posting a comment in the pull request (see also [here](software_layer/adding_software.md#bot_build)).
+should be issued by posting a comment in the pull request (see also [here](contributing_sw/building_software.md#bot_build)).
 
 The most basic build instruction that can be sent to the bot is:
 

diff --git a/docs/contributing_sw/building_software.md b/docs/contributing_sw/building_software.md
@@ -0,0 +1,76 @@
+# Building software (maintainers)
+
+### Instructing the bot to build :hammer: { #bot_build }
+
+Once the pull request is open, you can instruct the [bot :robot:](../bot.md) to build the software by posting a comment.
+
+For more information, see the [building section in the bot documentation](../bot.md#building).
+
+!!! warning
+    Permission to trigger building of software must be granted to your GitHub account first!
+
+    See [bot permissions](../bot.md#permissions) for more information.
+
+#### Guidelines
+
+* It may be wise to let the bot perform a test build first, rather than letting it build for a wide range
+  of CPU targets.
+
+* If one of the builds failed, you can let the bot retry that specific build.
+
+* Make sure that the software has been built correctly for all [CPU targets](../software_layer/cpu_targets.md) before you deploy!
+
+#### Checking the builds :mag:
+
+If all goes well, you should see `SUCCESS` :grin: for each build, along with a button :arrow_down_small:
+to get more information about the checks that were performed, and metadata information on the resulting
+artefact :package:.
+
+!!! note
+    **Make sure the result is what you expect it to be for all builds before you deploy!**
+
+#### Failing builds :no_entry:
+
+!!! warning
+    The bot will currently not give you any information on how or why a build is failing.
+
+    Ask for help in the `#software-layer` channel of the EESSI Slack if needed!
+
+### Instructing the bot to deploy :rocket:
+
+To make the [bot :robot:](../bot.md) deploy the successfully built software, you should
+issue the corresponding instruction to the bot.
+
+For more information, see the [deploying section in the bot documentation](../bot.md#deploying).
+
+!!! warning
+    Permission to trigger deployment of software installations must be granted to your GitHub account first!
+
+    See [bot permissions](../bot.md#permissions) for more information.
+
+### Merging the pull request
+
+You should be able to verify in the pull request that the ingestion has been done,
+since the CI should fail :x: initially to indicate that some software installations listed in
+your modified easystack are missing.
+
+Once the ingestion has been done, simply re-triggering the CI workflow should be sufficient to make it pass
+:white_check_mark:, and then the pull request can be merged.
+
+!!! note
+    This assumes that the easystack file being modified is considered by the CI workflow file
+    (`.github/workflows/test_eessi.yml`) that checks for missing installations, in the correct branch (for example
+    `2023.06`) of the [software-layer](https://github.com/EESSI/software-layer).
+
+    If that's not the case yet, update this workflow in your pull request as well to add the missing easystack file!
+
+!!! warning
+    You need permissions to re-trigger CI workflows and merge pull requests
+    in the [software-layer](https://github.com/EESSI/software-layer) repository.
+
+    Ask for help in the `#software-layer` channel of the EESSI Slack if needed!
+
+### Getting help
+
+If you have any questions, or if you need help with something, don't hesitate to contact us via
+the `#software-layer` channel of the EESSI Slack.
diff --git a/docs/contributing_sw/contribution_policy.md b/docs/contributing_sw/contribution_policy.md
@@ -0,0 +1,3 @@
+# Contribution policy
+
+(coming soon)
diff --git a/docs/contributing_sw/debugging_failed_builds.md b/docs/contributing_sw/debugging_failed_builds.md
@@ -0,0 +1,226 @@
+# Debugging failed builds (contributors/maintainers)
+
+Unfortunately, software does not always build successfully. Since EESSI targets novel CPU architectures as well, build failures on such platforms are quite common, as the software and/or the software build systems have not always been adjusted to support these architectures yet.
+
+In EESSI, all software packages are built by a bot. This is great for builds that complete successfully as we can build many software packages for a wide range of hardware with little human intervention. However, it does mean that you, as a contributor, cannot easily access the build directory and build logs to figure out build issues.
+
+This page describes how you can interactively reproduce failed builds, so that you can more easily debug the issue.
+
+Throughout this page, we will use [this PR](https://github.com/EESSI/software-layer/pull/360) as an example. It intends to add LAMMPS to EESSI. Among other issues, it failed while [building Plumed](https://github.com/EESSI/software-layer/pull/360#issuecomment-1765913105).
+
+## Prerequisites
+You will need to have:
+
+- Access to a machine with the hardware for which the build that you want to debug failed. 
+- On that machine, meet the requirements for running the EESSI container, as described on [this page](../getting_access/eessi_container.md#prerequisites).
+
+## Preparing the environment
+A number of steps are needed to create the same environment in which the bot builds:
+
+- Fetching the feature branch from which you want to replicate a build.
+- Starting a shell in the EESSI container.
+- Starting the Gentoo Prefix environment.
+- Starting the EESSI software environment.
+- Configuring EasyBuild.
+
+### Fetching the feature branch
+Looking at [the example PR](https://github.com/EESSI/software-layer/pull/360), we see the PR is created from [this fork](https://github.com/laraPPr/software-layer/). First, we clone the fork, then check out the feature branch (`LAMMPS_23Jun2022`):
+```
+git clone https://github.com/laraPPr/software-layer/
+cd software-layer
+git checkout LAMMPS_23Jun2022
+```
+Alternatively, if you already have a clone of the `software-layer` repository, you can add the fork as a new remote:
+```
+cd software-layer
+git remote add laraPPr https://github.com/laraPPr/software-layer/
+git fetch laraPPr
+git checkout LAMMPS_23Jun2022
+```
+
+### Starting a shell in the EESSI container
+Simply run the EESSI container (`eessi_container.sh`), which should be in the root of the `software-layer` repository:
+```
+./eessi_container.sh --access rw
+```
+!!! Note
+    You may have to press enter to clearly see the prompt as some messages
+    beginning with `CernVM-FS: ` have been printed after the first prompt
+    `Apptainer> ` was shown.
+
+If you want to debug an issue for which a lot of dependencies need to be built first, you may want to start the container with the `--save DIR/TGZ` flag (check `./eessi_container.sh --help`). This saves the temporary directory (which we will use as working and installation directory later in these instructions) so that you can resume later with the same temporary directory. For example:
+
+```
+./eessi_container.sh --access rw --save ${HOME}/pr370
+```
+The tarball will be saved when you exit the container. Note that the first `exit` command only makes you exit the Gentoo Prefix environment. The second one takes you out of the container and prints where the tarball is stored:
+```
+[EESSI pilot 2023.06] $ exit
+logout
+Leaving Gentoo Prefix with exit status 1
+Apptainer> exit
+exit
+Saved contents of tmp directory '/tmp/eessi-debug.VgLf1v9gf0' to tarball '${HOME}/pr370/EESSI-pilot-1698056784.tgz' (to resume session add '--resume ${HOME}/pr370/EESSI-pilot-1698056784.tgz')
+```
+
+Note that the tarballs can be quite sizeable, so make sure to pick a filesystem where you have a large enough quota.
+
+
+Next time you want to continue investigating this issue, you can start the container with `--resume DIR/TGZ` and continue where you left off, having all dependencies already built and available.
+```
+./eessi_container.sh --access rw --resume ${HOME}/pr370/EESSI-pilot-1698056784.tgz
+```
+
+For a detailed description on using the script `eessi_container.sh`, see [here](../getting_access/eessi_container.md).
+
+### Start the Gentoo Prefix environment
+The next step is to start the Gentoo Prefix environment. 
+
+Before we start, check the current values of `${EESSI_CVMFS_REPO}` and `${EESSI_PILOT_VERSION}` so that you can reset them later:
+```
+echo ${EESSI_CVMFS_REPO}
+echo ${EESSI_PILOT_VERSION}
+```
+
+Then, we set `EESSI_OS_TYPE` and `EESSI_CPU_FAMILY` and run the `startprefix` command to start the Gentoo Prefix environment:
+```
+export EESSI_OS_TYPE=linux  # We only support Linux for now
+export EESSI_CPU_FAMILY=$(uname -m)
+${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/startprefix
+```
+
+Now, reset `${EESSI_CVMFS_REPO}` and `${EESSI_PILOT_VERSION}` in your prefix environment to their initial values (printed by the echo statements above):
+```
+export EESSI_CVMFS_REPO=...
+export EESSI_PILOT_VERSION=...
+```
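+
+One way to avoid copying the values by hand is sketched below. It assumes that `${HOME}` is still set and readable inside the Gentoo Prefix shell, and the file name `eessi_saved_env.sh` is made up for this example; adapt both if they do not hold on your system.
+```shell
+# Run this BEFORE startprefix: store both values in a small file.
+# (The file name is a hypothetical choice, not something EESSI defines.)
+{
+    echo "export EESSI_CVMFS_REPO='${EESSI_CVMFS_REPO}'"
+    echo "export EESSI_PILOT_VERSION='${EESSI_PILOT_VERSION}'"
+} > "${HOME}/eessi_saved_env.sh"
+
+# Run this AFTER startprefix has given you the new prompt: restore both values.
+. "${HOME}/eessi_saved_env.sh"
+echo "${EESSI_CVMFS_REPO} ${EESSI_PILOT_VERSION}"
+```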
+
+!!! Note
+    By activating the Gentoo Prefix environment, the system tools (e.g. `ls`) you would normally use are now provided by Gentoo Prefix, instead of the container OS. E.g. running `which ls` after starting the prefix environment as above will return `/cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/x86_64/bin/ls`. This makes the builds completely independent from the container OS.
+
+### Starting the EESSI software environment
+!!! Note
+    If you want to replicate a build with `generic` optimization (i.e. in `$EESSI_CVMFS_REPO/versions/${EESSI_PILOT_VERSION}/software/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/generic`) you will need to set `export EESSI_SOFTWARE_SUBDIR_OVERRIDE=${EESSI_CPU_FAMILY}/generic` before starting the EESSI environment.
+
+To activate the software environment, run
+```
+source ${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/init/bash
+```
+
+!!! Note
+    If you get an error `bash: /versions//init/bash: No such file or directory`, you forgot to reset the `${EESSI_CVMFS_REPO}` and `${EESSI_PILOT_VERSION}` environment variables at the end of the previous step.
+
+!!! Note
+    If you want to build with generic optimization, you should run `export EESSI_CPU_FAMILY=$(uname -m) && export EESSI_SOFTWARE_SUBDIR_OVERRIDE=${EESSI_CPU_FAMILY}/generic` before sourcing.
+
+
+For more info on starting the EESSI software environment, see [here](../using_eessi/setting_up_environment.md).
+
+### Configure EasyBuild
+It is important that we configure EasyBuild in the same way as the bot uses it, with one small exception: our working directory will be different. Typically, that doesn't matter, but it's good to be aware of this one difference, in case you fail to replicate the build failure.
+
+In this example, we create a unique temporary directory inside `/tmp` to serve as our workdir. Then, we source the `configure_easybuild` script, which configures EasyBuild by setting environment variables.
+
+```
+export WORKDIR=$(mktemp --directory --tmpdir=/tmp  -t eessi-debug.XXXXXXXXXX)
+source configure_easybuild
+```
+Among other things, the `configure_easybuild` script sets the install path for EasyBuild to the correct installation directory (`${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/software/${EESSI_OS_TYPE}/${EESSI_SOFTWARE_SUBDIR}`). This is the exact same path the bot uses to build. It relies on a writeable overlay filesystem in the container to write to a path in `/cvmfs` (which is normally read-only), which is identical to what the bot does.
+
+!!! Note
+    If you started the container using --resume, you may want WORKDIR to point to the workdir you created previously (instead of creating a new, temporary directory with `mktemp`).
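+
+A minimal sketch of that choice is shown below. `PREVIOUS_WORKDIR` is a hypothetical variable that you would set yourself to the directory from the earlier session; it is not defined by any EESSI script.
+```shell
+# Reuse the workdir of a resumed session if it is given and still exists,
+# otherwise create a fresh temporary directory under /tmp.
+if [ -n "${PREVIOUS_WORKDIR:-}" ] && [ -d "${PREVIOUS_WORKDIR}" ]; then
+    export WORKDIR="${PREVIOUS_WORKDIR}"
+else
+    export WORKDIR="$(mktemp -d /tmp/eessi-debug.XXXXXXXXXX)"
+fi
+echo "Using WORKDIR=${WORKDIR}"
+```
+Set `WORKDIR` this way before sourcing `configure_easybuild`, in the same order as in the step above.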
+
+!!! Note
+    If you want to replicate a build with `generic` optimization (i.e. in `$EESSI_CVMFS_REPO/versions/${EESSI_PILOT_VERSION}/software/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/generic`) you will need to set `export EASYBUILD_OPTARCH=GENERIC` after sourcing `configure_easybuild`.
+
+
+Next, we need to determine the correct version of EasyBuild to load. Since [the example PR](https://github.com/EESSI/software-layer/pull/360) changes the file `eessi-2023.06-eb-4.8.1-2021b.yml`, this tells us the bot was using version `4.8.1` of EasyBuild to build this. Thus, we load that version of the EasyBuild module and check if everything was configured correctly:
+```
+module load EasyBuild/4.8.1
+eb --show-config
+```
+You should get something similar to
+
+```
+#
+# Current EasyBuild configuration
+# (C: command line argument, D: default value, E: environment variable, F: configuration file)
+#
+buildpath            (E) = /tmp/easybuild/easybuild/build
+containerpath        (E) = /tmp/easybuild/easybuild/containers
+debug                (E) = True
+experimental         (E) = True
+filter-deps          (E) = Autoconf, Automake, Autotools, binutils, bzip2, DBus, flex, gettext, gperf, help2man, intltool, libreadline, libtool, Lua, M4, makeinfo, ncurses, util-linux, XZ, zlib, Yasm
+filter-env-vars      (E) = LD_LIBRARY_PATH
+hooks                (E) = ${HOME}/software-layer/eb_hooks.py
+ignore-osdeps        (E) = True
+installpath          (E) = /tmp/easybuild/software/linux/aarch64/neoverse_n1
+module-extensions    (E) = True
+packagepath          (E) = /tmp/easybuild/easybuild/packages
+prefix               (E) = /tmp/easybuild/easybuild
+read-only-installdir (E) = True
+repositorypath       (E) = /tmp/easybuild/easybuild/ebfiles_repo
+robot-paths          (D) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/EasyBuild/4.8.1/easybuild/easyconfigs
+rpath                (E) = True
+sourcepath           (E) = /tmp/easybuild/easybuild/sources:
+sysroot              (E) = /cvmfs/pilot.eessi-hpc.org/versions/2023.06/compat/linux/aarch64
+trace                (E) = True
+zip-logs             (E) = bzip2
+```
+
+## Building the software
+When the bot builds software, it loops over all EasyStack files that have been changed, and builds them using EasyBuild. However, a single PR may add multiple items to a single EasyStack file, and the issue you are trying to debug is probably in _one_ of them. Getting EasyBuild to build the full EasyStack file will create the most similar situation to what the bot does. However, you _may_ just want to build the individual software that has changed. Below, we describe both approaches.
+
+### Building everything in the EasyStack file
+In our [example PR](https://github.com/EESSI/software-layer/pull/360), the EasyStack file that was changed was `eessi-2023.06-eb-4.8.1-2021b.yml`. To build this, we run (in the directory that contains the checkout of this feature branch):
+```
+eb --easystack eessi-2023.06-eb-4.8.1-2021b.yml --robot
+```
+After some time, this build fails while trying to build `Plumed`, and we can access the build log to look for clues on why it failed.
+
+### Building an individual package
+In our [example PR](https://github.com/EESSI/software-layer/pull/360), the individual package that was added to `eessi-2023.06-eb-4.8.1-2021b.yml` was `LAMMPS-23Jun2022-foss-2021b-kokkos.eb`. We'll also have to (re)use any options that are listed in the EasyStack file for `LAMMPS-23Jun2022-foss-2021b-kokkos.eb`, in this case the option `--from-pr 19000`. Thus, to build, we run:
+```
+eb LAMMPS-23Jun2022-foss-2021b-kokkos.eb --robot --from-pr 19000
+```
+After some time, this build fails while trying to build `Plumed`, and we can access the build log to look for clues on why it failed.
+
+!!! Note
+    While this might be faster than the EasyStack-based approach, this is _not_ how the bot builds. So while it _may_ reproduce the failure the bot encounters, it may not reproduce the bug _at all_ (no failure) or run into _different_ bugs. If you want to be sure, use the EasyStack-based approach.
+
+## Known causes of issues in EESSI
+
+### The custom system prefix of the compatibility layer
+Some installations might expect the system root (sysroot, for short) to be in `/`. However, in case of EESSI, we are building against the OS in the [compatibility layer](../compatibility_layer.md). Thus, our sysroot is something like `${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}`. This _can_ cause issues if installation procedures _assume_ the sysroot is in `/`.
+
+One example of a sysroot [issue](https://github.com/EESSI/software-layer/pull/370#issuecomment-1774744151) was in installing `wget`. The EasyConfig for `wget` defined
+```
+# make sure pkg-config picks up system packages (OpenSSL & co)
+preconfigopts = "export PKG_CONFIG_PATH=/usr/lib64/pkgconfig:/usr/lib/pkgconfig:/usr/lib/x86_64-linux-gnu/pkgconfig && "
+configopts = '--with-ssl=openssl '
+```
+This will not work in EESSI, since the OpenSSL should be picked up from the compatibility layer. This was fixed by changing the EasyConfig to read
+```
+preconfigopts = "export PKG_CONFIG_PATH=%(sysroot)s/usr/lib64/pkgconfig:%(sysroot)s/usr/lib/pkgconfig:%(sysroot)s/usr/lib/x86_64-linux-gnu/pkgconfig && "
+configopts = '--with-ssl=openssl '
+```
+The `%(sysroot)s` is a template value which EasyBuild will resolve to the value that has been configured in EasyBuild for `sysroot` (it is one of the fields printed by `eb --show-config` if a non-standard sysroot is configured).
+
+If you encounter issues where the installation cannot find something that is _normally_ provided by the OS (i.e. _not_ one of the dependencies in your module environment), you may need to resort to a similar approach.
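+
+To see what the compatibility layer actually provides, a quick check such as the following sketch can help. It only assumes the environment variables set earlier on this page; the two directory names are taken from the `wget` example above, and other packages may need different ones.
+```shell
+# Compose the sysroot path from the variables set earlier on this page.
+sysroot="${EESSI_CVMFS_REPO}/versions/${EESSI_PILOT_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}"
+
+# Report which pkg-config directories exist below the sysroot.
+for dir in usr/lib64/pkgconfig usr/lib/pkgconfig; do
+    if [ -d "${sysroot}/${dir}" ]; then
+        echo "found:   ${sysroot}/${dir}"
+    else
+        echo "missing: ${sysroot}/${dir}"
+    fi
+done
+```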
+
+### The writeable overlay
+The writeable overlay in the container is known to be a bit slow sometimes. Thus, we have seen tests failing because they exceed some timeout (e.g. [this issue](https://github.com/EESSI/software-layer/pull/332#issuecomment-1775374260)).
+
+To investigate if the writeable overlay is somehow the issue, you can make sure the installation gets done somewhere else, e.g. in the temporary directory in `/tmp` that you created as workdir. To do this, set
+
+```
+export EASYBUILD_INSTALLPATH=${WORKDIR}
+```
+
+_after_ the step in which you have sourced the `configure_easybuild` script. Note that in order to find (with `module av`) any modules that get installed here, you will need to add this path to the `MODULEPATH`:
+
+```
+module use ${EASYBUILD_INSTALLPATH}/modules/all
+```
+
+Then, retry building the software (as described above). If the build now succeeds, you know that indeed the writeable overlay caused the issue. We _have_ to build in this writeable overlay when we do real deployments. Thus, if you hit such a timeout, try to see if you can (temporarily) modify the timeout value in the test so that it passes.