- Introduction
- Build Instructions for Different Backends
- Build Instructions for Cori
- Validation
- Formatting
This repository contains the source code for the standalone version of the ATLAS experiment's Liquid Argon Calorimeter parametrized simulation. The original code was written in C++ for CPUs as part of the ATLAS simulation software. A minimal form was extracted from the ATLAS repository to enable rapid development in a standalone environment without the multitude of ATLAS dependencies. This code was then rewritten to run on GPUs using CUDA, as reported here.
In order to study various portability layers to enable execution on different GPU architectures, the code has also been ported to Kokkos, HIP, SYCL, alpaka and std::par. Build instructions for these various technologies are listed below.
FastCaloSim has the following dependencies:
- cmake (3.18 or higher)
- A C++ compiler compatible with C++17; recent versions of gcc, icpx and hipcc are recommended
- ROOT. A recent version is recommended, with newer versions of the C++ compiler necessitating newer versions of ROOT
Build instructions for cori.nersc.gov are shown below as an example. These should be easily replicable on any modern system with the appropriate backends and hardware installed.
The CUDA, Kokkos, alpaka and std::par versions can all be built from the same branch of the repository (`main`). The SYCL version is in the `sycl` branch, and the HIP version is in the `hip` branch. The following build instructions assume that the repository has been cloned into a directory named `src` and that the input data is in `$FCS_INPUTS`.
Two different versions of the code have been developed. One simulates particle interactions one at a time, the other groups a number of particles together before offloading the work to the GPU to increase the GPU's workload. The latter is referred to as the "group simulation".
After the project has been built, source the `setup.sh` script in the `build` directory. To see all available options of the executable, run
> runTFCSSimulation -h
A normal run will resemble
> runTFCSSimulation --earlyReturn --energy 65536
First set up `cmake` and ROOT. If ROOT is installed in `$ROOT_PATH`, ensure that your `$LD_LIBRARY_PATH` includes `$ROOT_PATH/lib` and that `$CMAKE_PREFIX_PATH` includes `$ROOT_PATH`.
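As a sketch, assuming ROOT is installed under `/opt/root` (a placeholder path; substitute your actual install prefix), the environment can be prepared like this:

```shell
# Placeholder ROOT install prefix; adjust to your system
ROOT_PATH=/opt/root

# Make the ROOT libraries visible to the dynamic loader and to cmake
export LD_LIBRARY_PATH=$ROOT_PATH/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ROOT_PATH:$CMAKE_PREFIX_PATH
```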
The simulation requires input data and parametrization files. The base directory of these files can be specified either via the runtime parameter `--dataDir=DIR` or by setting the environment variable `FCS_DATAPATH`.
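For example (the path below is a placeholder), the two equivalent ways are:

```shell
# Option 1: pass the base directory at run time
#   runTFCSSimulation --dataDir=/path/to/FastCaloSimInputs
# Option 2: set it once in the environment
export FCS_DATAPATH=/path/to/FastCaloSimInputs
```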
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
-DENABLE_GPU=off
Check out the `main` branch. For group simulation use branch `group_sim_combined`.
Build the project for an NVIDIA A100:
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
-DENABLE_GPU=on -DCMAKE_CUDA_ARCHITECTURES=80
Kokkos must be built with `-DBUILD_SHARED_LIBS=On`. For the CUDA backend, also use `-DKokkos_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE=Off` and `-DKokkos_ENABLE_CUDA_LAMBDA=On`.
Check out the `main` branch. For group simulation use branch `group_sim_combined`.
Build the project with
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
-DCMAKE_CXX_COMPILER=nvcc_wrapper \
-DENABLE_GPU=on -DUSE_KOKKOS=ON
Other hardware backend architectures are also supported. FastCaloSim has been tested with the following Kokkos architectures:
- CUDA
- HIP
- SYCL
- pThreads
- OpenMP
- Serial
For each of these, you will need to adjust the value of the cmake option `-DCMAKE_CXX_COMPILER=` or the environment variable `$CXX` accordingly.
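As a sketch, the conventional compiler choice for each backend (these are the usual defaults, not tested configurations) can be set via the environment before running cmake:

```shell
# CUDA backend:   export CXX=nvcc_wrapper   (shipped with Kokkos)
# HIP backend:    export CXX=hipcc
# SYCL backend:   export CXX=icpx
# OpenMP / pThreads / Serial backends: a host compiler such as g++ or clang++
export CXX=hipcc   # example: configuring for the HIP backend
```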
In order to run with SYCL, the `sycl` branch of the repository should be used. If the GPU does not support double-precision types, as is the case for the Intel A770 GPU, use the `sycl_A770` branch. It is recommended that ROOT be built with the same compiler that is used to build FastCaloSim, be it icpx or clang.
Check out the `sycl` branch. Build the project with
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
-DENABLE_SYCL=ON -DSYCL_TARGET_GPU=ON
SYCL has been tested using the icpx (oneAPI) compiler on Intel GPUs, and llvm on NVIDIA and AMD GPUs.
Check out the `main` branch. For group simulation use branch `group_sim_combined`.
Build the project with
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
-DCMAKE_CXX_COMPILER=$PWD/../src/scripts/nvc++_p \
-DENABLE_GPU=on -DUSE_STDPAR=ON -DSTDPAR_TARGET=gpu -DCMAKE_CUDA_ARCHITECTURES=80
Use the cmake flag `-DUSE_STDPAR=On`.
In order to compile for std::par, `nvc++` from the NVIDIA nvhpc package must be chosen as the C++ compiler. However, ROOT still cannot be built with nvc++, so part of FastCaloSim must be built with g++. Also, nvc++ is not well supported in cmake, and a number of compiler flags must be removed from the command line for it to work. A wrapper script is provided in `scripts/nvc++_p` which chooses the correct compiler for the various parts of FastCaloSim and filters out the problematic compiler flags for nvc++. Either set the `CXX` environment variable to point to this script, or set it explicitly during cmake configuration with `-DCMAKE_CXX_COMPILER=$PWD/../src/scripts/nvc++_p`.
You may need to edit the script to pick up the correct localrc configuration file for nvc++; one can be generated with `makelocalrc` from the nvhpc package. To see exactly what the wrapper script is doing, set the environment variable `NVCPP_VERBOSE=1`.
There are three backends for std::par: gpu, multicore, and serial cpu. These are normally triggered by the nvc++ flags `-stdpar=gpu`, `-stdpar=multicore` and `-nostdpar`. Select the desired backend with the cmake flag `-DSTDPAR_TARGET=XXX`, where `XXX` is one of `gpu`, `multicore` or `cpu`. If `cpu` is selected, the random numbers must be generated on the CPU with `-DRNDGEN_CPU=On`.
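Putting this together, a configuration for the serial cpu backend might look like the following sketch, mirroring the gpu-target command above (untested; paths are the same placeholders):

```shell
cmake ../src/FastCaloSimAnalyzer \
  -DENABLE_XROOTD=Off -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off \
  -DCMAKE_CXX_COMPILER=$PWD/../src/scripts/nvc++_p \
  -DENABLE_GPU=on -DUSE_STDPAR=ON -DSTDPAR_TARGET=cpu -DRNDGEN_CPU=On
```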
When profiling with `nsys`, make sure to pick it up from the nvhpc package, and not directly from CUDA. When using the multicore backend, it is currently recommended to set the environment variable `NV_NOSWITCHERROR=1`.
Check out the `main` branch. For group simulation use branch `group_sim_combined`.
Build the project with
cmake ../src/FastCaloSimAnalyzer \
-DENABLE_GPU=on -DUSE_HIP=on -DCMAKE_CXX_COMPILER=hipcc \
-DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off
module use /work/software/modulefiles
module load rocmmod4.5.0
source /work/atif/packages/root-6.24-gcc-9.3.0/bin/thisroot.sh
export FCS_DATAPATH=/work/atif/FastCaloSimInputs/
/work/atif/packages/cmake-3.25.0-linux-x86_64/bin/cmake ../FastCaloSimAnalyzer \
-DENABLE_XROOTD=Off -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=hipcc \
-DCMAKE_CXX_STANDARD=14 -DCMAKE_CXX_EXTENSIONS=Off -DENABLE_GPU=on \
-DUSE_HIP=on -DHIP_TARGET=AMD -DCMAKE_CXX_FLAGS="-I/opt/rocm/hip/include/hip/"
For the NVIDIA backend with HIP, select `HIP_PLATFORM=nvidia`, `HIP_COMPILER=nvcc` and `HIP_RUNTIME=cuda`, and use the `hipcc_nvidia` compiler script:
export HIP_PLATFORM=nvidia
export HIP_COMPILER=nvcc
export HIP_RUNTIME=cuda
module load hip
export ROCM_PATH=/global/common/software/nersc/pe/rocm/5.5.1
export FCS_DATAPATH=/pscratch/sd/a/atif/FastCaloSimInputs
source /global/homes/a/atif/packages/root_install/bin/thisroot.sh
cmake ../FastCaloSimAnalyzer/ \
-DENABLE_XROOTD=Off -DCMAKE_C_COMPILER=gcc \
-DCMAKE_CXX_COMPILER=/global/homes/a/atif/FCS-GPU/scripts/hipcc_nvidia \
-DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_EXTENSIONS=Off -DENABLE_GPU=on \
-DUSE_HIP=on -DHIP_TARGET=NVIDIA \
-DCMAKE_LIBRARY_PATH="/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/math_libs/11.7/lib64/;/global/common/software/nersc/pe/rocm/5.5.1/hip/include/hip/" \
-DRNDGEN_CPU=on
The alpaka version of FastCaloSim has been tested with two backends: CUDA and HIP. For the former, alpaka should be configured with `-Dalpaka_ACC_GPU_CUDA_ENABLE=ON`; for the latter, use `-Dalpaka_ACC_GPU_HIP_ENABLE=ON`. For more information about the CMake arguments used by alpaka, see this documentation. To build the alpaka version of FastCaloSim, check out the `dev/stdpar` branch; for group simulation use branch `group_sim_combined`.
For the CUDA backend, build the project with:
cmake ../src/FastCaloSimAnalyzer -DENABLE_XROOTD=off -DENABLE_GPU=on \
-DCMAKE_CXX_STANDARD=17 -DCMAKE_CUDA_ARCHITECTURES=N -DUSE_ALPAKA=on -Dalpaka_ROOT=<path_to_alpaka_installation> \
-Dalpaka_ACC_GPU_CUDA_ENABLE=ON -Dalpaka_ACC_GPU_CUDA_ONLY_MODE=ON
For the HIP backend, build the project with:
export CXX=hipcc
cmake ../src/FastCaloSimAnalyzer -DENABLE_XROOTD=off -DENABLE_GPU=on \
-DCMAKE_CXX_STANDARD=17 -DUSE_ALPAKA=on -Dalpaka_ROOT=<path_to_alpaka_installation> \
-Dalpaka_ACC_GPU_HIP_ENABLE=ON -Dalpaka_ACC_GPU_HIP_ONLY_MODE=ON
To enable OpenMP target offloading with Clang 15.0.0 and above, the appropriate hardware flag, such as `--offload-arch=sm_xy` for NVIDIA or `--offload-arch=gfx90x` for AMD, should be set in `CMAKE_CXX_FLAGS` in FastCaloGpu/src/CMakeLists.txt.
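As an illustrative sketch of that edit (the architecture values are examples: sm_80 targets an NVIDIA A100, gfx90a an AMD MI250X; the exact flag set your Clang build needs may differ):

```cmake
# Hypothetical line in FastCaloGpu/src/CMakeLists.txt for an NVIDIA target
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp --offload-arch=sm_80")
# ...or, for an AMD target:
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp --offload-arch=gfx90a")
```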
Check out the `openmp` branch. For group simulation use branch `group_openmp`.
For the alpha/lambda machines at CSI, BNL, load the modules:
module use /work/software/modulefiles
module load llvm-openmp-dev
source /work/atif/packages/root-6.24-gcc-9.3.0/bin/thisroot.sh
export FCS_DATAPATH=/work/atif/FastCaloSimInputs/
export OMP_TARGET_OFFLOAD=mandatory   # one of: mandatory | disabled
/work/atif/packages/cmake-3.25.0-linux-x86_64/bin/cmake ../FastCaloSimAnalyzer \
-DENABLE_XROOTD=off -DENABLE_GPU=on -DRNDGEN_CPU=off -DENABLE_OMPGPU=on \
-DCMAKE_CXX_COMPILER=clang++ -DCMAKE_CXX_STANDARD=14 -DCMAKE_CUDA_ARCHITECTURES=86 \
-DCUDAToolkit_ROOT=/usr/local/cuda/ -DCMAKE_CXX_FLAGS="-I/usr/local/cuda/include" \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
./x86_64-ubuntu2004-clang150-opt/bin/runTFCSSimulation --energy=1048576 > cpu_1048576.log
./x86_64-ubuntu2004-clang150-opt/bin/runTFCSSimulation --energy=2097152 > cpu_2097152.log
./x86_64-ubuntu2004-clang150-opt/bin/runTFCSSimulation --energy=4194304 > cpu_4194304.log
ROOT v6.14.08 (from tag) has been built with gcc 8.3 and C++14 and installed in
/global/cfs/cdirs/atlas/leggett/root/v6-14-08_gcc83_c14
While FastCaloSim can be built with C++17 support, nvcc (for CUDA) can only handle C++14.
ROOT was built with
module load gcc/8.3.0
module load cmake/3.14.4
export CC=`which gcc`
export CXX=`which g++`
cmake -Dcxx=14 -Dcxx14=ON ../src
make -j30 VERBOSE=1 >& make.log
An example script is here (code checked out in directory `src`):
module load cuda
module load gcc/8.3.0
module load cmake/3.14.4
export ROOTDIR=/global/cfs/cdirs/atlas/leggett/root/v6-14-08_gcc83
export CMAKE_PREFIX_PATH=$ROOTDIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ROOTDIR/lib
export PATH=$PATH:$ROOTDIR/bin
export CC=`which gcc`
export CXX=`which g++`
mkdir build
cd build
cmake ../src/FastCaloSimAnalyzer -DENABLE_XROOTD=off -DENABLE_GPU=on -DCMAKE_CXX_STANDARD=14
make -j 30 VERBOSE=1 >& make.log
To disable the GPU, set the option `-DENABLE_GPU=off`.
source $BUILD_DIR/x86_64-linux15-gcc8-opt/setup.sh
module load esslurm
salloc -N 1 -t 30 -c 80 --gres=gpu:8 --exclusive -C gpu -A m1759
srun -N1 -n1 runTFCSSimulation --dataDir=/global/cfs/cdirs/atlas/leggett/data/FastCaloSimInputs
If Kokkos with a CUDA backend is already installed, ensure that the environment variable `KOKKOS_ROOT` points to the install area, and that `nvcc_wrapper` is in your `PATH`. Otherwise:
Check out Kokkos from [email protected]:kokkos/kokkos.git into `$KOKKOS_SRC`, and set the env var `KOKKOS_INSTALL_DIR` to an appropriate value. CPU and GPU architectures must be chosen; see here. For example, on a Haswell CPU and V100 GPU:
Build with
cmake ../src \
-DCMAKE_INSTALL_PREFIX=${KOKKOS_INSTALL_DIR} \
-DKokkos_ARCH_HSW=On \
-DBUILD_SHARED_LIBS=On -DKokkos_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE=Off \
-DKokkos_ENABLE_CUDA=On -DKokkos_ARCH_VOLTA70=On -DKokkos_ENABLE_CUDA_LAMBDA=On \
-DKokkos_ENABLE_SERIAL=On \
-DKokkos_CXX_STANDARD=14 \
-DCMAKE_CXX_COMPILER=${KOKKOS_SRC}/bin/nvcc_wrapper
make
make install
An example script to build and install Kokkos is here.
To build with Kokkos instead of plain nvcc, make sure you have the Kokkos environment loaded and that `$CXX` points to `nvcc_wrapper` from Kokkos. Then add `-DUSE_KOKKOS=on` to the FastCaloSim cmake configuration.
Random numbers are by default generated on the GPU if `-DENABLE_GPU=On` is set during configuration. Alternatively, to help comparison between CPU and GPU codes, the random numbers can be generated on the CPU and transferred to the GPU. This is enabled by setting the cmake configuration parameter `-DRNDGEN_CPU=On`.
The number of hit cells and counts can be displayed by setting the environment variable `FCS_DUMP_HITCOUNT=1` at run time. This will produce output like:
HitCellCount: 12 / 12 nhit: 55
HitCellCount: 48 / 60 nhit: 1220 *
HitCellCount: 76 / 136 nhit: 4944 *
HitCellCount: 6 / 142 nhit: 10
HitCellCount: 1 / 143 nhit: 1
Lines marked with an asterisk were executed on the GPU.
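For example, to enable the dump for subsequent runs:

```shell
# Enable per-event hit-cell reporting; runTFCSSimulation reads this at run time
export FCS_DUMP_HITCOUNT=1
```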
The FastCaloSimAnalyzer code has been formatted with:
find FastCaloSimAnalyzer -type f \( -iname \*.h -o -iname \*.cpp -o -iname \*.cxx -o -iname \*.cu \) | xargs clang-format-9 -i -style=file