dwarf-p-cloudsc is intended to test the CLOUDSC cloud microphysics scheme of the IFS.
This package is made available to support research collaborations and is not officially supported by ECMWF.
Contact: Michael Lange ([email protected]), Willem Deconinck ([email protected]), Balthasar Reuter ([email protected])
dwarf-p-cloudsc is distributed under the Apache Licence Version 2.0. See the LICENSE file for details.
- dwarf-P-cloudMicrophysics-IFSScheme: The original cloud scheme from IFS that is naturally suited to host-type machines and optimized on the Cray system at ECMWF.
- dwarf-cloudsc-fortran: A cleaned-up version of the CLOUDSC prototype that validates runs against platform- and language-agnostic off-line reference data via the Serialbox package. The kernel code is also slightly cleaner than the original version.
- dwarf-cloudsc-c: Standalone C version of the kernel that has been generated by ECMWF tools. This also requires the Serialbox validation mechanism as above.
- dwarf-cloudsc-gpu-kernels: GPU-enabled version of the CLOUDSC dwarf that uses OpenACC and relies on the !$acc kernels directive to offload the computational kernel.
- dwarf-cloudsc-gpu-claw: GPU-enabled and optimized version of CLOUDSC, based on a version of the kernel that was auto-generated with the CLAW tool. The kernel in this demonstrator has been further optimized with gang-level loop blocking to demonstrate potential performance gains.
- dwarf-cloudsc-gpu-scc: GPU-enabled and optimized version of CLOUDSC that utilises the native blocked IFS memory layout via a "single-column coalesced" (SCC) loop layout. Here the outer NPROMA block loop is mapped to the OpenACC "gang" level and the kernel uses an inverted loop-nest where the outer horizontal loop is mapped to OpenACC "vector" parallelism. This variant lets the CUDA runtime manage temporary arrays and needs a large PGI_ACC_CUDA_HEAPSIZE (e.g. PGI_ACC_CUDA_HEAPSIZE=8GB for 160K columns).
- dwarf-cloudsc-gpu-scc-hoist: GPU-enabled and optimized version of CLOUDSC that also uses the SCC loop layout, but promotes the inner "vector" loop to the driver and declares the kernel as sequential. The block array arguments are fully dimensioned though, and multi-dimensional temporaries have been declared explicitly at the driver level.
The code is written in Fortran 2003 and has been tested with various compilers, including:
GCC 7.3, 9.3
Cray 8.7.7
NVHPC 20.9
Intel
This application does not need MPI or BLAS libraries for performance; it only needs a compiler that understands OpenMP directives. The Fortran compiler must support at least the Fortran 2003 standard.
Example outputs can be found in the example-outputs/ directory inside the dwarf directory.
In addition, running the dwarf requires an input file, which can be found in the config-files/ directory within the dwarf folder.
The preferred method to install the CLOUDSC dwarf uses the bundle definition shipped in the main repository. For this please install the bundle via:
./cloudsc-bundle create # Checks out dependency packages
./cloudsc-bundle build [--build-type=debug|bit|release] [--arch=$PWD/arch/ecmwf/machine/compiler/version/env.sh]
The individual prototype variants of the dwarf are managed as ECBuild features and can be enabled or disabled via --cloudsc-<feature>=[ON|OFF] arguments to cloudsc-bundle build.
The use of the boost library or module is required by the Serialbox utility package for filesystem utilities. If boost is not available on a given system, Serialbox's internal "experimental filesystem" can be used via the --serialbox-experimental=ON argument, although this has proven difficult with certain compiler toolchains.
The GPU-enabled versions of the dwarf are disabled by default. To enable them, use the --with-gpu flag. For example:
./cloudsc-bundle create # Checks out dependency packages
./cloudsc-bundle build --clean --with-gpu --arch=$PWD/arch/ecmwf/volta/pgi-gpu/20.9/env.sh
Optionally, dwarf-cloudsc-fortran and the GPU versions can be built with MPI support by providing the --with-mpi flag. For example on volta:
./cloudsc-bundle create
./cloudsc-bundle build --clean --with-mpi --with-gpu --arch=$PWD/arch/ecmwf/volta/pgi-gpu/20.9/env.sh
Running with MPI parallelization distributes the columns of the working set among all ranks. The specified number of OpenMP threads is then spawned on each rank. Results are gathered from all ranks and reported for the global working set. Performance numbers are also gathered and reported per thread, per rank and total.
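The distribution of columns among ranks can be pictured with a small shell sketch. It assumes an even split in which the first ranks absorb any remainder; the helper name columns_for_rank is hypothetical, and the partitioning actually used by the dwarf may differ.

```shell
# Hypothetical helper (not part of the dwarf): columns owned by one rank,
# ASSUMING an even split where the first (total % ranks) ranks get one extra.
columns_for_rank() {
  # $1 = total columns, $2 = number of ranks, $3 = rank index (0-based)
  base=$(( $1 / $2 ))
  extra=$(( $1 % $2 ))
  if [ "$3" -lt "$extra" ]; then
    echo $(( base + 1 ))
  else
    echo "$base"
  fi
}

# Example: a working set of 163840 columns on 4 ranks
for r in 0 1 2 3; do
  echo "rank $r: $(columns_for_rank 163840 4 "$r") columns"
done
```

Under this assumption, each rank would then spawn the requested number of OpenMP threads over its own share of the columns.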
When running with multiple GPUs, each rank needs to be assigned a different device. This can be achieved using the CUDA_VISIBLE_DEVICES environment variable:
mpirun -np 2 bash -c "CUDA_VISIBLE_DEVICES=\${OMPI_COMM_WORLD_RANK} bin/dwarf-cloudsc-gpu-claw 1 163840 8192"
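In the command above, the backslash in \${OMPI_COMM_WORLD_RANK} defers expansion, so the variable is expanded by the shell started for each rank rather than by the launching shell. The same mapping can be written as a small wrapper script, sketched below. The helper device_for_rank and the device count of 2 are assumptions for illustration; OMPI_COMM_WORLD_RANK is specific to Open MPI.

```shell
# Hypothetical helper (not part of the dwarf): map an MPI rank to a device
# index, wrapping around when there are more ranks than devices.
device_for_rank() {
  # $1 = rank, $2 = number of available GPUs
  echo $(( $1 % $2 ))
}

# OMPI_COMM_WORLD_RANK is set by Open MPI for each rank; fall back to 0
# so the snippet also runs outside mpirun.
rank=${OMPI_COMM_WORLD_RANK:-0}
ngpus=2   # assumed device count, matching the 2-rank example
CUDA_VISIBLE_DEVICES=$(device_for_rank "$rank" "$ngpus")
export CUDA_VISIBLE_DEVICES
echo "rank $rank uses CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

With the wrap-around, ranks only get distinct devices as long as the number of ranks does not exceed the number of GPUs.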
As an alternative to Serialbox, dwarf-cloudsc-fortran as well as the GPU and Loki versions can use HDF5 files for input and reference data. To enable this, use the --with-hdf5 flag (note that this disables Serialbox support).
Please note: the HDF5 installation needs to have the f03 interfaces installed.
The original input is provided as raw Fortran binary in prototype1, but input and reference data can be regenerated from this variant by running
CLOUDSC_WRITE_INPUT=1 ./bin/dwarf-P-cloudMicrophysics-IFSScheme 1 100 100
CLOUDSC_WRITE_REFERENCE=1 ./bin/dwarf-P-cloudMicrophysics-IFSScheme 1 100 100
Note that this is only available via Serialbox at the moment. Updates to HDF5 input or reference data have to be done via manual conversion.
Preliminary results for CLOUDSC have been generated for A64FX CPUs on Isambard. A set of arch and toolchain files and detailed installation and run instructions are provided here.
The different prototype variants of the dwarf create different binaries that all behave similarly. The three basic arguments define (in this order):
- Number of OpenMP threads
- Size of overall working set in columns
- Block size (NPROMA) in columns
An example:
cd build
./bin/dwarf-P-cloudMicrophysics-IFSScheme 4 16384 32 # The original
./bin/dwarf-cloudsc-fortran 4 16384 32 # The cleaned-up Fortran
./bin/dwarf-cloudsc-c 4 16384 32 # The standalone C version
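The relation between the second and third argument can be sketched as follows: the working set is processed in blocks of NPROMA columns, so the number of blocks is the column count divided by the block size, rounded up. The helper num_blocks is hypothetical and assumes that a final, partially filled block is allowed.

```shell
# Hypothetical helper (not part of the dwarf): number of NPROMA blocks
# needed to cover all columns (ceiling division).
num_blocks() {
  # $1 = working set size in columns, $2 = block size (NPROMA)
  echo $(( ($1 + $2 - 1) / $2 ))
}

num_blocks 16384 32      # CPU-style example: prints 512
num_blocks 163840 8192   # GPU-style example: prints 20
```

Small block sizes give many blocks to spread over OpenMP threads, while the GPU variants favour few, large blocks.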
On the Atos system, a high-watermark run on a single socket can be performed as follows:
export OMP_NUM_THREADS=64
OMP_PLACES="{$(seq -s '},{' 0 $(($OMP_NUM_THREADS-1)) )}" srun -q np --ntasks=1 --hint=nomultithread --cpus-per-task=$OMP_NUM_THREADS ./bin/dwarf-cloudsc-fortran $OMP_NUM_THREADS 163840 32
For a build with the Intel 2021.1.1 compiler, a performance of ~74 GF is achieved.
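The OMP_PLACES expression in the srun command builds one single-core place per thread. The sketch below isolates that expression in a hypothetical helper, omp_places, so its output can be inspected for a small thread count. It assumes GNU seq, whose -s option sets the separator printed between numbers.

```shell
# Hypothetical helper: build the OMP_PLACES string for n threads,
# one place per core ID from 0 to n-1 (assumes GNU seq).
omp_places() {
  # $1 = number of threads
  echo "{$(seq -s '},{' 0 $(( $1 - 1 )))}"
}

omp_places 4   # prints {0},{1},{2},{3}
```

Each brace pair is one place, so every thread is pinned to its own core.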
Loki is a source-to-source translation tool developed in-house that allows us to create bespoke transformations for the IFS to target and experiment with emerging HPC architectures and programming models. We use the CLOUDSC dwarf as a demonstrator for targeted transformation capabilities of physics and grid-point computation kernels, including conversion to C and GPU via downstream tools like CLAW.
To use the Loki demonstrators, Loki and CLAW need to be installed as described in the Loki install instructions. Please note that the in-house "volta" machine needs some manual workarounds for this at the moment.
Once Loki and CLAW are installed and activated via source loki-activate, the following build flags enable the demonstrator build targets:
# For general use on workstations with GNU
# Please note that OpenACC needs to be disabled with GNU,
# since CLAW-generated code currently does not comply with GNU.
./cloudsc-bundle build --clean --with-loki --loki-frontend=fp --cmake="ENABLE_ACC=OFF" --arch=$PWD/arch/ecmwf/leap42/gnu/7.3.0/env.sh
# For GPU exploration on volta
./cloudsc-bundle build --clean [--with-gpu] --with-loki --loki-frontend=fp --arch=$PWD/arch/ecmwf/volta/pgi-gpu/20.9/env.sh
The following Loki modes are included in the dwarf, each with a bespoke demonstrator build:
- cloudsc-loki-idem: "Idempotence" mode that performs a full parse-unparse cycle of the kernel and performs various housekeeping transformations, including the driver-level source injection mechanism currently facilitated by Loki.
- cloudsc-loki-sca: Pure single-column mode that strips all horizontal vector loops from the kernel and introduces an outer "column-loop" at the driver level.
- cloudsc-loki-claw-cpu: Same as SCA, but also adds the necessary CLAW annotations. The resulting cloudsc.claw.F90 file is then processed by CLAW to re-insert vector loops for optimal CPU execution.
- cloudsc-loki-claw-gpu: Creates the same CLAW-ready kernel file, but triggers the GPU-specific optimizations in the CLAW compiler to insert OpenACC-offload instructions in the driver and an OpenACC parallel loop inside the kernel for each block. This needs to be run with large block sizes (e.g. NPROMA=1024-8192).
- cloudsc-loki-c: A prototype C transpilation pipeline that converts the kernel to C and calls it via iso_c_binding interfaces from the driver.
Loki currently supports three frontends to parse the Fortran source code:
- FParser (loki-frontend=fp): The preferred default; developed by STFC for PSyclone.
- OMNI frontend (loki-frontend=omni): Generates the same AST as used by CLAW.
- OFP, a Python wrapper around the ROSE frontend (loki-frontend=ofp): Supported, but buggy in some places and slow; use with care.
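To compare the frontends, one build per frontend is needed. The sketch below is a dry run that only prints the commands. The helper loki_build_cmd is hypothetical, and the --arch argument from the examples above is left out and would need to be added for a real build.

```shell
# Dry run: print (do not execute) one build command per Loki frontend.
# loki_build_cmd is a hypothetical helper, not part of the bundle.
loki_build_cmd() {
  # $1 = frontend name: fp, omni or ofp
  echo "./cloudsc-bundle build --clean --with-loki --loki-frontend=$1"
}

for fe in fp omni ofp; do
  loki_build_cmd "$fe"
done
```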
For completeness, all three frontends are tested in our CI, which means we require the .xmod module description files for utility routines in src/common for processing the CLOUDSC source files with the OMNI frontend. These are stored in the source under src/cloudsc_loki/xmod.
The original CLOUDSC kernel contains a bug that forces the use of a single precision constant for an exponential computation. This has been corrected in the Loki-specific variants, resulting in small deviations in the final results for some variables against the reference data from the original version.