This case study uses E3SM-IO to evaluate the performance of the HDF5 log-layout based VOL against methods built on other I/O libraries. E3SM-IO is an I/O benchmark suite that measures the performance of the I/O kernel of E3SM, a state-of-the-science Earth system modeling, simulation, and prediction project. The benchmark replays the I/O patterns of E3SM production runs, as captured by Scorpio, E3SM's I/O module.
- Prerequisites
  - HDF5 1.13.0, required by any HDF5 VOL connector
  - HDF5 log-layout based VOL version 1.3.0
- Clone E3SM-IO from its GitHub repository:
git clone https://github.com/Parallel-NetCDF/E3SM-IO.git
- Configure E3SM-IO with the HDF5 and log-layout based VOL features enabled.
Full configuration options are available in E3SM-IO's INSTALL.md.
cd E3SM-IO
autoreconf -i
./configure --with-hdf5=${HOME}/hdf5/1.13.0 --with-logvol=${HOME}/log_based_vol/1.3.0
- Compile and link:
make -j 64
The executable src/e3sm_io will be created.
- Run with HDF5 log-layout based VOL as the I/O method (a minimal sketch of selecting this VOL from application code is shown after this list)
mpiexec -np 16 src/e3sm_io -a hdf5_log -x log -k -o ${HOME}/e3sm_io_log datasets/f_case_866x72_16p.nc
- Run with HDF5 native VOL as the I/O method
mpiexec -np 16 src/e3sm_io -a hdf5 -x canonical -k -o ${HOME}/e3sm_io_native datasets/f_case_866x72_16p.nc
- The above two commands run the small-scale F case using the data partitioning patterns, referred to as decomposition maps in Scorpio, that were generated from a 16-process run. This decomposition map file comes with E3SM-IO, along with maps for two other cases, G and I.
- Information about the decomposition maps is available in datasets/README.md
- Details of command-line options can be found in E3SM-IO's INSTALL.md
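For reference, the following is a minimal sketch, not E3SM-IO code, of how an HDF5 application can select the log-layout based VOL programmatically; E3SM-IO already handles this when run with `-a hdf5_log`. The connector name "LOG" and the output file name are assumptions made for illustration. Alternatively, HDF5 1.12 and later can load a VOL connector without any code change through the `HDF5_VOL_CONNECTOR` and `HDF5_PLUGIN_PATH` environment variables.

```c
/* select_log_vol.c -- a minimal sketch (not part of E3SM-IO) showing one way
 * an HDF5 application can create a file through the log-layout based VOL.
 * Build with an MPI compiler wrapper and link against parallel HDF5, e.g.
 *   mpicc select_log_vol.c -o select_log_vol -lhdf5
 */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Register the log-layout based VOL connector by name. "LOG" is assumed
     * to be the name the connector registers under; its shared library must
     * be discoverable via HDF5_PLUGIN_PATH. */
    hid_t log_vol_id = H5VLregister_connector_by_name("LOG", H5P_DEFAULT);

    /* Create a file access property list that uses MPI-IO and the log VOL. */
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_vol(fapl_id, log_vol_id, NULL);

    /* All objects created in this file now go through the log VOL. */
    hid_t file_id = H5Fcreate("e3sm_io_log_demo.h5", H5F_ACC_TRUNC,
                              H5P_DEFAULT, fapl_id);

    H5Fclose(file_id);
    H5Pclose(fapl_id);
    H5VLclose(log_vol_id);
    MPI_Finalize();
    return 0;
}
```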
The E3SM-IO benchmark studies the I/O performance of three E3SM cases.
- F case - the atmospheric component
- G case - the oceanic component
- I case - the land component

The F and I cases each produce two history files, referred to as h0 and h1. The G case produces only one history file.
The I/O-related information for our evaluations is provided in the table below.
| Output file | F-H0 | F-H1 | G | I-H0 | I-H1 |
|---|---|---|---|---|---|
| Number of MPI processes | 21600 | 21600 | 9600 | 1344 | 1344 |
| Total size of data written (GiB) | 14.09 | 6.68 | 79.69 | 86.11 | 0.36 |
| Number of fixed-size variables | 15 | 15 | 11 | 18 | 10 |
| Number of record variables | 399 | 36 | 41 | 542 | 542 |
| Number of time records | 1 | 25 | 1 | 240 | 1 |
| Number of non-partitioned variables | 27 | 27 | 11 | 14 | 14 |
| Number of partitioned variables | 387 | 24 | 41 | 546 | 538 |
| Number of non-contiguous requests | 174953 | 83261 | 20888 | 9248875 | 38650 |
| Number of attributes | 1427 | 148 | 858 | 2789 | 2759 |
The performance numbers presented here compare three I/O methods used in E3SM-IO: the log-layout based VOL, PnetCDF, and ADIOS.
The PnetCDF method stores E3SM variables in files in a canonical storage layout. For each partitioned variable, each process writes multiple non-contiguous requests. PnetCDF's non-blocking APIs are used to enable request aggregation and improve performance. However, storing data in the canonical order requires inter-process communication during MPI collective I/O, which can be expensive. Because E3SM's data partitioning patterns contain large numbers of non-contiguous write requests, this communication cost can become very high. Therefore, the PnetCDF method is expected to perform slower than the log-layout based VOL, which stores data in a log layout and thus requires no inter-process communication.
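To illustrate the request aggregation described above, here is a minimal sketch of posting several non-blocking PnetCDF writes and flushing them with a single collective call. The file name, variable layout, and decomposition are made up for this example and do not come from E3SM-IO.

```c
/* pnetcdf_aggregate.c -- a minimal sketch (not E3SM-IO code) of PnetCDF's
 * non-blocking APIs: multiple non-contiguous write requests are posted first
 * and then flushed together by one collective ncmpi_wait_all() call.
 */
#include <mpi.h>
#include <pnetcdf.h>

#define NREQS 4   /* number of non-contiguous requests posted per process */

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimids[2], varid, reqs[NREQS], stats[NREQS];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    ncmpi_create(MPI_COMM_WORLD, "pnetcdf_demo.nc", NC_CLOBBER | NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
    ncmpi_def_dim(ncid, "ncol", (MPI_Offset)nprocs * NREQS * 10, &dimids[1]);
    ncmpi_def_var(ncid, "var", NC_DOUBLE, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    double buf[NREQS][10];
    for (int i = 0; i < NREQS; i++)
        for (int j = 0; j < 10; j++) buf[i][j] = rank + 0.01 * i;

    /* Post NREQS non-contiguous write requests; nothing is written yet. */
    for (int i = 0; i < NREQS; i++) {
        MPI_Offset start[2] = {0, (MPI_Offset)(i * nprocs + rank) * 10};
        MPI_Offset count[2] = {1, 10};
        ncmpi_iput_vara_double(ncid, varid, start, count, buf[i], &reqs[i]);
    }

    /* One collective call aggregates and flushes all pending requests. */
    ncmpi_wait_all(ncid, NREQS, reqs, stats);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```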
Scorpio implements an I/O option that uses the ADIOS library to write data. In Scorpio's implementation, each process stores its write data in ADIOS local variables by appending one write request after another. These local variables are only collections of data blocks, without any ADIOS metadata describing their logical locations. Instead, Scorpio stores the metadata, such as the write data's canonical location, as additional ADIOS variables, which are later used to convert the BP files into NetCDF files.
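The sketch below is only a rough illustration of this idea, not Scorpio's actual code: it writes a data block into an ADIOS2 local variable, i.e., one defined without a global shape or start, using the ADIOS2 C API (version 2.8 or later is assumed for adios2_init_mpi). Scorpio's real implementation additionally defines the metadata variables described above and follows its own naming scheme.

```c
/* adios_local_var.c -- a conceptual sketch (not Scorpio code) of writing a
 * data block into an ADIOS2 "local" variable, i.e. one defined with no
 * global shape/start, so the block carries no canonical (global) location.
 */
#include <mpi.h>
#include <adios2_c.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    adios2_adios *adios = adios2_init_mpi(MPI_COMM_WORLD);
    adios2_io    *io    = adios2_declare_io(adios, "demo");

    /* A local variable: shape and start are NULL, only the local count is
     * given; each process contributes its own block of data. */
    size_t count[1] = {100};
    adios2_variable *var = adios2_define_variable(
        io, "local_blocks", adios2_type_double, 1,
        NULL /* shape */, NULL /* start */, count,
        adios2_constant_dims_true);

    double block[100];
    for (size_t i = 0; i < 100; i++) block[i] = rank + 0.001 * i;

    adios2_engine *engine = adios2_open(io, "demo.bp", adios2_mode_write);
    /* Each adios2_put appends another data block for this variable. */
    adios2_put(engine, var, block, adios2_mode_deferred);
    adios2_close(engine);  /* flushes the deferred put */

    adios2_finalize(adios);
    MPI_Finalize();
    return 0;
}
```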
The performance chart below shows the execution times, collected in September 2021, on Cori at NERSC. All runs were on KNL nodes, with 64 MPI processes allocated per node.
Both the log-layout based VOL and the ADIOS runs enabled their subfiling feature, which creates one file per compute node. The Lustre striping configuration was set to a stripe count of 8 and a stripe size of 1 MiB.
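A striping configuration like this can be requested in more than one way; one common approach in MPI-based HDF5 applications is to pass ROMIO hints on the file access property list, as in the minimal sketch below. This is only an illustration; the actual runs may instead have set striping on the output directory with Lustre's lfs tool, and the hints take effect only when the file is newly created on a Lustre file system.

```c
/* lustre_hints.c -- a minimal sketch (not from E3SM-IO) of passing a Lustre
 * striping request as MPI-IO hints when creating an HDF5 file in parallel.
 * "striping_factor" and "striping_unit" are ROMIO hint names.
 */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");        /* stripe count = 8     */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* stripe size  = 1 MiB */

    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, info);

    hid_t file_id = H5Fcreate("striped_output.h5", H5F_ACC_TRUNC,
                              H5P_DEFAULT, fapl_id);

    H5Fclose(file_id);
    H5Pclose(fapl_id);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```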
The performance chart below shows the execution times, collected in September 2021, on Summit at OLCF. All runs allocated 84 MPI processes per node. Summit's parallel file system, IBM Spectrum Scale (GPFS), was used.