Instructions for large use case setup (#205)
* ref: start restructuring for uses cases

* add: skeleton for large use cases procedure

* FIX: typo + minor

* fix: reword introduction

* fix: reword (again!)

* FIX: (very) minor

* Fix broken links

* Add fancy flowchart

* GitHub Action: Apply external link format

* Fix yet another broken link

* Add venv directory to .gitignore

* Change structure

* Add more information about testing

* GitHub Action: Apply external link format

* fix: add subitems to small scale

* Update local testing

* GitHub Action: Apply external link format

* recommendation for out-of-source builds

* Update install instructions for Daint and Euler

* Update local testing instructions

* Fix instructions for Euler

* Run tolerance test

* Fix links

* Add instructions for activating test in a CI pipeline

* Improve instructions

* More detailed instructions for out-of-source builds

* Fix typo

* Add headers to Set up

* Update large_use_cases.md

* ref: introduce exclaim earlier

* small language fixes

* Language fixes, subtitles, minor improvements

---------

Co-authored-by: Michael Jähn <[email protected]>
Co-authored-by: mjaehn <[email protected]>
Co-authored-by: Annika Lauber <[email protected]>
Co-authored-by: AnnikaLau <[email protected]>
Co-authored-by: juckerj <[email protected]>
6 people authored Nov 19, 2024
1 parent 47a1778 commit b1b271e
Showing 8 changed files with 236 additions and 71 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
site/
venv/
2 changes: 1 addition & 1 deletion docs/alps/index.md
@@ -25,7 +25,7 @@ The following table shows the current plan for the final vClusters distribution

## Early Access

For getting access to the vCluster dedicated to testing ([Tödi](vclusters.md/#tödi){:target="_blank"}), CSCS offers [Preparatory Projects :material-open-in-new:](https://www.cscs.ch/user-lab/allocation-schemes/preparatory-projects){:target="_blank"}.
For getting access to the vCluster dedicated to testing ([Tödi](vclusters.md/#todi){:target="_blank"}), CSCS offers [Preparatory Projects :material-open-in-new:](https://www.cscs.ch/user-lab/allocation-schemes/preparatory-projects){:target="_blank"}.

## Support by CSCS

2 changes: 1 addition & 1 deletion docs/events/icon_meetings/index.md
@@ -11,7 +11,7 @@ In case you have any questions or suggestions, contact the meeting organiser,

## C2SM ICON Mailing List

As a member of the `c2sm.icon` mailing list, you will receive all relevant information around [ICON](../../models/icon.md) and invitations to the quarterly ICON meeting.
As a member of the `c2sm.icon` mailing list, you will receive all relevant information around [ICON](../../models/icon/index.md) and invitations to the quarterly ICON meeting.

If you or someone from your group is not yet a member of the `c2sm.icon` mailing list, subscribe by sending an e-mail to:
[`mailto:[email protected]?subject=SUBSCRIBE%20c2sm.icon%20firstname%20lastname`](mailto:[email protected]?subject=SUBSCRIBE%20c2sm.icon%20firstname%20lastname) (modify `firstname` and `lastname` in the subject).
3 changes: 3 additions & 0 deletions docs/models/SUMMARY.md
@@ -0,0 +1,3 @@
* [ICON](icon/)
* [CESM](cesm.md)
* [COSMO](cosmo.md)
3 changes: 3 additions & 0 deletions docs/models/icon/SUMMARY.md
@@ -0,0 +1,3 @@
* [ICON](index.md)
* [Usage](usage.md)
* [Large Use Cases](large_use_cases.md)
40 changes: 40 additions & 0 deletions docs/models/icon/index.md
@@ -0,0 +1,40 @@
# ICON
ICON (Icosahedral Nonhydrostatic Weather and Climate Model) is a model suitable for climate simulation and weather prediction on regional and global domains.
It is a joint project of [DWD :material-open-in-new:](https://www.dwd.de/DE/Home/home_node.html){:target="_blank"}, [MPI-M :material-open-in-new:](https://mpimet.mpg.de/startseite){:target="_blank"} and [KIT :material-open-in-new:](https://www.kit.edu/){:target="_blank"}.

To stay informed about what is going on in the ICON world and to get to know other ICON users, please attend our [quarterly ICON meeting](../../events/icon_meetings/index.md).

## Support status
C2SM facilitates the use of ICON on the [Piz Daint :material-open-in-new:](https://www.cscs.ch/computers/piz-daint){:target="_blank"} and [Euler :material-open-in-new:](https://scicomp.ethz.ch/wiki/Euler){:target="_blank"} computing platforms on both CPU and GPU architectures.

### Supported release
The latest release distributed by C2SM, currently [`2024.07` :material-open-in-new:](https://github.com/C2SM/icon/tree/2024.07){:target="_blank"}, is continuously being tested on both Piz Daint and Euler and receives patches when necessary.

## Mailing list
If you use ICON, please follow [these instructions](../../events/icon_meetings/index.md#c2sm-icon-mailing-list) to subscribe to our mailing list.

## Toolset
In the [Tools](../../tools/index.md) section, you will find relevant tools for working with ICON:

* [**Extpar**](../../tools/extpar.md): External parameters for the ICON grid (preprocessing)
* [**Processing Chain**](../../tools/processing_chain.md): Python workflow tool for ICON
* [**SPICE**](../../tools/spice.md): Starter package for ICON-CLM experiments
* [**icon-vis**](../../tools/icon-vis.md): Python scripts to visualise ICON data

## Projects
Learn more about ongoing projects involving ETHZ in the development of ICON:

* [EXCLAIM :material-open-in-new:](https://exclaim.ethz.ch/){:target="_blank"}
* [ICON-HAMMOZ :material-open-in-new:](https://redmine.hammoz.ethz.ch/projects/icon-hammoz){:target="_blank"}

## Documentation
ICON documentation is available at:

* [ICON Tutorial (DWD) :material-open-in-new:](https://www.dwd.de/DE/leistungen/nwv_icon_tutorial/nwv_icon_tutorial.html){:target="_blank"}
* [Getting Started with ICON :material-open-in-new:](https://www.icon-model.org/icon_model/getting_started){:target="_blank"}
* [MPI-M documentation :material-open-in-new:](https://code.mpimet.mpg.de/projects/iconpublic/wiki/Documentation){:target="_blank"}

## External Software
The following external software is useful for working with ICON data:

* [CDO :material-open-in-new:](https://code.zmaw.de/projects/cdo){:target="_blank"}
131 changes: 131 additions & 0 deletions docs/models/icon/large_use_cases.md
@@ -0,0 +1,131 @@
# Large Use Cases

[ICON :material-open-in-new:](https://www.icon-model.org/icon_model){:target="_blank"} is a complex piece of software, and even more so is [ICON-EXCLAIM :material-open-in-new:](https://github.com/C2SM/icon-exclaim){:target="_blank"}, which builds on top of it. Troubleshooting large-scale configurations can therefore be tedious, which is why we developed a procedure to build large production ICON configurations in the most robust way possible.

The overall philosophy is to build a series of setups of gradually increasing complexity, from a small-scale ICON-NWP test case up to the full production configuration. Complexity can grow along two independent dimensions, namely code (from ICON-NWP to ICON-EXCLAIM) and scale (resolution and duration). We will first tackle the code dimension and then scale up the simulation setup.

Even if this may feel like overhead when starting out, C2SM's core team and the EXCLAIM team are there to assist you on this journey, and it will pay off in the end!

## Flow Chart

```mermaid
flowchart TD
subgraph SMALL["1 - Small Scale Test Case"]
STnwp[Small Scale Test Case ICON-NWP] -.- CPU & GPU
STnwp --> MR[Merge Request for icon-nwp]
MR --> BB[BuildBot]
STnwp --> STexc[Small Scale Test Case ICON-EXCLAIM]
end
subgraph INT["2 - Intermediate Scale Test"]
IT[Intermediate Scale Test] & LST[Longer Small Scale Test]
end
subgraph FULL["3 - Full Scale Test"]
direction LR
FT[Full Scale Test]
end
SMALL ==> INT ==> FULL
```

## 1. Small Scale Test Case

The idea here is to test the code path of the final setup and identify potential issues coming from upstream source code.

### 1.1 Set up

#### Clone

First, clone [`icon-nwp` :material-open-in-new:](https://gitlab.dkrz.de/icon/icon-nwp){:target="_blank"} (if you don't have access, you need to request it from DKRZ):

```bash
git clone --recurse-submodules [email protected]:icon/icon-nwp.git
```

#### Create Test Case

Then set up an ICON test case with a low number of grid points and a few time steps (about 6) and save it under `run/exp.<my_exp>`. Existing use cases like the [Aquaplanet :material-open-in-new:](https://gitlab.dkrz.de/icon/icon-nwp/-/blob/master/run/exp.exclaim_ape_R02B04){:target="_blank"} one can serve as a template.
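
For instance, a possible starting point could look as follows (the target name `my_small_case` is a placeholder; run this from the `icon-nwp` root):

```bash
template=run/exp.exclaim_ape_R02B04      # existing experiment used as a template
target=run/exp.my_small_case             # placeholder name for your experiment
# copy the template, then edit the target to use a coarse grid and ~6 time steps
[ -f "$template" ] && cp "$template" "$target" \
    || echo "template not found: run this from the icon-nwp root"
```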

#### Add Test Case to Checksuite

Follow the step-by-step guide in [How to add experiments to a buildbot list :material-open-in-new:](https://gitlab.dkrz.de/icon/wiki/-/wikis/How-to-setup-new-test-experiments-for-buildbot#how-to-add-experiments-to-a-buildbot-list){:target="_blank"} to add your test case to the checksuite. Start with the `checksuite_modes` for the mpi and nproma tests (`'nm'`) for the machine you are testing on.

#### Compile Out-of-Source

We recommend doing out-of-source builds for CPU and GPU so that you can have two compiled versions of ICON in the same repository. To do so, simply create two folders in the ICON root folder (e.g., `nvhpc_cpu` and `nvhpc_gpu`) and copy the folders `config` and `scripts` from the root folder into each of them:

```bash
mkdir nvhpc_cpu
cd nvhpc_cpu
cp -r ../config ../scripts .
cd ..
mkdir nvhpc_gpu
cd nvhpc_gpu
cp -r ../config ../scripts .
```

Then follow the instructions in [Configure and compile :material-open-in-new:](usage.md/#configure-and-compile){:target="_blank"} to compile ICON on CPU and on GPU from within those folders.

### 1.2 Local Testing

Before adding anything to the official ICON repository, we recommend running all tests locally first, starting with CPU.

#### Test on CPU

To ensure that there are no basic issues with the namelist, we recommend testing on CPU before moving on to GPU. Create the check file and run the test locally in the folder you built the CPU version in (set `EXP=<exp_name>`):

```bash
./make_runscripts ${EXP}
./run/make_target_runscript in_script=checksuite.icon-dev/check.${EXP} in_script=exec.iconrun out_script=check.${EXP}.run EXPNAME=${EXP}
cd run
sbatch --partition debug --time 00:30:00 check.${EXP}.run
```

Check the LOG file to see whether all tests passed.
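
As a quick first pass (the log file name below is an assumption; adapt it to the actual job output on your system), you can scan the log for suspicious lines:

```bash
EXP=my_small_case                # placeholder experiment name
LOG="check.${EXP}.run.o12345"    # hypothetical log file name
# count lines mentioning errors or failures; 0 suggests a clean run,
# but always skim the log yourself as well
fails=$(grep -ciE "error|fail" "$LOG" 2>/dev/null || true)
echo "suspicious lines: ${fails:-0}"
```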

#### Test on GPU

If all tests pass on CPU, the next step is to test on GPU. Follow the same steps as for CPU and run the nproma and mpi tests. Again, check the LOG file to see whether all tests passed before proceeding to the next step.

To ensure that running on GPU gives essentially the same results as running on CPU, please follow the instructions in [Validating with probtest without buildbot references (Generating tolerances for non standard tests) :material-open-in-new:](https://gitlab.dkrz.de/icon/wiki/-/wikis/GPU-development/Validating-with-probtest-without-buildbot-references-(Generating-tolerances-for-non-standard-tests)){:target="_blank"}. If probtest validates, you can change the `checksuite_modes` to `'t'`, and everything is set for activating the test in a CI pipeline.

### 1.3 Activate Test in a CI Pipeline

If you followed the steps above in [1.2 Local testing](large_use_cases.md#12-local-testing), everything is set to activate the test in a CI pipeline. To do so, push your changes to a branch on icon-nwp and open a merge request. Then follow the instructions in [Member selection for generating probtest tolerances :material-open-in-new:](https://gitlab.dkrz.de/icon/wiki/-/wikis/GPU-development/Member-selection-for-generating-probtest-tolerances){:target="_blank"} to add tolerances and references, as well as the best members for generating them, to the CI pipeline.

### 1.4 Small Test Case with ICON-EXCLAIM

Now it is time to switch to ICON-EXCLAIM, which binds ICON-NWP with modules rewritten in GT4Py, so that we can test the code path in those as well. To that end, simply take the small-scale test case generated above and replace the ICON executable with the relevant one.

!!! note "ICON-EXCLAIM CI"

    When available, it would also make sense to integrate your setup into the ICON-EXCLAIM testing infrastructure.

## 2. Intermediate Scale Test

To test the scalability of ICON simulations under demanding conditions while minimizing queue wait times, we can design tests to stretch computational limits in memory and time. Here are some steps to extend your setup to meet these goals.

### 2.1 Increase Horizontal Resolution with Fixed Node Count

*Goal:* Test memory scaling behavior by increasing resolution without increasing node count.

*Method:*
Increase the model's horizontal grid resolution (i.e., decrease grid spacing) to improve spatial accuracy.
This will require more memory per node due to higher data density but will not increase the node count.
Monitor memory usage closely on each node, and consider using profiling tools to track memory allocation patterns and potential out-of-memory risks.

*Expected Outcome:* This test will reveal the memory thresholds on individual nodes for your current setup and may highlight areas where memory optimization or node allocation adjustments are necessary.
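
As a back-of-the-envelope illustration of why memory grows so quickly: following the standard `RnBk` cell-count convention of the icosahedral grid (`20 * n^2 * 4^k` horizontal cells), each bisection step halves the grid spacing and quadruples the number of cells, so memory per node grows roughly fourfold at a fixed node count:

```bash
# horizontal cells of an ICON RnBk grid: 20 * n^2 * 4^k
n=2                                    # root division of the R02B04 grid
cells_B04=$(( 20 * n * n * 4*4*4*4 ))  # k = 4 bisection steps
cells_B05=$(( cells_B04 * 4 ))         # one more bisection quadruples the cells
echo "R02B04: ${cells_B04} cells, R02B05: ${cells_B05} cells"
```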

!!! note "Monitor Memory Usage"

    Approaching the memory limits without exceeding them helps identify how close the current node allocation is to becoming unsustainable at higher resolutions. If available, consider using memory-aware scheduling tools to balance memory loads across nodes.

### 2.2 Run Longer Simulations (e.g., One Year vs. One Month)

*Goal:* Assess the model's stability and resource consumption over prolonged simulation periods, revealing any potential issues with computational drift or resource leaks.

*Method:* Run the small-scale test for an extended period, e.g., one year instead of one month, to test how well the model holds up over time.

*Expected Outcome:* This test catches issues like numerical drift, stability loss, or escalating memory/CPU demands that aren't noticeable in shorter simulations.
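
As a rough illustration of why longer runs stress the model differently (the time step below is an arbitrary assumption, not a recommended ICON setting), a year-long run takes roughly an order of magnitude more time steps than a month-long one, so slow drifts and small per-step leaks accumulate far beyond what a short test reveals:

```bash
dt=360                                   # assumed model time step in seconds
steps_month=$(( 30 * 24 * 3600 / dt ))   # ~one month of simulated time
steps_year=$(( 365 * 24 * 3600 / dt ))   # ~one year of simulated time
echo "month: ${steps_month} steps, year: ${steps_year} steps"
# -> month: 7200 steps, year: 87600 steps
```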

## 3. Full Scale Test

At the end of this journey, we're finally ready to launch the full scale runs and start doing science with them! :material-party-popper:
