Add restart test and support for restarts

This commit implements a way to restart simulations by saving both the state and the caches of component models, as well as the coupler fields. Given that caches are complex objects, I implemented this using JLD2 files. The challenges with JLD2 files are that:
- they are not MPI compatible,
- they are not GPU compatible.

For this reason, I have to move everything to the CPU and have each process write to its own output. This adds a restriction: only the same number of MPI processes (and the same machine) can be used for restarts. In addition, this approach requires component models to implement their own functions to restore their caches.

Something that can be improved in the future is that ClimaAtmos currently produces two checkpoints, one independently and one from ClimaCoupler. This should not be needed, but it is currently required because there is no other way to start ClimaAtmos at a different time.

The other problem here is that the MPI test occasionally hangs (as it does in ClimaAtmos).
CliMA · Feb 26, 2025 · ad2dbaa
1 parent 9839b78
commit ad2dbaa
Showing 26 changed files with 914 additions and 199 deletions.
diff --git a/.buildkite/pipeline.yml b/.buildkite/pipeline.yml
@@ -67,16 +67,6 @@ steps:
   - group: "Unit Tests"
     steps:
 
-      - label: "MPI Checkpointer unit tests"
-        key: "checkpointer_mpi_tests"
-        command: "srun julia --color=yes --project=test/ test/mpi_tests/checkpointer_mpi_tests.jl"
-        timeout_in_minutes: 20
-        env:
-          CLIMACOMMS_CONTEXT: "MPI"
-        agents:
-          slurm_ntasks: 2
-          slurm_mem: 16GB
-
       - label: "MPI Utilities unit tests"
         key: "utilities_mpi_tests"
         command: "srun julia --color=yes --project=test/ test/utilities_tests.jl"
@@ -97,6 +87,7 @@ steps:
         agents:
           slurm_ntasks: 1
           slurm_gres: "gpu:1"
+          slurm_mem: 20GB
 
   - group: "GPU: experiments/ClimaEarth/ unit tests and global bucket"
     steps:
@@ -109,6 +100,33 @@ steps:
           slurm_gres: "gpu:1"
           slurm_mem: 20GB
 
+  - group: "ClimaEarth test"
+    steps:
+      - label: "ClimaEarth test"
+        key: "restarts"
+        command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/runtests.jl"
+        agents:
+          slurm_mem: 16GB
+
+      - label: "ClimaEarth test GPU"
+        key: "gpu_restarts"
+        command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
+        env:
+          CLIMACOMMS_DEVICE: "CUDA"
+        agents:
+          slurm_mem: 24GB
+          slurm_gpus: 1
+
+      - label: "MPI restarts"
+        key: "mpi_restarts"
+        command: "srun julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
+        env:
+          CLIMACOMMS_CONTEXT: "MPI"
+        timeout_in_minutes: 120
+        agents:
+          slurm_ntasks: 2
+          slurm_mem: 24GB
+
   - group: "Integration Tests"
     steps:
       # SLABPLANET EXPERIMENTS
@@ -218,7 +236,7 @@ steps:
           CLIMACOMMS_CONTEXT: "MPI"
         agents:
           slurm_ntasks: 4
-          slurm_mem_per_cpu: 8GB
+          slurm_mem_per_cpu: 12GB
 
       # short high-res performance test
       - label: "Unthreaded AMIP FINE" # also reported by longruns with a flame graph

diff --git a/NEWS.md b/NEWS.md
@@ -6,6 +6,20 @@ ClimaCoupler.jl Release Notes
 
 ### ClimaCoupler features
 
+#### Restart simulations with JLD2 files PR[#1179](https://github.com/CliMA/ClimaCoupler.jl/pull/1179)
+
+`ClimaCoupler` can now use `JLD2` files to save the state and cache of its
+component models, allowing it to restart from saved checkpoints. Some restrictions
+apply:
+
+- The number of MPI processes has to remain the same across checkpoints
+- Restart files are generally not portable across machines
+- Adding or changing component models will probably require adding or changing code
+
+Please refer to the
+[documentation](https://clima.github.io/ClimaCoupler.jl/dev/checkpointer/) for
+more information.
+
 #### Remove extra `get_field` functions PR[#1203](https://github.com/CliMA/ClimaCoupler.jl/pull/1203)
 Removes the `get_field` functions for `air_density` for all models, which
 were unused except for the `BucketSimulation` method, which is replaced by a

diff --git a/Project.toml b/Project.toml
@@ -8,6 +8,7 @@ ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
 ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
 ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
+JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
 StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
@@ -16,9 +17,10 @@ Thermodynamics = "b60c26fb-14c3-4610-9d3e-2d17fe7ff00c"
 
 [compat]
 ClimaComms = "0.6.2"
-ClimaCore = "0.14.23"
+ClimaCore = "0.14.25"
 ClimaUtilities = "0.1.22"
 Dates = "1"
+JLD2 = "0.5.11"
 Logging = "1"
 SciMLBase = "2.11"
 StaticArrays = "1.6"

diff --git a/docs/src/checkpointer.md b/docs/src/checkpointer.md
@@ -1,12 +1,136 @@
 # Checkpointer
 
-This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups.
+## How to save and restart from checkpoints
+
+`ClimaCoupler` supports saving and reading simulation checkpoints. This is
+useful to split a long simulation into smaller, more manageable chunks.
+
+Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a
+`checkpoints` folder in the simulation output. See
+[`Utilities.setup_output_dirs`](@ref) for more information.
+
+!!! note "Known limitations"
+
+    - The number of MPI processes has to remain the same across checkpoints
+    - Restart files are generally not portable across machines
+    - Adding or changing component models will probably require adding or changing code
+
+### Saving checkpoints
+
+If you are running a model (such as AMIP), chances are that you can enable
+checkpointing just by setting a command-line argument: the `checkpoint_dt`
+option controls how frequently a checkpoint should be produced.
+
+If your model does not come with this option already, you can checkpoint the
+simulation by adding a callback that calls the
+[`Checkpointer.checkpoint_sims`](@ref) function.
+
+For example, to add a callback to checkpoint every hour of simulated time,
+assuming you have a `start_date`:
+```julia
+import Dates
+
+import ClimaCoupler: Checkpointer, TimeManager
+import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule
+
+schedule_checkpoint = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
+checkpoint_callback = TimeManager.Callback(schedule_checkpoint, Checkpointer.checkpoint_sims)
+
+# In the coupling loop:
+TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
+```
+
+### Reading checkpoints
+
+There are two ways to restart a simulation from checkpoints. By default,
+`ClimaCoupler` tries to find suitable checkpoints and uses them automatically.
+Alternatively, you can specify a directory `restart_dir` and a simulation time
+`restart_t` and restart from files saved in the given directory at the given
+time. If the model you are running supports writing checkpoints via command-line
+argument, it will probably also support reading them. In this case, the
+arguments `restart_dir` and `restart_t` identify the path of the top-level
+directory containing all the checkpoint files and the simulated time in seconds.
+
+If the model does not support directly reading a checkpoint, the `Checkpointer`
+module provides a straightforward way to add this feature.
+[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and
+a `restart_t` and overwrites the content of the coupled simulation with what is
+in the checkpoint. 
+
+## Developer notes
+
+In theory, the state of the component models should fully determine the state of
+the coupled simulation and one should be able to restart a coupled simulation
+just by using the states of the component models. Unfortunately, this is
+currently not the case in `ClimaCoupler`. The main reason for this is the
+complex interdependencies between component models and within `ClimaAtmos` which
+make the initialization step inconsistent. For example, in a coupled simulation,
+the surface albedo should be determined by the surface models and used by the
+atmospheric model for radiation transfer, but `ClimaAtmos` also tries to set the
+surface albedo (since it has to do so when run in standalone mode). In addition
+to this, `ClimaAtmos` has a large cache that has internal interdependencies that
+are hard to disentangle, and changing a field might require changing some other
+field in a different part of the cache. As a result, it is not easy for
+`ClimaCoupler` to consistently do initialization from a cold state. To conclude,
+restarting a simulation exclusively using the states of the component models is
+currently impossible.
+
+Given that restarting a simulation from the states alone is currently
+impossible, `ClimaCoupler` needs to save both the states and the caches. Let us
+review how we use `ClimaCore.InputOutput` and the `JLD2` package to accomplish this.
+
+`ClimaCore.InputOutput` provides a lossless way to save the content of certain
+`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a
+particular computing device or configuration. When running with MPI,
+`ClimaCore.InputOutput` files are also written efficiently in parallel.
+
+Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as
+`Field`s and `Space`s, but the cache in component models is more complex than
+this and contains complex objects with highly stateful quantities (e.g., C
+pointers). Because of this, model states are saved to HDF5 but caches must be
+saved to JLD2 files.
+
+`JLD2` allows us to save more complex objects without writing specific
+serialization methods for every struct. `JLD2` allows us to take a big step
+forward, but there are still several challenges that need to be solved:
+1. `JLD2` does not support CUDA natively. To work around this, we have to move
+  everything onto the CPU first. Then, when the data is read back, we have to
+  move it back to the GPU.
+2. `JLD2` does not support MPI natively. To work around this, each process writes
+  its own `jld2` checkpoint and reads it back. This introduces the constraint that
+  the number of MPI processes cannot change across restarts.
+3. Some quantities are best not saved and read (for example, anything with
+  pointers). For this, we write a recursive function that traverses the cache
+  and only restores quantities of a certain type (typically, `ClimaCore`
+  objects).
+
+Point 3 adds a significant amount of code and requires component models to
+specify how their cache has to be restored.
+
+If you are adding a component model with a cache, you have to extend the
+```
+Checkpointer.get_model_cache
+Checkpointer.restore_cache!
+```
+methods. 
+
+`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt`
+traverses the object recursively, and proper `Adapt` methods have to be defined
+for every object involved in the chain. The easiest way to do this is using the
+`Adapt.@adapt_structure` macro, which defines a recursive `adapt` method for the
+given object.
+
+Types to watch for:
+- `MPI` related objects (e.g., `MPICommsContext`)
+- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers
+  to files)
 
 ## Checkpointer API
 
 ```@docs
     ClimaCoupler.Checkpointer.get_model_prog_state
-    ClimaCoupler.Checkpointer.restart_model_state!
-    ClimaCoupler.Checkpointer.checkpoint_model_state
+    ClimaCoupler.Checkpointer.get_model_cache
+    ClimaCoupler.Checkpointer.restart!
     ClimaCoupler.Checkpointer.checkpoint_sims
+    ClimaCoupler.Checkpointer.t_start_from_checkpoint
 ```
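
The recursive restore described in point 3 of the developer notes can be sketched roughly as follows. This is a minimal illustration and not the code added by this commit: `restore_sketch!` and `FileHandle` are hypothetical names, plain `Array`s stand in for `ClimaCore` fields, and the cache is assumed to be a nested `NamedTuple`. The idea is to dispatch on the type of each entry, overwrite only the entries of a restorable type, and leave everything else (such as objects that wrap pointers or file handles) as it was built in the new session.

```julia
# Hypothetical stand-in for a stateful object that should NOT be restored
# from a checkpoint (for example, something wrapping a file or a C pointer).
struct FileHandle
    path::String
end

# Restorable leaves. Arrays play the role of `ClimaCore` fields in this
# sketch: the saved values are copied in place into the freshly built cache.
function restore_sketch!(current::AbstractArray, saved::AbstractArray)
    size(current) == size(saved) || error("size mismatch while restoring")
    copyto!(current, saved)
    return nothing
end

# Containers. Recurse into every entry, matching entries by key
# (symbols for NamedTuples, integer positions for Tuples).
function restore_sketch!(
    current::Union{NamedTuple, Tuple},
    saved::Union{NamedTuple, Tuple},
)
    for key in keys(current)
        restore_sketch!(current[key], saved[key])
    end
    return nothing
end

# Everything else is left untouched, so whatever the new session created
# (file handles, communicators, ...) stays valid.
restore_sketch!(current, saved) = nothing

# A freshly initialized cache and the values read back from a checkpoint.
cache = (; T = [0.0, 0.0], flux = (; lw = [0.0]), handle = FileHandle("new.nc"))
saved = (; T = [280.0, 285.0], flux = (; lw = [340.0]), handle = FileHandle("old.nc"))

restore_sketch!(cache, saved)
```

After the call, `cache.T` and `cache.flux.lw` hold the checkpointed values, while `cache.handle` still refers to the object created by the new session. A real component model would add methods for its own types, which is the role `Checkpointer.restore_cache!` plays according to the notes above.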