Add restart test and support for restarts
This commit implements a way to restart simulations by saving both state
and caches of component models, as well as the coupler fields.

Given that caches are complex objects, I implemented this using JLD2
files.

The challenges with JLD2 files are that:
- they are not MPI compatible,
- they are not GPU compatible.

For this reason, I have to move everything to the CPU, and have each
process write to its own output. This adds a restriction: only the same
number of MPI processes (and the same machine) can be used for restarts.

In addition to this, this approach requires component models to
implement their functions to restore their caches.

Something that can be improved in the future is that ClimaAtmos is
currently producing two checkpoints, one independently, and one from
ClimaCoupler. This should not be needed, but it is currently needed
because there's no other way to start ClimaAtmos at a different time.

The other problem here is that the MPI test occasionally hangs (as it
does in ClimaAtmos).
Sbozzolo committed Feb 26, 2025
1 parent 9839b78 commit ad2dbaa
Showing 26 changed files with 914 additions and 199 deletions.
40 changes: 29 additions & 11 deletions .buildkite/pipeline.yml
@@ -67,16 +67,6 @@ steps:
- group: "Unit Tests"
steps:

- label: "MPI Checkpointer unit tests"
key: "checkpointer_mpi_tests"
command: "srun julia --color=yes --project=test/ test/mpi_tests/checkpointer_mpi_tests.jl"
timeout_in_minutes: 20
env:
CLIMACOMMS_CONTEXT: "MPI"
agents:
slurm_ntasks: 2
slurm_mem: 16GB

- label: "MPI Utilities unit tests"
key: "utilities_mpi_tests"
command: "srun julia --color=yes --project=test/ test/utilities_tests.jl"
@@ -97,6 +87,7 @@ steps:
agents:
slurm_ntasks: 1
slurm_gres: "gpu:1"
slurm_mem: 20GB

- group: "GPU: experiments/ClimaEarth/ unit tests and global bucket"
steps:
@@ -109,6 +100,33 @@ steps:
slurm_gres: "gpu:1"
slurm_mem: 20GB

- group: "ClimaEarth test"
steps:
- label: "ClimaEarth test"
key: "restarts"
command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/runtests.jl"
agents:
slurm_mem: 16GB

- label: "ClimaEarth test GPU"
key: "gpu_restarts"
command: "julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
env:
CLIMACOMMS_DEVICE: "CUDA"
agents:
slurm_mem: 24GB
slurm_gpus: 1

- label: "MPI restarts"
key: "mpi_restarts"
command: "srun julia --color=yes --project=experiments/ClimaEarth/ experiments/ClimaEarth/test/restart.jl"
env:
CLIMACOMMS_CONTEXT: "MPI"
timeout_in_minutes: 120
agents:
slurm_ntasks: 2
slurm_mem: 24GB

- group: "Integration Tests"
steps:
# SLABPLANET EXPERIMENTS
@@ -218,7 +236,7 @@ steps:
CLIMACOMMS_CONTEXT: "MPI"
agents:
slurm_ntasks: 4
-slurm_mem_per_cpu: 8GB
+slurm_mem_per_cpu: 12GB

# short high-res performance test
- label: "Unthreaded AMIP FINE" # also reported by longruns with a flame graph
14 changes: 14 additions & 0 deletions NEWS.md
@@ -6,6 +6,20 @@ ClimaCoupler.jl Release Notes

### ClimaCoupler features

#### Restart simulations with JLD2 files PR[#1179](https://github.com/CliMA/ClimaCoupler.jl/pull/1179)

`ClimaCoupler` can now use `JLD2` files to save the state and cache of its model
components, allowing it to restart from saved checkpoints. Some restrictions
apply:

- The number of MPI processes has to remain the same across checkpoints
- Restart files are generally not portable across machines
- Adding/changing new component models will probably require adding/changing code

Please refer to the
[documentation](https://clima.github.io/ClimaCoupler.jl/dev/checkpointer/) for
more information.

#### Remove extra `get_field` functions PR[#1203](https://github.com/CliMA/ClimaCoupler.jl/pull/1203)
Removes the `get_field` functions for `air_density` for all models, which
were unused except for the `BucketSimulation` method, which is replaced by a
4 changes: 3 additions & 1 deletion Project.toml
@@ -8,6 +8,7 @@ ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
@@ -16,9 +17,10 @@ Thermodynamics = "b60c26fb-14c3-4610-9d3e-2d17fe7ff00c"

[compat]
ClimaComms = "0.6.2"
-ClimaCore = "0.14.23"
+ClimaCore = "0.14.25"
ClimaUtilities = "0.1.22"
Dates = "1"
JLD2 = "0.5.11"
Logging = "1"
SciMLBase = "2.11"
StaticArrays = "1.6"
130 changes: 127 additions & 3 deletions docs/src/checkpointer.md
@@ -1,12 +1,136 @@
# Checkpointer

This module contains general functions for logging the model states and restarting simulations. The `Checkpointer` uses `ClimaCore.InputOutput` infrastructure, which allows it to handle arbitrarily distributed logging and restart setups.
## How to save and restart from checkpoints

`ClimaCoupler` supports saving and reading simulation checkpoints. This is
useful to split a long simulation into smaller, more manageable chunks.

Checkpoints are a mix of HDF5 and JLD2 files and are typically saved in a
`checkpoints` folder in the simulation output. See
[`Utilities.setup_output_dirs`](@ref) for more information.

!!! note "Known limitations"

    - The number of MPI processes has to remain the same across checkpoints
    - Restart files are generally not portable across machines
    - Adding or changing component models will probably require adding or changing code

### Saving checkpoints

If you are running a model (such as AMIP), chances are that you can enable
checkpointing just by setting a command-line argument: the `checkpoint_dt`
option controls how frequently a checkpoint is produced.

If your model does not come with this option already, you can checkpoint the
simulation by adding a callback that calls the
[`Checkpointer.checkpoint_sims`](@ref) function.

For example, to add a callback that checkpoints every hour of simulated time,
assuming you have a `start_date`:
```julia
import Dates

import ClimaCoupler: Checkpointer, TimeManager
import ClimaDiagnostics.Schedules: EveryCalendarDtSchedule

# Trigger once per simulated hour, counted from `start_date`
schedule_checkpoint = EveryCalendarDtSchedule(Dates.Hour(1); start_date)
checkpoint_callback = TimeManager.Callback(schedule_checkpoint, Checkpointer.checkpoint_sims)

# In the coupling loop:
TimeManager.maybe_trigger_callback(checkpoint_callback, coupled_simulation, time)
```

### Reading checkpoints

There are two ways to restart a simulation from checkpoints. By default,
`ClimaCoupler` tries to find suitable checkpoints and automatically uses them.
Alternatively, you can specify a directory `restart_dir` and a simulation time
`restart_t` to restart from the files saved in the given directory at the given
time. If the model you are running supports writing checkpoints via a
command-line argument, it will probably also support reading them. In this case,
the arguments `restart_dir` and `restart_t` identify the path of the top-level
directory containing all the checkpoint files and the simulated time in seconds.

If the model does not support directly reading a checkpoint, the `Checkpointer`
module provides a straightforward way to add this feature.
[`Checkpointer.restart!`](@ref) takes a coupled simulation, a `restart_dir`, and
a `restart_t` and overwrites the content of the coupled simulation with what is
in the checkpoint.

## Developer notes

In theory, the state of the component models should fully determine the state of
the coupled simulation and one should be able to restart a coupled simulation
just by using the states of the component models. Unfortunately, this is
currently not the case in `ClimaCoupler`. The main reason for this is the
complex interdependencies between component models and within `ClimaAtmos` which
make the initialization step inconsistent. For example, in a coupled simulation,
the surface albedo should be determined by the surface models and used by the
atmospheric model for radiation transfer, but `ClimaAtmos` also tries to set the
surface albedo (since it has to do so when run in standalone mode). In addition
to this, `ClimaAtmos` has a large cache that has internal interdependencies that
are hard to disentangle, and changing a field might require changing some other
field in a different part of the cache. As a result, it is not easy for
`ClimaCoupler` to consistently do initialization from a cold state. To conclude,
restarting a simulation exclusively using the states of the component models is
currently impossible.

Given that restarting a simulation from the state alone is impossible,
`ClimaCoupler` needs to save both the states and the caches. Let us review how we
use `ClimaCore.InputOutput` and the `JLD2` package to accomplish this.

`ClimaCore.InputOutput` provides a lossless way to save the content of certain
`ClimaCore` objects to HDF5 files. Objects saved in this way are not tied to a
particular computing device or configuration. When running with MPI,
`ClimaCore.InputOutput` files are also written efficiently in parallel.

Unfortunately, `ClimaCore.InputOutput` only supports certain objects, such as
`Field`s and `Space`s, but the cache in component models is more complex than
this and contains complex objects with highly stateful quantities (e.g., C
pointers). Because of this, model states are saved to HDF5 but caches must be
saved to JLD2 files.
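
As a rough illustration of the JLD2 side, the sketch below saves a toy cache to a per-process file and reads it back. `ToyCache`, `checkpoint_path`, the file-name pattern, and the `pid` argument are hypothetical names made up for this example; they are not the names `ClimaCoupler` uses. Only `JLD2.jldsave` and `JLD2.jldopen` are real `JLD2` calls.

```julia
import JLD2

# Hypothetical stand-in for a component-model cache that already lives on the CPU.
struct ToyCache
    albedo::Vector{Float64}
    dt::Float64
end

# Hypothetical naming scheme: one file per process, so `pid` plays the role of
# the MPI rank and the same number of processes is needed to read the files back.
checkpoint_path(dir, name, t, pid) = joinpath(dir, "cache_$(name)_$(pid)_$(t).jld2")

cache = ToyCache([0.1, 0.2, 0.3], 400.0)
path = checkpoint_path(mktempdir(), "toy_model", 3600, 0)

# The keyword name ("cache") becomes the dataset name inside the file.
JLD2.jldsave(path; cache)

# Read the object back by dataset name.
restored = JLD2.jldopen(path, "r") do file
    file["cache"]
end
```

Because the whole struct is serialized generically, no per-type serialization code is needed, which is the convenience the text attributes to `JLD2`.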

`JLD2` allows us to save more complex objects without writing specific
serialization methods for every struct. `JLD2` allows us to take a big step
forward, but there are still several challenges that need to be solved:
1. `JLD2` does not support CUDA natively. To work around this, we have to move
   everything onto the CPU first. Then, when the data is read back, we have to
   move it back to the GPU.
2. `JLD2` does not support MPI natively. To work around this, each process writes
   its own `jld2` checkpoint and reads it back. This introduces the constraint
   that the number of MPI processes cannot change across restarts.
3. Some quantities are best not saved and read (for example, anything with
   pointers). For this, we write a recursive function that traverses the cache
   and only restores quantities of a certain type (typically, `ClimaCore`
   objects).

Point 3 adds a significant amount of code and requires component models to
specify how their cache has to be restored.
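
The selective restore described in point 3 can be sketched with plain Julia types. This is a toy version under stated assumptions, not the `Checkpointer` implementation: `FileHandle`, `Inner`, `ToyCache`, and `restore_toy!` are hypothetical, and numeric arrays stand in for the `ClimaCore` objects that would be restored in practice.

```julia
# Plays the role of a stateful object (e.g. something holding a pointer)
# that must NOT be overwritten with the saved copy.
mutable struct FileHandle
    path::String
end

struct Inner
    temperature::Vector{Float64}
    handle::FileHandle
end

struct ToyCache
    inner::Inner
    albedo::Vector{Float64}
    dt::Float64
end

"""
Recursively walk `live` and `saved` in lockstep. Numeric arrays are copied
from `saved` into `live`; scalars, strings, and `FileHandle`s are left alone;
any other struct is traversed field by field.
"""
function restore_toy!(live, saved)
    if live isa AbstractArray{<:Number}
        copyto!(live, saved)              # restorable leaf
    elseif live isa Union{Number, AbstractString, FileHandle}
        # skipped on purpose: keep whatever the freshly built cache has
    else
        for name in fieldnames(typeof(live))
            restore_toy!(getfield(live, name), getfield(saved, name))
        end
    end
    return nothing
end

live = ToyCache(Inner(zeros(2), FileHandle("new.nc")), zeros(3), 1.0)
saved = ToyCache(Inner([280.0, 290.0], FileHandle("old.nc")), [0.1, 0.2, 0.3], 1.0)
restore_toy!(live, saved)
```

After the call, the arrays in `live` hold the saved values while `live.inner.handle` still points at the handle created for the new run. Deciding which types count as restorable leaves is the model-specific part that each component has to spell out.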

If you are adding a component model with a cache, you have to extend the
```
Checkpointer.get_model_cache
Checkpointer.restore_cache!
```
methods.

`ClimaCoupler` moves objects to the CPU with `Adapt.adapt(Array, x)`. `Adapt`
traverses the object recursively, and proper `Adapt` methods have to be defined
for every object involved in the chain. The easiest way to do this is using the
`Adapt.@adapt_structure` macro, which defines a recursive
`Adapt.adapt_structure` method for the given type.
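
A minimal sketch of this pattern, assuming a hypothetical parametric `ToyCache` type (not a `ClimaCoupler` type). The example runs on the CPU, where the arrays are already `Array`s, so the values are unchanged; with GPU arrays in the fields, the same call would return CPU copies.

```julia
import Adapt

# Hypothetical cache type. The type parameters let the fields change type
# when they are adapted (e.g. from a GPU array to an `Array`).
struct ToyCache{A, P}
    albedo::A
    params::P
end

# Defines `Adapt.adapt_structure(to, ::ToyCache)`, which rebuilds the struct
# after adapting each field recursively.
Adapt.@adapt_structure ToyCache

cache = ToyCache([0.1, 0.2, 0.3], (; dt = 400.0))

# Move every array reachable from `cache` to CPU memory.
cpu_cache = Adapt.adapt(Array, cache)
```

The macro rebuilds the object by calling the type's constructor with the adapted fields, so a type with a custom inner constructor may need a hand-written `Adapt.adapt_structure` method instead.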

Types to watch for:
- `MPI`-related objects (e.g., `MPICommsContext`)
- `TimeVaryingInputs` (because they contain `NCDatasets`, which contain pointers
to files)

## Checkpointer API

```@docs
ClimaCoupler.Checkpointer.get_model_prog_state
ClimaCoupler.Checkpointer.restart_model_state!
ClimaCoupler.Checkpointer.checkpoint_model_state
ClimaCoupler.Checkpointer.get_model_cache
ClimaCoupler.Checkpointer.restart!
ClimaCoupler.Checkpointer.checkpoint_sims
ClimaCoupler.Checkpointer.t_start_from_checkpoint
```