Support identical restarts with JLD2 files #1179

Sbozzolo · 2025-02-06T20:27:37Z

This commit implements a way to restart simulations by saving both state and caches of component models, as well as the coupler fields.

Given that caches are complex objects, I implemented this using JLD2 files.

The challenges with JLD2 files are that:

- they are not MPI compatible,
- they are not GPU compatible.

For this reason, I have to move everything to the CPU and have each process write to its own output file. This adds a restriction: restarts must use the same number of MPI processes (and the same machine).
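As a rough illustration of the per-process layout described above, here is a minimal sketch, not the code in this PR. It assumes the data has already been moved to the CPU and that the process id and process count are known; `checkpoint_path`, `save_checkpoint`, and `load_checkpoint` are hypothetical names. Only `JLD2.jldsave` and `JLD2.load` are real JLD2 calls.

```julia
using JLD2  # assumed to be installed; provides jldsave and load

# Hypothetical helper: one file per MPI process, so processes never share a file.
checkpoint_path(dir, pid) = joinpath(dir, "checkpoint_pid$(pid).jld2")

# Save CPU-resident data for this process and record the process count
# so that a restart can check it.
function save_checkpoint(dir, pid, nprocs, state, cache)
    mkpath(dir)
    JLD2.jldsave(checkpoint_path(dir, pid); state = state, cache = cache, nprocs = nprocs)
    return nothing
end

# Load this process's file and refuse to restart with a different process count.
function load_checkpoint(dir, pid, nprocs)
    data = JLD2.load(checkpoint_path(dir, pid))  # Dict with String keys
    data["nprocs"] == nprocs ||
        error("checkpoint written with $(data["nprocs"]) processes, restart uses $nprocs")
    return data["state"], data["cache"]
end
```

Storing the process count next to the data is one way to turn the "same number of MPI processes" restriction into an explicit error instead of a silent mismatch.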

In addition, this approach requires component models to implement their own functions to restore their caches.
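The per-model requirement can be pictured with a small dispatch sketch. The function names `get_model_cache` and `restart_model_cache!` are the ones discussed in this PR, but the signatures, the toy types `EvolvingSim` and `PrescribedSim`, and the no-op fallbacks are all assumptions made for illustration; this is not the ClimaCoupler implementation.

```julia
# Toy stand-ins for component simulations (hypothetical types).
struct EvolvingSim
    cache::Dict{Symbol, Vector{Float64}}
end
struct PrescribedSim end  # fully determined by time, nothing to restore

# Assumed fallbacks: a model without an independent cache needs no extra methods.
get_model_cache(sim) = nothing
restart_model_cache!(sim, saved) = nothing

# A model with its own cache returns a copy to be written to disk...
get_model_cache(sim::EvolvingSim) = deepcopy(sim.cache)

# ...and copies the saved values back into its existing arrays on restart.
function restart_model_cache!(sim::EvolvingSim, saved)
    for (name, values) in saved
        copyto!(sim.cache[name], values)
    end
    return nothing
end
```

Copying into the existing arrays, rather than replacing the cache object, keeps any other references to those arrays valid after the restart.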

Something that can be improved in the future is that ClimaAtmos currently produces two checkpoints, one independently and one from ClimaCoupler. This should not be necessary, but for now there is no other way to start ClimaAtmos at a different time.

Moreover, when restarting, the ClimaAtmos t_end is counted from t_start, which differs from the ClimaCoupler t_end.

The other problem here is that the MPI test occasionally hangs (as it does in ClimaAtmos).

Closes #1063

juliasloan25

I haven't had a chance to look at the atmos/bucket or test files yet, but here's my initial review.

One question that came to mind is why Checkpointer.get_model_cache and Checkpointer.restart_model_cache! don't need to be extended for the non-atmos/land models we're running (e.g. PrescribedIceSimulation, which does have a cache). I would expect that we need those methods, but the tests seem to pass without them.

docs/src/checkpointer.md

experiments/ClimaEarth/setup_run.jl

juliasloan25 · 2025-02-25T06:27:49Z

experiments/ClimaEarth/run_amip.jl

@@ -22,4 +22,4 @@ include("setup_run.jl")
 config_file = parse_commandline(argparse_settings())["config_file"]

 # Set up and run the coupled simulation
-setup_and_run(config_file)
+cs = setup_and_run(config_file)


Why return cs?

Because I need a way to access the fields.

docs/src/checkpointer.md

Sbozzolo · 2025-02-26T00:17:00Z

experiments/ClimaEarth/setup_run.jl

@@ -90,7 +90,6 @@ and exchanges combined fields and calculates fluxes using the selected turbulent
 Note that we want to implement this in a dispatchable function to allow for
 other forms of timestepping (e.g. leapfrog).
 """
-


Removed this because docstrings have to be next to their functions.

Sbozzolo · 2025-02-26T00:27:55Z

> I haven't had a chance to look at the atmos/bucket or test files yet, but here's my initial review.
>
> One question that came to mind is why Checkpointer.get_model_cache and Checkpointer.restart_model_cache! don't need to be extended for the non-atmos/land models we're running (e.g. PrescribedIceSimulation, which does have a cache). I would expect that we need those methods, but the tests seem to pass without them.

Great question! The answer is that the PrescribedIce/Ocean models do not evolve independently. When uncoupled, they don't really have a cache, because everything is determined by the time. When coupled, they don't really have a cache either, because all the fields that are not defaults are updated using values from the coupled_fields, which are stored to disk and read back.

Sbozzolo force-pushed the gb/checkpoint3 branch from 47079b6 to bc99b9e Compare February 6, 2025 20:28

Sbozzolo force-pushed the gb/checkpoint3 branch 29 times, most recently from b615ef3 to 048602e Compare February 24, 2025 23:30

Sbozzolo requested a review from juliasloan25 February 25, 2025 00:11

Sbozzolo force-pushed the gb/checkpoint3 branch 3 times, most recently from 4f06208 to f839199 Compare February 25, 2025 01:39

juliasloan25 reviewed Feb 25, 2025

Sbozzolo force-pushed the gb/checkpoint3 branch 4 times, most recently from 4040c80 to 9cab58c Compare February 26, 2025 00:15

Sbozzolo commented Feb 26, 2025

Sbozzolo force-pushed the gb/checkpoint3 branch 2 times, most recently from 73ba81c to c5a2b88 Compare February 26, 2025 00:25

Sbozzolo force-pushed the gb/checkpoint3 branch 12 times, most recently from ad2dbaa to 444a609 Compare February 26, 2025 23:36

Sbozzolo force-pushed the gb/checkpoint3 branch from 444a609 to 9673868 Compare February 27, 2025 00:27
