Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support identical restarts with JLD2 files #1179

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

Sbozzolo
Copy link
Member

@Sbozzolo Sbozzolo commented Feb 6, 2025

This commit implements a way to restart simulations by saving both state and caches of component models, as well as the coupler fields.

Given that caches are complex object, I implemented this using JLD2 files.

The challenges with JLD2 files are that:

  • they are not MPI compatible,
  • they are not GPU compatible.

For this reason, I have to move everything to the CPU, and have each process write to its own output. This adds a restriction: only the same number of MPI process (and the same machine) can be used for restarts.

In addition to this, this approach requires component models to implement their functions to restore their caches.

Something that can be improved in the future is that, ClimaAtmos is currently producing two checkpoints, one independently, and one from ClimaCoupler. This should not be needed, but it is currently needed because there's no other way to start ClimaAtmos at a different time.

More over, when restarting, ClimaAtmos t_end will be from the t_start, which is different from climacoupler t_end.

The other problem here is that the MPI test occasionally hangs (as it does in ClimaAtmos).p

Closes #1063

@Sbozzolo Sbozzolo force-pushed the gb/checkpoint3 branch 29 times, most recently from b615ef3 to 048602e Compare February 24, 2025 23:30
@Sbozzolo Sbozzolo force-pushed the gb/checkpoint3 branch 3 times, most recently from 4f06208 to f839199 Compare February 25, 2025 01:39
Copy link
Member

@juliasloan25 juliasloan25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a chance to look at the atmos/bucket or test files yet, but here's my initial review.

One question that came to mind is why Checkpointer.get_model_cache and Checkpointer.restart_model_cache! don't need to be extended for the non-atmos/land models we're running (e.g. PrescribedIceSimulation, which does have a cache). I would expect that we need those methods, but the tests seem to pass without them.

@@ -22,4 +22,4 @@ include("setup_run.jl")
config_file = parse_commandline(argparse_settings())["config_file"]

# Set up and run the coupled simulation
setup_and_run(config_file)
cs = setup_and_run(config_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why return cs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I need a way to access the fields

@Sbozzolo Sbozzolo force-pushed the gb/checkpoint3 branch 4 times, most recently from 4040c80 to 9cab58c Compare February 26, 2025 00:15
@@ -90,7 +90,6 @@ and exchanges combined fields and calculates fluxes using the selected turbulent
Note that we want to implement this in a dispatchable function to allow for
other forms of timestepping (e.g. leapfrog).
"""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this because docstrings have to next to their functions

@Sbozzolo Sbozzolo force-pushed the gb/checkpoint3 branch 2 times, most recently from 73ba81c to c5a2b88 Compare February 26, 2025 00:25
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Feb 26, 2025

I haven't had a chance to look at the atmos/bucket or test files yet, but here's my initial review.

One question that came to mind is why Checkpointer.get_model_cache and Checkpointer.restart_model_cache! don't need to be extended for the non-atmos/land models we're running (e.g. PrescribedIceSimulation, which does have a cache). I would expect that we need those methods, but the tests seem to pass without them.

Great question! The answer is that the PrescribedIce/Ocean are not evolving independently. When uncoupled, they don't really have a cache because everything is determined by the time. When coupled, they don't really have a cache either, because all the fields that are not default are updated using values from the coupled_fields, which is stored to disk and read back.

@Sbozzolo Sbozzolo force-pushed the gb/checkpoint3 branch 12 times, most recently from ad2dbaa to 444a609 Compare February 26, 2025 23:36
This commit implements a way to restart simulations by saving both state
and caches of component models, as well as the coupler fields.

Given that caches are complex object, I implemented this using JLD2
files.

The challenges with JLD2 files are that:
- they are not MPI compatible,
- they are not GPU compatible.

For this reason, I have to move everything to the CPU, and have each
process write to its own output. This adds a restriction: only the same
number of MPI process (and the same machine) can be used for restarts.

In addition to this, this approach requires component models to
implement their functions to restore their caches.

Something that can be improved in the future is that, ClimaAtmos is
currently producing two checkpoints, one independently, and one from
ClimaCoupler. This should not be needed, but it is currently needed
because there's no other way to start ClimaAtmos at a different time.

The other problem here is that the MPI test occasionally hangs (as it
does in ClimaAtmos).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants