-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support identical restarts with JLD2 files #1179
base: main
Are you sure you want to change the base?
Conversation
47079b6
to
bc99b9e
Compare
b615ef3
to
048602e
Compare
4f06208
to
f839199
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had a chance to look at the atmos/bucket or test files yet, but here's my initial review.
One question that came to mind is why Checkpointer.get_model_cache
and Checkpointer.restart_model_cache!
don't need to be extended for the non-atmos/land models we're running (e.g. PrescribedIceSimulation
, which does have a cache). I would expect that we need those methods, but the tests seem to pass without them.
@@ -22,4 +22,4 @@ include("setup_run.jl") | |||
config_file = parse_commandline(argparse_settings())["config_file"] | |||
|
|||
# Set up and run the coupled simulation | |||
setup_and_run(config_file) | |||
cs = setup_and_run(config_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why return cs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I need a way to access the fields
4040c80
to
9cab58c
Compare
@@ -90,7 +90,6 @@ and exchanges combined fields and calculates fluxes using the selected turbulent | |||
Note that we want to implement this in a dispatchable function to allow for | |||
other forms of timestepping (e.g. leapfrog). | |||
""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this because docstrings have to next to their functions
73ba81c
to
c5a2b88
Compare
Great question! The answer is that the PrescribedIce/Ocean are not evolving independently. When uncoupled, they don't really have a cache because everything is determined by the time. When coupled, they don't really have a cache either, because all the fields that are not default are updated using values from the coupled_fields, which is stored to disk and read back. |
ad2dbaa
to
444a609
Compare
This commit implements a way to restart simulations by saving both state and caches of component models, as well as the coupler fields. Given that caches are complex object, I implemented this using JLD2 files. The challenges with JLD2 files are that: - they are not MPI compatible, - they are not GPU compatible. For this reason, I have to move everything to the CPU, and have each process write to its own output. This adds a restriction: only the same number of MPI process (and the same machine) can be used for restarts. In addition to this, this approach requires component models to implement their functions to restore their caches. Something that can be improved in the future is that, ClimaAtmos is currently producing two checkpoints, one independently, and one from ClimaCoupler. This should not be needed, but it is currently needed because there's no other way to start ClimaAtmos at a different time. The other problem here is that the MPI test occasionally hangs (as it does in ClimaAtmos).
444a609
to
9673868
Compare
This commit implements a way to restart simulations by saving both state and caches of component models, as well as the coupler fields.
Given that caches are complex object, I implemented this using JLD2 files.
The challenges with JLD2 files are that:
For this reason, I have to move everything to the CPU, and have each process write to its own output. This adds a restriction: only the same number of MPI process (and the same machine) can be used for restarts.
In addition to this, this approach requires component models to implement their functions to restore their caches.
Something that can be improved in the future is that, ClimaAtmos is currently producing two checkpoints, one independently, and one from ClimaCoupler. This should not be needed, but it is currently needed because there's no other way to start ClimaAtmos at a different time.
More over, when restarting, ClimaAtmos
t_end
will be from thet_start
, which is different from climacouplert_end
.The other problem here is that the MPI test occasionally hangs (as it does in ClimaAtmos).p
Closes #1063