Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify restart functionality #135

Merged
merged 9 commits into from
Sep 20, 2023
Merged

Conversation

johnomotani
Copy link
Collaborator

Move actual implementation of restart functionality inside setup_moment_kinetics!() - this also fixes a bug where iblock_index[] was not initialised when get_backup_filename() needed to use it.

This allows the restart_moment_kinetics() function to be removed. It is replaced by a couple of arguments to run_moment_kinetics.

The restart_moment_kinetics.jl script is removed. Now either pass the --restart flag to run_moment_kinetics() to restart from the latest file in the run directory, or pass the output file to restart from as the second positional argument.

Move actual implementation of restart functionality inside
setup_moment_kinetics!() - this also fixes a bug where `iblock_index[]`
was not initialised when `get_backup_filename()` needed to use it.

This allows the `restart_moment_kinetics()` function to be removed. It
is replaced by a couple of arguments to run_moment_kinetics.

The `restart_moment_kinetics.jl` script is removed. Now either pass the
`--restart` flag to `run_moment_kinetics()` to restart from the latest
file in the run directory, or pass the output file to restart from as
the second positional argument.
@johnomotani johnomotani added bug Something isn't working enhancement New feature or request labels Sep 19, 2023
@johnomotani
Copy link
Collaborator Author

Not tested this yet. @mrhardman does it fix the bug you were running into?

@mrhardman
Copy link
Collaborator

mrhardman commented Sep 20, 2023

Edit: This seems to be due to inconsistency between src/command_line_options.jl and run_moment_kinetics.jl -- will suggest changes.

Attempting to start a normal simulation, I find a bug

ERROR: LoadError: KeyError: key "restart_file" not found
Stacktrace:
[1] getindex(h::Dict{String, Any}, key::String)
@ Base ./dict.jl:481
[2] run_moment_kinetics()
@ moment_kinetics /src/moment_kinetics.jl:162
[3] top-level scope
@ /run_moment_kinetics.jl:7
in expression starting at run_moment_kinetics.jl:7

options = get_options()
inputfile = options["inputfile"]
restart = options["restart"]
if options["restart_file"] !== nothing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace "restart_file" with "restartfile" to be consistent with src/command_line_options.jl line 18

if options["restart_file"] !== nothing
restart = options["restartfile"]
end
restart_time_index = options["restart_time_index"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace "restart_time_index" with "restart-time-index" to be consistent with line 35 of src/command_line_options.jl

Copy link
Collaborator

@mrhardman mrhardman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small typos in moment_kinetics.jl need correcting as indicated in comments. The parallel_io = false options now seem to be working for me if I use only a single shared memory region. Otherwise, for multiple shared memory regions using the distribution memory for z, the simulation crashes before the time advance with a bounds error but no stacktrace. In my own branch, the commits to fix the small bugs here are 8de082d. A comment is modified in ce1b134. It would be useful to add a full example command to clarify the instructions on using the restart functionality in the manual.

else
error("Unrecognized binary_format '$binary_format'")
end
restart = glob(joinpath(output_dir, run_name * ".dfns*." * ext))[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line appears to be giving BoundsError when the number of shared memory regions is greater that 1.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code make any allowances for multiple .h5 or .cdf files to be copied and read? I cannot see how it does here.

Copy link
Collaborator

@mrhardman mrhardman Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying

println(joinpath(output_dir, run_name * ".dfns*." * ext))
println(glob(joinpath(output_dir, run_name * ".dfns*." * ext)))

to restart an example wall-bc_cheb_single_ion.toml with 8 shared memory regions (z_nelement/z_nelement_local = 8) and parallel_io = false

Gets

runs/wall-bc_cheb_single_ion/wall-bc_cheb_single_ion.dfns*.h5
String[]

It's pretty clear that this code is broken for some reason in this case. What should it be doing? What does the function `glob' do? I cannot find the definition.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glob() finds files matching a pattern (name comes from *nix shell, e.g. https://tldp.org/LDP/abs/html/globbingref.html). The bounds error usually means there is no file matching the pattern it is looking for.

Some of the error messages can be a bit unhelpful. I tried to improve things a bit in a few commits in #128. Those would need merging if we merge this PR first - maybe I should just cherry-pick them into this branch...

# overwritten.
dfns_filename, backup_dfns_filename, parallel_io, moments_filename,
backup_moments_filename, backup_prefix_iblock =
get_backup_filename(restart)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer not to use `restart' as a logical and as a filename here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can agree it's maybe not the clearest to change the value of restart that was passed in as an argument. Will make a separate restart_filename variable that gets set appropriately.

Sometimes it is useful to restart one run from a separate run (whose
output is in a different directory). In this case there is no need to
move the output files that are being read from, as they would not be
overwritten by the new run. So do not move them, to keep the directory
of the run being restarted from in its original state.
Using an input file is preferred now, as changing options by editing
defaults in moment_kinetics_input.toml forces Julia to redo
precompilation of the project.
@johnomotani johnomotani force-pushed the restarting-simplification branch from 0144381 to 7ba8558 Compare September 20, 2023 11:20
@johnomotani
Copy link
Collaborator Author

@mrhardman I think I've addressed all your comments now.

@johnomotani
Copy link
Collaborator Author

PS @mrhardman I don't think you've pushed the commits you mentioned in one of your comments. If there was a comment that needed updating, would you please just push the change to this branch? Thanks!

@mrhardman
Copy link
Collaborator

PS @mrhardman I don't think you've pushed the commits you mentioned in one of your comments. If there was a comment that needed updating, would you please just push the change to this branch? Thanks!

Sure! I was trying to debug these features by merging your changes into my own PR (generalised-chodura-diagnostic), since the simulations that I wanted to restart use a feature that is only in that branch at the moment. Switching branches seems to sometimes confuse my Julia compiler, so I will leave comment here directly.

@mrhardman
Copy link
Collaborator

My comment was just that this line

help = "Name of NetCDF file to restart from"
should mention HDF5.

Also move "--restart" and "--restart-time-index" so they are not under
the `# Options for tests` heading in the source code.
@mrhardman
Copy link
Collaborator

I am happy now that the restart feature is working with parallel_io = false. My tests were carried out after merging this branch with #110. I have now pushed my commits there -- I didn't have to make any further modifications.

Copy link
Collaborator

@mrhardman mrhardman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, the restarting feature now works for parallel_io = false, and the code is easier to follow : )

@johnomotani johnomotani merged commit 808468b into master Sep 20, 2023
@johnomotani johnomotani deleted the restarting-simplification branch September 20, 2023 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants