-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify restart functionality #135
Conversation
Move actual implementation of restart functionality inside setup_moment_kinetics!() - this also fixes a bug where `iblock_index[]` was not initialised when `get_backup_filename()` needed to use it. This allows the `restart_moment_kinetics()` function to be removed. It is replaced by a couple of arguments to run_moment_kinetics. The `restart_moment_kinetics.jl` script is removed. Now either pass the `--restart` flag to `run_moment_kinetics()` to restart from the latest file in the run directory, or pass the output file to restart from as the second positional argument.
Not tested this yet. @mrhardman does it fix the bug you were running into? |
Edit: This seems to be due to inconsistency between src/command_line_options.jl and run_moment_kinetics.jl -- will suggest changes. Attempting to start a normal simulation, I find a bug ERROR: LoadError: KeyError: key "restart_file" not found |
src/moment_kinetics.jl
Outdated
options = get_options() | ||
inputfile = options["inputfile"] | ||
restart = options["restart"] | ||
if options["restart_file"] !== nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace "restart_file" with "restartfile" to be consistent with src/command_line_options.jl line 18
src/moment_kinetics.jl
Outdated
if options["restart_file"] !== nothing | ||
restart = options["restartfile"] | ||
end | ||
restart_time_index = options["restart_time_index"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace "restart_time_index" with "restart-time-index" to be consistent with line 35 of src/command_line_options.jl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small typos in moment_kinetics.jl need correcting as indicated in comments. The parallel_io = false options now seem to be working for me if I use only a single shared memory region. Otherwise, for multiple shared memory regions using the distribution memory for z, the simulation crashes before the time advance with a bounds error but no stacktrace. In my own branch, the commits to fix the small bugs here are 8de082d. A comment is modified in ce1b134. It would be useful to add a full example command to clarify the instructions on using the restart functionality in the manual.
src/moment_kinetics.jl
Outdated
else | ||
error("Unrecognized binary_format '$binary_format'") | ||
end | ||
restart = glob(joinpath(output_dir, run_name * ".dfns*." * ext))[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line appears to be giving BoundsError when the number of shared memory regions is greater that 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this code make any allowances for multiple .h5 or .cdf files to be copied and read? I cannot see how it does here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying
println(joinpath(output_dir, run_name * ".dfns*." * ext))
println(glob(joinpath(output_dir, run_name * ".dfns*." * ext)))
to restart an example wall-bc_cheb_single_ion.toml with 8 shared memory regions (z_nelement/z_nelement_local = 8) and parallel_io = false
Gets
runs/wall-bc_cheb_single_ion/wall-bc_cheb_single_ion.dfns*.h5
String[]
It's pretty clear that this code is broken for some reason in this case. What should it be doing? What does the function `glob' do? I cannot find the definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glob()
finds files matching a pattern (name comes from *nix shell, e.g. https://tldp.org/LDP/abs/html/globbingref.html). The bounds error usually means there is no file matching the pattern it is looking for.
Some of the error messages can be a bit unhelpful. I tried to improve things a bit in a few commits in #128. Those would need merging if we merge this PR first - maybe I should just cherry-pick them into this branch...
src/moment_kinetics.jl
Outdated
# overwritten. | ||
dfns_filename, backup_dfns_filename, parallel_io, moments_filename, | ||
backup_moments_filename, backup_prefix_iblock = | ||
get_backup_filename(restart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer not to use `restart' as a logical and as a filename here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can agree it's maybe not the clearest to change the value of restart
that was passed in as an argument. Will make a separate restart_filename
variable that gets set appropriately.
Sometimes it is useful to restart one run from a separate run (whose output is in a different directory). In this case there is no need to move the output files that are being read from, as they would not be overwritten by the new run. So do not move them, to keep the directory of the run being restarted from in its original state.
Using an input file is preferred now, as changing options by editing defaults in moment_kinetics_input.toml forces Julia to redo precompilation of the project.
0144381
to
7ba8558
Compare
@mrhardman I think I've addressed all your comments now. |
PS @mrhardman I don't think you've pushed the commits you mentioned in one of your comments. If there was a comment that needed updating, would you please just push the change to this branch? Thanks! |
Sure! I was trying to debug these features by merging your changes into my own PR (generalised-chodura-diagnostic), since the simulations that I wanted to restart use a feature that is only in that branch at the moment. Switching branches seems to sometimes confuse my Julia compiler, so I will leave comment here directly. |
My comment was just that this line
|
Also move "--restart" and "--restart-time-index" so they are not under the `# Options for tests` heading in the source code.
I am happy now that the restart feature is working with parallel_io = false. My tests were carried out after merging this branch with #110. I have now pushed my commits there -- I didn't have to make any further modifications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, the restarting feature now works for parallel_io = false, and the code is easier to follow : )
Move actual implementation of restart functionality inside setup_moment_kinetics!() - this also fixes a bug where
iblock_index[]
was not initialised whenget_backup_filename()
needed to use it.This allows the
restart_moment_kinetics()
function to be removed. It is replaced by a couple of arguments to run_moment_kinetics.The
restart_moment_kinetics.jl
script is removed. Now either pass the--restart
flag torun_moment_kinetics()
to restart from the latest file in the run directory, or pass the output file to restart from as the second positional argument.