Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parallel I/O for timers #292

Merged
merged 9 commits into from
Nov 13, 2024
Merged

Fix parallel I/O for timers #292

merged 9 commits into from
Nov 13, 2024

Conversation

johnomotani
Copy link
Collaborator

The timers added in #276 introduced errors in parallel I/O. These errors seem to have come from:

  1. The timers were stored in Dicts. Julia Dicts do not have a consistent order, so different processes sometimes tried to create/write variables in different orders. Fixed by using OrderedDict or SortedDict. There was also a potential bug with input options which were also stored in a Dict being created in inconsistent orders, which is also fixed by this PR.
  2. Creating too many variables in the output file. Not sure why this was a problem, but having timers create separate variables for the timers on each process resulted (in 128 process runs) in a large number of variables, and the run hanging while writing output. After converting the output to arrays with size equal to the number of processes, reducing the number of variables, the hang has gone away.

Also changes the global_timer_string output to be a fixed-length string, to avoid repeatedly creating and deleting the variable. Don't know if this was actually part of the problem - suspect not in the end - but the updated version seems a bit nicer.

@johnomotani johnomotani added the bug Something isn't working label Nov 13, 2024
It is important to have AbstractDicts with a deterministic order of
entries for options, etc., so that when writing them to the output files
with parallel I/O, the writes all happen in exactly the same order,
which is necessary for consistency. Failing to do this can cause hangs
when writing output because the HDF5 cache(s) are inconsistent on
different ranks, which can lead to some ranks calling an MPI function
(e.g. MPI.Bcast!()) when others do not.
Not sure whether deleting and re-creating String variables does was
actually causing a problem (suspect now that it was not), but anyway,
requiring that Strings always have the same length when overwriting
seems a bit nicer and may be slightly more efficient.
Convert "global_timer_string" to a fixed-length String (with a
hard-coded length) before (over-)writing. This hopefully avoids HDF5
errors.

Ensure that `global_timer_string` can only contain ASCII
characters.

Reset timers during cleanup rather than at the beginning of
`run_moment_kinetics()`, which helps to keep the timers in a consistent
state, e.g. for tests that do not use `run_moment_kinetics()` but be run
after another run that created timers.
Inconsistent formatting on different ranks makes the description string
different lengths, which causes HDF5 errors.
On ranks other than the root of each shared-memory block, set
`io_moments` and `io_dfns` to `(io_input=io_input)` so that the input
(e.g. `parallel_io` setting) is available on all ranks.
...rather than in a separate group for each rank. This avoids creating a
very large number of variables in the output file when running on many
cores, which seems to help prevent parallel HDF5 errors.
Something has made DataInspector work in all the figures when they are
opened simultaneously - I guess either the change to making
`inspector_label` be defined separately for each plot in the previous
commit, or some update to `Makie`.
`mergewith()` always returns a `Dict`, ignoring the types of its
arguments. We need `recursive_merge()` to return an `OrderedDict`, so
re-implement `recursive_merge()` 'by hand' without using `mergewith()`.
Should help reduce memory usage and so avoid some errors.
@johnomotani johnomotani merged commit aecba77 into master Nov 13, 2024
21 checks passed
@johnomotani johnomotani deleted the fix-timer-output branch November 13, 2024 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant