Checkpoints files missing when some ranks not assigned #1215

Open
kpgriesser opened this issue Feb 20, 2025 · 0 comments

1 - Detailed description of problem or enhancement

While testing combinations of ranks and threads, checkpoint files are not generated in some cases.

2 - Describe how to reproduce the issue

The test case is a 2-component system that simulates for 10 us.
Using --checkpoint-sim-period=1us, we see checkpoints generated as expected for these cases:

  • 1 rank, 1 thread
  • 1 rank, 2 threads
  • 1 rank, 3 threads
  • 2 ranks, 1 thread

However, when we use 2 ranks with 2 threads, we see no checkpoint files at all, along with the warning message 'no components assigned to rank 1.0 and 1.1'.
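To make the test matrix above easier to rerun, here is a minimal sketch that builds one command line per rank/thread combination. It only constructs and prints the commands; it does not run them. The config file name `two_comp.py` is hypothetical, and the sketch assumes that `sst -n` sets the number of threads per rank and that ranks are launched with `mpirun -np`. Adjust to match the local setup.

```python
# Sketch only: builds the command lines for the rank/thread matrix in this report.
# Assumptions (not taken from the report): the config file name "two_comp.py",
# "-n" as the sst threads-per-rank option, and "mpirun -np" to launch ranks.

CASES = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]  # (ranks, threads)


def build_command(ranks, threads, config="two_comp.py", period="1us"):
    """Return the argument list for one rank/thread combination."""
    sst_args = ["sst", "-n", str(threads),
                f"--checkpoint-sim-period={period}", config]
    if ranks > 1:
        # Multi-rank runs are wrapped in mpirun.
        return ["mpirun", "-np", str(ranks)] + sst_args
    return sst_args


commands = [build_command(ranks, threads) for ranks, threads in CASES]

for (ranks, threads), cmd in zip(CASES, commands):
    print(f"{ranks} rank(s), {threads} thread(s): {' '.join(cmd)}")
```

The last entry in `CASES` is the 2 ranks, 2 threads combination that produces no checkpoint files.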

As a side request, there appears to be 1 simulation checkpoint log message for each thread. For 3 threads, for example:

# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01255 seconds)
# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01257 seconds)
# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01260 seconds)
# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01275 seconds)
# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01278 seconds)
# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01288 seconds)

However, there is only 1 log message per rank when using multiple ranks. It would be helpful to reduce the log file size by producing only 1 message per checkpoint, regardless of the number of threads and ranks.
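As an illustration of the requested output, and as a possible post-processing workaround in the meantime, the following sketch collapses repeated checkpoint messages so that only the first one per simulated time is kept. The function name and the regular expression are my own and assume the message format matches the log lines quoted above; this is not how sst-core handles logging.

```python
import re

# Matches the checkpoint lines quoted in this report and captures the
# simulated time (e.g. "9 us"). The pattern is an assumption based on
# those sample lines only.
CHECKPOINT_RE = re.compile(
    r"^# Simulation Checkpoint: Simulated Time (?P<time>\S+ \S+) \("
)


def collapse_checkpoint_messages(lines):
    """Keep the first checkpoint message per simulated time; pass other lines through."""
    seen_times = set()
    kept = []
    for line in lines:
        match = CHECKPOINT_RE.match(line)
        if match is None:
            kept.append(line)  # not a checkpoint message
            continue
        sim_time = match.group("time")
        if sim_time not in seen_times:
            seen_times.add(sim_time)
            kept.append(line)
    return kept


sample = [
    "# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01255 seconds)",
    "# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01257 seconds)",
    "# Simulation Checkpoint: Simulated Time 9 us (Real CPU time since last checkpoint 0.01260 seconds)",
    "# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01275 seconds)",
    "# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01278 seconds)",
    "# Simulation Checkpoint: Simulated Time 10 us (Real CPU time since last checkpoint 0.01288 seconds)",
]

collapsed = collapse_checkpoint_messages(sample)
for line in collapsed:
    print(line)
```

With the six sample lines from the 3-thread run, this leaves one message for 9 us and one for 10 us.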

3 - What Operating system(s) and versions
All

4 - What versions of external libraries (MPI, etc.)
mpirun (Open MPI) 4.1.2

5 - Provide sha1 of all relevant SST repositories (sst-core, sst-elements, etc)
sst-core