From ecbf31efbfd6a2dbc68b1f698dc63a2cc3ff014a Mon Sep 17 00:00:00 2001
From: "Eric T. Johnson"
Date: Tue, 15 Oct 2024 18:03:54 -0400
Subject: [PATCH] Add note about auto-checkpointing timing out

---
 sphinx_docs/source/nersc-workflow.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/sphinx_docs/source/nersc-workflow.rst b/sphinx_docs/source/nersc-workflow.rst
index 6f77015..febe568 100644
--- a/sphinx_docs/source/nersc-workflow.rst
+++ b/sphinx_docs/source/nersc-workflow.rst
@@ -29,6 +29,13 @@ includes the restart logic to allow for job chaining.
 ``amrex.the_arena_init_size=0`` after ``${restartString}`` in the srun call
 so AMReX doesn't reserve 3/4 of the GPU memory for the device arena.
 
+.. note::
+
+   If the job times out before writing out a checkpoint (leaving a
+   ``dump_and_stop`` file behind), you can give it more time between the
+   warning signal and the end of the allocation by adjusting the
+   ``#SBATCH --signal=B:URG@`` line at the top of the script.
+
 Below is an example that runs on CPU-only nodes. Here ``ntasks-per-node``
 refers to number of MPI processes (used for distributed parallelism) per node,
 and ``cpus-per-task`` refers to number of hyper threads used per task
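
The adjustment described in the added note can be sketched roughly as follows. This is a minimal illustration, not the actual job script: the 120-second lead time, the 30-minute time limit, and the ``sig_handler`` function are assumptions chosen for the example. In Slurm, the number after ``@`` in ``--signal`` is the lead time in seconds before the end of the allocation, and ``B:`` delivers the signal to the batch shell only.

```shell
#!/bin/sh
# Hypothetical header values; choose a lead time long enough to write a checkpoint.
#SBATCH --time=00:30:00
#SBATCH --signal=B:URG@120   # send SIGURG to the batch shell ~120 s before the time limit

# Assumed handler: request a checkpoint by creating the dump_and_stop file
# that the note refers to.
sig_handler() {
    touch dump_and_stop
    echo "warning signal received; requested checkpoint and stop"
}

# Run the handler when the batch shell receives SIGURG.
trap sig_handler URG
```

If a ``dump_and_stop`` file is left behind without a matching checkpoint, increasing the value after ``@`` gives the code more time to finish writing. Slurm may deliver the signal up to 60 seconds earlier than requested.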