With multiple workers and long simulation run times, only half of the workers wake up after the step restart.
I am including the "dummy_simulation.txt" spec and the "workers.txt" batch script. To reproduce the problem on a system with Slurm, run:
merlin run dummy_simulation.txt
sbatch workers.txt
As soon as the job has started running and the tmp_restart_test_*/dummy*/* subdirectories have been created, run:
more tmp_restart_test_*/dummy*/*/*log
and count how many samples reported "Restarting". All 8 samples should restart, but only 4 of them do.
Description of the workflow:
- run a "sleep 1" simulation
- restart (after a one-second delay)
- print a line including "Restarting"
- run a "sleep 200" simulation
- terminate: none of the restarted "sleep 200" simulations should finish, since the allocation is set to die after 3 minutes.
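A quick sanity check of the timing in the workflow above (a minimal sketch; the numbers are the ones stated in this issue, with the 3-minute limit coming from the batch allocation):

```python
# Toy arithmetic for the workflow described above; numbers are taken
# from the spec and batch script in this issue.
ALLOC_SECONDS = 3 * 60   # the allocation is set to die after 3 minutes
first_run = 1            # the "sleep 1" simulation
restart_delay = 1        # the one-second delay before the restart
second_run = 200         # the "sleep 200" simulation after the restart

restart_at = first_run + restart_delay
finish_at = restart_at + second_run

# Every sample has ample time to restart...
assert restart_at < ALLOC_SECONDS
# ...but no restarted "sleep 200" simulation can finish before the
# allocation dies, so seeing "Restarting" is the whole test.
assert finish_at > ALLOC_SECONDS
```

So the only observable outcome is whether each sample logs "Restarting", which is exactly what the repro counts.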
However, if the "merlin resources" block is removed from the spec and the test is repeated, all 8 samples report "Restarting", as expected. It may be worth comparing the Celery command that is executed in these two scenarios.
I started work on this last week, but I think we're going to have to revisit it at a later date (once the new merlin status command is released). Here's what I found so far:
- The problem persists with the latest version of Celery (5.2.7).
- Adding a restart command doesn't fix it.
- Removing the --concurrency 1 option fixes the issue, but it's unclear whether that will scale.
- I believe this is why removing the worker entirely fixed the issue, since the default worker uses a higher concurrency.
- Removing the --prefetch-multiplier option (i.e. using the default of 4) seems to help, but still not every sample restarts.
- Changing the retry delay doesn't resolve the issue.
- Every sample's log says it's restarting, even though roughly half of them do not actually restart.
Everything here leads me to believe this is a Celery issue, but we'll see if the new status command can provide us with more information.
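For reference, the flags above are real Celery worker options; a worker launch line of the sort being varied in these experiments looks roughly like this (the app module, queue name, and worker name here are placeholders for illustration, not taken from the actual spec):

```
celery -A <app> worker --concurrency 1 --prefetch-multiplier 1 -Q <queue> -n <worker_name>
```

Comparing this line with and without the "merlin resources" block, and with the --concurrency and --prefetch-multiplier options removed, would narrow down which flag triggers the behavior.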
This will require more research, but the discussion in these user issues seems similar to the problem here. The issue may be due to ETA/countdown behavior with Celery tasks:
- Tasks are sitting in the queue with a countdown set and a prefetch multiplier of 1.
- A task is picked up by a Celery worker.
- That task takes a long time to complete, so the countdown timers of other tasks in the queue may expire (those tasks can't be started because the prefetch multiplier is 1).
- The long-running task eventually needs to retry, but the worker keeps it in its own memory rather than releasing it back to the queue.
- Another worker frees up, but since the retried task is in the other worker's memory it can't be fetched by the free worker, so it never gets completed.
This is my general understanding of the problem so far.
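To make that concrete, here is a minimal toy model in plain Python (not real Celery; all names are made up for illustration) of two workers with a prefetch multiplier of 1, where a retried task is either kept in the owning worker's memory or released back to the broker:

```python
from collections import deque

def simulate(release_retries_to_broker):
    """Toy model of two Celery-style workers with prefetch multiplier 1.

    Not real Celery: this only illustrates why a retried (ETA/countdown)
    task that stays in its owning worker's memory can be lost when that
    worker stays busy, while a retry released back to the broker can be
    picked up by any free worker.
    """
    broker = deque(["sample_A", "sample_B"])  # two samples needing a restart
    local = {"w1": deque(), "w2": deque()}    # per-worker in-memory holds
    completed = []

    # Each worker prefetches exactly one task (prefetch multiplier 1).
    held = {"w1": broker.popleft(), "w2": broker.popleft()}

    # Both tasks hit their restart step and schedule a retry.
    for worker, task in held.items():
        retried = task + "_retry"
        if release_retries_to_broker:
            broker.append(retried)            # visible to any worker
        else:
            local[worker].append(retried)     # trapped in this worker

    # w1 is now stuck in a long "sleep 200" run for the rest of the
    # allocation; only w2 frees up.  w2 can drain the broker and its own
    # local holds, but it cannot reach into w1's memory.
    while broker:
        completed.append(broker.popleft())
    while local["w2"]:
        completed.append(local["w2"].popleft())
    return sorted(completed)
```

With release_retries_to_broker=False, only the retry held by the free worker completes, mirroring the "only half the samples restart" symptom; with True, both retries complete.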
dummy_simulation.txt
workers.txt