
Step restart may not work with many workers and long run times #418

Open
kustowski1 opened this issue Apr 15, 2023 · 2 comments
Labels
bug Something isn't working

Comments


kustowski1 commented Apr 15, 2023

With multiple workers and long simulation run times, only half of the workers wake up after the step restart.

I am including the "dummy_simulation.txt" spec and the "workers.txt" batch script. To reproduce the problem on a system with slurm, type:

merlin run dummy_simulation.txt
sbatch workers.txt

and, as soon as the job has started running and the tmp_restart_test_*/dummy*/* subdirectories have been created, type

more tmp_restart_test_*/dummy*/*/*log

and count how many samples reported "Restarting". All 8 samples should restart but only 4 of them do.

Description of the workflow:
- run a "sleep 1" simulation
- restart (after a one-second delay)
- print a line including "Restarting"
- run a "sleep 200" simulation
- terminate: none of the restarted "sleep 200" simulations should finish, since the allocation is set to die after 3 minutes.

However, if the "merlin resources" block is removed from the spec, and the test is repeated, all 8 samples report "Restarting", as expected. It may be worth comparing the celery command that is executed in these two scenarios.
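A quick way to make that comparison (assuming a reasonably recent merlin; I believe run-workers supports an --echo flag, though I have not verified it) is to print the generated worker launch command without actually starting any workers, once with the "merlin resources" block in the spec and once with it removed:

merlin run-workers dummy_simulation.txt --echo

Diffing the two outputs should show exactly which worker options (concurrency, prefetch, queues) differ between the two scenarios.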

dummy_simulation.txt
workers.txt

kustowski1 added the bug label Apr 15, 2023
bgunnar5 (Member) commented May 8, 2023

I started work on this last week but I think we're going to have to revisit this at a later date (once the new merlin status command is released). Here's what I found so far, though:

  • Problem persists with the latest version of celery (5.2.7)
  • Adding a restart command doesn't fix it
  • Removing the --concurrency 1 option fixes the issue, but it's unclear whether that will scale
    • I believe this is why removing the worker block entirely fixed the issue, since the default worker uses a higher concurrency
  • Removing the --prefetch-multiplier option (i.e., using the default of 4) seems to help somewhat, but still not every sample restarts
  • Changing the retry delay doesn't resolve this issue
  • Every sample in the log says it's restarting even though roughly half of them do not

Everything here is leading me to believe this is a celery issue but we'll see if the new status command can provide us with more information.
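For reference, a rough sketch of the worker invocations behind the flag experiments above; the exact command merlin generates may differ, and the queue name is only a placeholder:

# As configured in the spec (roughly half the samples never restart):
celery -A merlin worker -l INFO -Q <step_queue> --concurrency 1 --prefetch-multiplier 1

# Without --concurrency 1 (all samples restart; scaling behavior unverified):
celery -A merlin worker -l INFO -Q <step_queue> --prefetch-multiplier 1

# Without --prefetch-multiplier (celery default of 4; only a partial improvement):
celery -A merlin worker -l INFO -Q <step_queue> --concurrency 1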

bgunnar5 (Member) commented:

Here are some links that may be helpful for this issue going forward:

This will require more research, but the discussion in these user issues seems similar to the problem here. The issue may be due to ETA/countdown handling with celery tasks:

  1. Tasks are living in the queue with countdown set and a prefetch multiplier of 1
  2. A task is picked up by a celery worker
  3. That task takes a long time to complete, so the countdown timer may expire for other tasks in the queue (which can't be completed, since the prefetch multiplier is 1)
  4. The long-running task eventually needs to retry, but the worker keeps it in its own memory rather than releasing it back to the queue
  5. Another worker frees up, but since the retried task is held in the first worker's memory, the free worker can't fetch it, so it never gets completed

This is my general understanding of the problem so far.
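If steps 4 and 5 are right, the retried tasks should show up in the busy worker's internal queues rather than back on the broker. One way to check the next time this reproduces is with the standard celery inspection commands (run from the same environment merlin uses; the -A merlin app name is my assumption about how merlin configures celery):

# Tasks each worker has prefetched but not yet started:
celery -A merlin inspect reserved

# Tasks held in worker memory with an ETA/countdown (i.e. waiting to retry):
celery -A merlin inspect scheduled

# Tasks currently executing:
celery -A merlin inspect active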

Thank you @lucpeterson for the links.
