Replies: 2 comments 6 replies
-
@til Thanks for asking and also for digging into the code. My assumption with GoodJob is that it runs in a containerized environment whose orchestration works along the lines of what you laid out with Fargate.
It's also possible for GoodJob to abort active threads after receiving a TERM, which can be configured with the shutdown timeout (the shutdown timeout is disabled by default). On Heroku, for example, I might set that timeout to 25 seconds so that GoodJob aborts its threads and exits before being killed by the supervisor.

Can you say more about the kinds of jobs you're running for which 2 minutes wouldn't be sufficient? I'm not aware of other job systems that have a shutdown mechanism other than signals; please share if you're inspired by another example.

I'm wondering if exposing a cancellation check would be sufficient for you to exit long-running jobs, e.g.:

```ruby
class ReallyBigJob < ApplicationJob
  def perform
    lots_of_items.each do |item|
      break unless GoodJob.running? # or: break if GoodJob.shutting_down? || GoodJob.shutdown?
      # do some processing
    end
  end
end
```
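As an aside, here is a minimal sketch of setting that shutdown timeout in a Rails initializer; the option name used below, `config.good_job.shutdown_timeout`, assumes the usual Rails configuration style (see the README for the authoritative name, plus the `GOOD_JOB_SHUTDOWN_TIMEOUT` environment variable and `--shutdown-timeout` CLI equivalents):

```ruby
# config/initializers/good_job.rb
# Minimal sketch; check the README for the authoritative option name and default.
Rails.application.configure do
  # After TERM/INT, wait up to 25 seconds for in-progress jobs, then abort
  # their threads so the process exits before the supervisor's hard kill.
  config.good_job.shutdown_timeout = 25
end
```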
-
Thanks for your feedback (and for good_job of course)! The particular use case I need to support is a system where users can request the creation of data exports for a date range. The exports can become quite large, up to a few hundred megabytes, and involve long-running database queries, scanning and aggregating data from AWS S3, and requests to external APIs. The end result is a large single file on S3 that the user can then download, e.g. a gzip-compressed CSV. The processing itself is implemented in a streaming fashion (roughly sketched below), so it doesn't exhaust memory; the runtime simply scales linearly with the size. The users are OK with a certain wait time depending on the expected size, and there is feedback while the job runs to inform them about the progress.

Changing this particular feature so that it can be interrupted and resumed would make it significantly more complicated than it is now, I think. Deployment compatibility would also be an issue: it's desirable that even a long-running job is processed only with the version of the code it started with, instead of potentially changing its behavior mid-way. It is certainly a rare use case, and I agree that in most cases having many smaller jobs is the better approach. However, I could imagine there are a few other similar use cases where processing of large files is involved, e.g. video encoding.

Sidekiq Pro has a feature called quiet: https://rewind.com/blog/controlling-sidekiq-workers-on-aws-with-fargate-and-ssm-commands/, which I found while searching for how to send signals on Fargate. GoodJob's behavior sounds even more useful to me. If I understand correctly, when the timeout is disabled or very long, sending the good_job CLI process a TERM or INT signal should achieve exactly the graceful shutdown described in my question: finish the jobs that are currently running, start no new ones, then exit. Then the only problem that remains is how to get the signal to the process in an environment that doesn't support sending signals directly. I therefore don't think exposing a cancellation check would help, or maybe I didn't understand you correctly.
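For illustration, the export job looks roughly like the sketch below; `each_export_row`, `upload_to_s3` and the `rows_processed` column are hypothetical stand-ins for our actual query, S3 upload and progress-reporting code, not anything provided by GoodJob:

```ruby
require "csv"
require "tempfile"
require "zlib"

class DataExportJob < ApplicationJob
  queue_as :exports

  def perform(export)
    Tempfile.create(["export", ".csv.gz"]) do |file|
      gzip = Zlib::GzipWriter.new(file)
      rows = 0
      each_export_row(export) do |row|
        gzip.write(CSV.generate_line(row)) # one row at a time, memory stays flat
        rows += 1
        export.update!(rows_processed: rows) if (rows % 10_000).zero? # progress feedback
      end
      gzip.finish                          # flush the gzip footer without closing `file`
      upload_to_s3(export, file.path)      # the single large file the user downloads
    end
  end

  private

  # Hypothetical stand-ins for the real streaming database query / S3 scanning
  # and the final S3 upload.
  def each_export_row(export, &block)
    # e.g. export.relevant_records.find_each(&block)
  end

  def upload_to_s3(export, path)
    # e.g. Aws::S3::Resource.new.bucket(bucket).object(key).upload_file(path)
  end
end
```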
-
Hi there,
When running good_job with `execution_mode = :external` and `bundle exec good_job start` commands in docker containers, what would be a recommended approach to gracefully shut down the worker process, so that threads that are currently processing jobs finish those jobs (even if they potentially still take a long time to finish), but no new jobs are started, and once all threads are finished, the process terminates automatically?

It looks to me like sending an INT or TERM signal to the process would achieve this. There is a `trap` call (here: https://github.com/bensheldon/good_job/blob/main/lib/good_job/cli.rb#L106-L108) that sets `@stop_good_job_executable`. I haven't found any explicit mention of this in the documentation though, so I'm wondering if this is the intended way.

If so, I'd still need to find a way to do this in our particular environment, AWS Fargate, which as far as I can see doesn't allow sending signals to running tasks. One can only stop running tasks, which will send a TERM signal, wait for a maximum of 2 minutes, and then forcibly shut them down, which is way too short unfortunately. I think having the process send a signal to itself from an ActiveJob callback that checks some custom condition might work, something like the sketch below.
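Concretely, I'm imagining something along these lines, where `ShutdownRequest.pending?` is a hypothetical check we would implement and toggle ourselves (e.g. a database flag, or an SSM parameter polled here); it is not part of GoodJob or ActiveJob:

```ruby
class ApplicationJob < ActiveJob::Base
  # Sketch only: `ShutdownRequest.pending?` is a hypothetical, application-owned check.
  after_perform do |_job|
    # Send TERM to this worker process itself, the same as if the
    # orchestrator had been able to deliver the signal directly.
    Process.kill("TERM", Process.pid) if ShutdownRequest.pending?
  end
end
```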
Are there any other approaches? E.g. would it theoretically be possible to publish such a shutdown request with NOTIFY? There is some code mentioning shutdown in https://github.com/bensheldon/good_job/blob/main/lib/good_job/notifier.rb, but I don't understand yet what it does: is it maybe only relevant for `execution_mode = :async`?

Thanks in advance