
Health check issue with cron scheduler job #741
Closed · jorgearimitsu opened this issue Oct 31, 2022 · 6 comments

@jorgearimitsu commented Oct 31, 2022

We have a cron scheduler job that runs every Monday at 5 am that generates a report and sends it via email.
It's quite an expensive operation that uses all available threads and the CPU usage easily reaches 100%.
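
(For context, GoodJob cron entries are configured in the Rails app; a minimal sketch of such a weekly Monday 5 am entry, with hypothetical job and key names not taken from this issue, might look like:)

# Illustrative only: a weekly Monday 5 am cron entry; job and key names are hypothetical.
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    weekly_report: {
      cron: "0 5 * * 1",        # every Monday at 5:00
      class: "WeeklyReportJob"
    }
  }
end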

Last week, while this job was running, the health check started to fail, killing the good_job instance.
So it got stuck in a loop: a new instance was started, the job ran, and the health check failed again.

The only error we got was an ActiveRecord::ConnectionTimeoutError. I created a gist with the stack trace.
We couldn't find anything useful in our logs or reproduce this locally. This only happens with this specific job.

The workaround was to temporarily disable the health check, let the job run, and then turn it back on.

The only relevant change we could spot was good_job being bumped from 3.5.0 to 3.5.1 then to 3.6.0.

We're using the external execution mode with a single thread execution pool, GOOD_JOB_MAX_THREADS set to 20.
I calculated the pool size following the docs plus two extra connections: GOOD_JOB_MAX_THREADS + 5
We're using the / health check path.
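
(A rough sketch of that setup as a Rails initializer; the initializer form, fallback value, and env-var lookup here are illustrative assumptions, not copied from the issue:)

# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job.execution_mode = :external   # jobs run in a separate `good_job start` process
  config.good_job.max_threads = Integer(ENV.fetch("GOOD_JOB_MAX_THREADS", 20))
end
# with the database pool in config/database.yml sized as GOOD_JOB_MAX_THREADS + 5 (25 connections)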

Any ideas on why this is happening?

@bensheldon (Owner) commented

@jorgearimitsu thank you for opening up the issue. That's not good!

In the stacktrace I see this:

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.559 seconds); all pooled connections were in use.

I'm imagining that the healthcheck (liveness? which checks the database connection) failed because there were no more database connections, which caused GoodJob to go into a state of "not listening for new jobs":

when '/status/connected'
  connected = GoodJob::Scheduler.instances.any? && GoodJob::Scheduler.instances.all?(&:running?) &&
    GoodJob::Notifier.instances.any? && GoodJob::Notifier.instances.all?(&:listening?)
  connected ? [200, {}, ["Connected"]] : [503, {}, ["Not connected"]]

Looking at changes in those versions, I don't see anything that would have affected the behavior you're seeing, just a performance improvement to the SQL query. https://github.com/bensheldon/good_job/compare/v3.5.0..v3.6.0

(I think it would be a stretch to imagine that the performance improvement led to greater execution-thread utilization, and thus more concurrent database queries, but that's a possibility).

You could try logging ActiveRecord::Base.connection_pool.stat, which outputs the current pool status:

{:size=>10, :connections=>3, :busy=>0, :dead=>0, :idle=>3, :waiting=>0, :checkout_timeout=>5.0}

...when you write "It's quite an expensive operation that uses all available threads" it makes me think that we're just not fully accounting for all of the database connections being used within the process. Unless you're already maxing out your entire database's connections, I'd simply double to 40 and see if it keeps crashing. It's not elegant :-)

@bensheldon (Owner) commented

btw, if you wanted to continuously monitor connections during your job for debugging, you could do something like this:

def perform
  # Flag flipped at the end of the job to stop the background logging thread
  logger_cancellation = Concurrent::AtomicBoolean.new(false)

  # Log the connection pool's stats once per second from a separate thread
  Concurrent::Future.execute(args: [GoodJob::Execution]) do |thr_base|
    while logger_cancellation.false? do
      puts thr_base.connection_pool.stat
      sleep 1
    end
  end

  # <= Do all the work in your job

  logger_cancellation.make_true
end

@jorgearimitsu (Author) commented

Thanks for your help!

I'm imagining that the healthcheck (liveness? which checks the database connection) failed because there were no more database connections, which caused GoodJob to go into a state of "not listening for new jobs

We're using the / healthcheck path. Is this related to the database-connection healthcheck?

...when you write "It's quite an expensive operation that uses all available threads" it makes me think that we're just not fully accounting for all of the database connections being used within the process. Unless you're already maxing out your entire database's connections, I'd simply double to 40 and see if it keeps crashing. It's not elegant :-)

...now that you mention it, we actually decreased the pool size.

We migrated from Sidekiq and just inherited the config: pool: ENV["GOOD_JOB_MAX_THREADS"] * 2 (40 connections).
Later we decided to follow the docs and decreased it to pool: ENV["GOOD_JOB_MAX_THREADS"] + 5 (25 connections).

That's when the issue started...

I followed your suggestion increasing the pool size and it did the trick!

I also logged the current pool status and noticed that we reached 30 connections, 5 more than the previous pool size.

Pool Status: {:size=>60, :connections=>30, :busy=>30, :dead=>0, :idle=>0, :waiting=>0, :checkout_timeout=>5.0}

So you're right in saying: "we're just not fully accounting for all of the database connections being used within the process"

To be more specific about the job:

It's a cron scheduler job that enqueues a report job for each of our users.
The report job generates the report and enqueues an ActionMailer job to send it via email (a sketch of the flow follows the diagram below).

CronSchedulerJob -> ReportJob(for_each_user) -> ActionMailer::MailerDeliveryJob(for_each_user)
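
(A minimal sketch of that fan-out; class, helper, and mailer names are assumptions based on the diagram above, not the actual code:)

# Illustrative only; names are assumptions based on the flow described above.
class CronSchedulerJob < ApplicationJob
  def perform
    User.find_each do |user|
      ReportJob.perform_later(user)   # one report job per user
    end
  end
end

class ReportJob < ApplicationJob
  queue_as :reports

  def perform(user)
    report = generate_report(user)                     # hypothetical helper
    ReportMailer.weekly(user, report).deliver_later    # enqueues ActionMailer's delivery job
  end
end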

Could this flow require more connections than we're accounting for?

@bensheldon (Owner) commented

We're using the / healthcheck path.

Oops, sorry I missed that. That's the simplest health check, which basically just checks that Ruby has initialized. I actually wouldn't expect that health check to fail at all unless the Ruby application locked up entirely. That's interesting, but it also makes me more unsure of what happened. Putting that aside...
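
(For reference, the / path is handled by the simplest branch of the same probe server quoted above; roughly, paraphrasing rather than quoting the source:)

when '/', '/status'
  [200, {}, ["OK"]]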

Are you setting a custom GOOD_JOB_QUEUES e.g. --queues=default:10;mailers:10?

@jorgearimitsu (Author) commented

Are you setting a custom GOOD_JOB_QUEUES e.g. --queues=default:10;mailers:10?

Yes, I'm using: GOOD_JOB_QUEUES="reports:10;-reports"
The ReportsJob is executed in the reports queue.

@bensheldon (Owner) commented

Nuts, sounds like this is a similar miscalculation to #538 (comment)

reports:10;-reports:[max_threads]

So if max threads is 20, then you're running 30 threads total in the two execution thread pools.

Sorry about the confusion. Another data point indicating that the "max threads" configuration setting is a misnomer at this point.
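
(Spelling out the arithmetic behind that reading, using the numbers reported in this thread:)

reports:10  -> an execution pool of 10 threads for the reports queue
-reports    -> an execution pool for every other queue, sized by max_threads (20)

Execution threads alone: 10 + 20 = 30, which matches the 30 busy connections observed above and already exceeds a pool sized at GOOD_JOB_MAX_THREADS + 5 (25).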
