
Health check issue with cron scheduler job #741
Closed · jorgearimitsu opened this issue Oct 31, 2022 · 6 comments

@jorgearimitsu commented Oct 31, 2022

We have a cron scheduler job that runs every Monday at 5 am that generates a report and sends it via email.
It's quite an expensive operation that uses all available threads and the CPU usage easily reaches 100%.
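
(For context, GoodJob cron entries are configured in the Rails app; a minimal sketch of such a weekly Monday 5 am entry, with hypothetical job and key names not taken from this issue, might look like:)

# Illustrative only: a weekly Monday 5 am cron entry; job and key names are hypothetical.
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    weekly_report: {
      cron: "0 5 * * 1",        # every Monday at 5:00
      class: "WeeklyReportJob"
    }
  }
end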

Last week, while this job was running, the health check started to fail, killing the good_job instance.
So it got stuck in a loop: a new instance was started, the job ran, and the health check failed again.

The only error we got was an ActiveRecord::ConnectionTimeoutError. I created a gist with the stack trace.
We couldn't find anything useful in our logs or reproduce this locally. This only happens with this specific job.

The workaround was to temporarily disable the health check, let the job run, and then turn it back on.

The only relevant change we could spot was good_job being bumped from 3.5.0 to 3.5.1 then to 3.6.0.

We're using the external execution mode with a single thread execution pool, GOOD_JOB_MAX_THREADS set to 20.
I calculated the pool size following the docs plus two extra connections: GOOD_JOB_MAX_THREADS + 5
We're using the / health check path.
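
(A rough sketch of that setup as a Rails initializer; the initializer form, fallback value, and env-var lookup here are illustrative assumptions, not copied from the issue:)

# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job.execution_mode = :external   # jobs run in a separate `good_job start` process
  config.good_job.max_threads = Integer(ENV.fetch("GOOD_JOB_MAX_THREADS", 20))
end
# with the database pool in config/database.yml sized as GOOD_JOB_MAX_THREADS + 5 (25 connections)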

Any ideas on why this is happening?

@bensheldon (Owner) commented

@jorgearimitsu thank you for opening up the issue. That's not good!

In the stacktrace I see this:

ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.559 seconds); all pooled connections were in use.

I'm imagining that the healthcheck (liveness? which checks the database connection) failed because there were no more database connections, which caused GoodJob to go into a state of "not listening for new jobs":

when '/status/connected'
  connected = GoodJob::Scheduler.instances.any? && GoodJob::Scheduler.instances.all?(&:running?) &&
    GoodJob::Notifier.instances.any? && GoodJob::Notifier.instances.all?(&:listening?)
  connected ? [200, {}, ["Connected"]] : [503, {}, ["Not connected"]]

Looking at changes in those versions, I don't see anything that would have affected the behavior you're seeing, just a performance improvement to the SQL query. https://github.com/bensheldon/good_job/compare/v3.5.0..v3.6.0

(I think it would be a stretch to imagine that the performance improvement led to greater execution-thread utilization, and thus more concurrent database queries, but that's a possibility).

You could try logging ActiveRecord::Base.connection_pool.stat, which outputs the current pool status:

{:size=>10, :connections=>3, :busy=>0, :dead=>0, :idle=>3, :waiting=>0, :checkout_timeout=>5.0}

...when you write "It's quite an expensive operation that uses all available threads" it makes me think that we're just not fully accounting for all of the database connections being used within the process. Unless you're already maxing out your entire database's connections, I'd simply double to 40 and see if it keeps crashing. It's not elegant :-)

@bensheldon (Owner) commented

btw, if you wanted to continuously monitor connections during your job for debugging, you could do something like this:

def perform
  # Flag flipped at the end of the job to stop the background logging thread
  logger_cancellation = Concurrent::AtomicBoolean.new(false)

  # Log the connection pool's stats once per second from a separate thread
  Concurrent::Future.execute(args: [GoodJob::Execution]) do |thr_base|
    while logger_cancellation.false? do
      puts thr_base.connection_pool.stat
      sleep 1
    end
  end

  # <= Do all the work in your job

  logger_cancellation.make_true
end

@jorgearimitsu (Author) commented

Thanks for your help!

I'm imagining that the healthcheck (liveness? which checks the database connection) failed because there were no more database connections, which caused GoodJob to go into a state of "not listening for new jobs

We're using the / healthcheck path. Is this related to the database-connection healthcheck?

...when you write "It's quite an expensive operation that uses all available threads" it makes me think that we're just not fully accounting for all of the database connections being used within the process. Unless you're already maxing out your entire database's connections, I'd simply double to 40 and see if it keeps crashing. It's not elegant :-)

...now that you mention it, we actually decreased the pool size.

We migrated from Sidekiq and just inherited the config: pool: ENV["GOOD_JOB_MAX_THREADS"] * 2 (40 connections).
Later we decided to follow the docs and decreased it to pool: ENV["GOOD_JOB_MAX_THREADS"] + 5 (25 connections).

That's when the issue started...

I followed your suggestion increasing the pool size and it did the trick!

I also logged the current pool status and noticed that we reached 30 connections, 5 more than the previous pool size.

Pool Status: {:size=>60, :connections=>30, :busy=>30, :dead=>0, :idle=>0, :waiting=>0, :checkout_timeout=>5.0}

So you're right in saying: "we're just not fully accounting for all of the database connections being used within the process"

To be more specific about the job:

It's a cron scheduler job that enqueues a report job for each of our users.
The report job generates the report and enqueues an ActionMailer job to send it via email (a sketch of the flow follows the diagram below).

CronSchedulerJob -> ReportJob(for_each_user) -> ActionMailer::MailerDeliveryJob(for_each_user)
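
(A minimal sketch of that fan-out; class, helper, and mailer names are assumptions based on the diagram above, not the actual code:)

# Illustrative only; names are assumptions based on the flow described above.
class CronSchedulerJob < ApplicationJob
  def perform
    User.find_each do |user|
      ReportJob.perform_later(user)   # one report job per user
    end
  end
end

class ReportJob < ApplicationJob
  queue_as :reports

  def perform(user)
    report = generate_report(user)                     # hypothetical helper
    ReportMailer.weekly(user, report).deliver_later    # enqueues ActionMailer's delivery job
  end
end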

Could this flow require more connections than we're accounting for?

@bensheldon (Owner) commented

We're using the / healthcheck path.

Oops, sorry I missed that. That's the simplest health check, which basically just checks that Ruby has initialized. I actually wouldn't expect that health check to fail at all unless the Ruby application locked up entirely. That's interesting, but it also makes me more unsure of what happened. Putting that aside...
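
(For reference, the / path is handled by the simplest branch of the same probe server quoted above; roughly, paraphrasing rather than quoting the source:)

when '/', '/status'
  [200, {}, ["OK"]]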

Are you setting a custom GOOD_JOB_QUEUES e.g. --queues=default:10;mailers:10?

@jorgearimitsu (Author) commented

Are you setting a custom GOOD_JOB_QUEUES e.g. --queues=default:10;mailers:10?

Yes, I'm using: GOOD_JOB_QUEUES="reports:10;-reports"
The ReportsJob is executed in the reports queue.

@bensheldon (Owner) commented

Nuts, sounds like this is a similar miscalculation to #538 (comment)

reports:10;-reports:[max_threads]

So if max threads is 20, then you're running 30 threads total in the two execution thread pools.

Sorry about the confusion. Another data point indicating that the "max threads" configuration setting is a misnomer at this point.
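
(Spelling out the arithmetic behind that reading, using the numbers reported in this thread:)

reports:10  -> an execution pool of 10 threads for the reports queue
-reports    -> an execution pool for every other queue, sized by max_threads (20)

Execution threads alone: 10 + 20 = 30, which matches the 30 busy connections observed above and already exceeds a pool sized at GOOD_JOB_MAX_THREADS + 5 (25).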
