Health check issue with cron scheduler job #741
@jorgearimitsu thank you for opening up the issue. That's not good! In the stacktrace I see this: …

I'm imagining that the healthcheck (good_job/lib/good_job/probe_server.rb, lines 42 to 45 at f342566) …
Looking at changes in those versions, I don't see anything that would have affected the behavior you're seeing, just a performance improvement to the SQL query: https://github.com/bensheldon/good_job/compare/v3.5.0..v3.6.0 (I think it would be a stretch to imagine that the performance improvement led to greater execution-thread utilization, and thus more concurrent database queries, but that's a possibility).

You could try logging your connection pool's stats (connection_pool.stat, as in the snippet below), which look like:

    {:size=>10, :connections=>3, :busy=>0, :dead=>0, :idle=>3, :waiting=>0, :checkout_timeout=>5.0}

...when you write "It's quite an expensive operation that uses all available threads", it makes me think that we're just not fully accounting for all of the database connections being used within the process. Unless you're already maxing out your entire database's connections, I'd simply double the pool size.
btw, if you wanted to continuously monitor connections during your job for debugging, you could do something like this:

    def perform
      logger_cancellation = Concurrent::AtomicBoolean.new(false)

      # Log the connection pool's stats once a second from a background thread
      Concurrent::Future.execute(args: [GoodJob::Execution]) do |thr_base|
        while logger_cancellation.false? do
          puts thr_base.connection_pool.stat
          sleep 1
        end
      end

      # <= Do all the work in your job
      # (in real code, consider wrapping the work in begin/ensure so the
      #  logging loop is always stopped, even if the job raises)

      logger_cancellation.make_true
    end
Thanks for your help!
We're using the …

...now that you mention it, we actually decreased the pool size. We migrated from Sidekiq and just inherited the config: … That's when the issue started. I followed your suggestion to increase the pool size and it did the trick! I also logged the current pool status and noticed that we reached 30 connections, 5 more than the previous pool size.

So you're right in saying "we're just not fully accounting for all of the database connections being used within the process."

To be more specific about the job: it's a cron scheduler job that enqueues a reports job for each of our users. Could this flow require more connections than we're accounting for?
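For illustration, a fan-out flow like the one described might look roughly like the sketch below (class and model names are hypothetical, not from the issue). The relevant point is that every batch fetch and every enqueue briefly checks a connection out of the same ActiveRecord pool the execution threads are already using:

```ruby
# Hypothetical sketch of a cron-scheduled job that fans out one report job
# per user. Names are illustrative only.
class WeeklyReportsSchedulerJob < ApplicationJob
  def perform
    # Each batch load and each enqueue uses a database connection from the
    # process-wide ActiveRecord pool, on top of the execution threads that
    # are already running other jobs concurrently in the same process.
    User.find_each do |user|
      UserReportJob.perform_later(user.id)
    end
  end
end
```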
Oops, sorry I missed that. That's the simplest health check, which basically checks if Ruby has initialized. I actually wouldn't expect that health check to fail at all unless the Ruby application locked up entirely. That's interesting, but also makes me more unsure of what happened. Putting that aside... Are you setting a custom …?
Yes, I'm using: …
Nuts, sounds like this is a similar miscalculation to #538 (comment).

So if max threads is 20, then you're running 30 threads total in the two execution thread pools. Sorry about the confusion. Another data point indicating the configuration setting "max threads" is a misnomer at this point.
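Putting the figures mentioned in this thread together, the mismatch looks roughly like this back-of-the-envelope sketch (just arithmetic, not GoodJob internals):

```ruby
# Figures taken from the comments above.
good_job_max_threads = 20
configured_pool      = good_job_max_threads + 5   # => 25, the formula used in the issue
observed_connections = 30                         # peak seen while logging connection_pool.stat

# More threads wanting connections than the pool holds means checkouts queue
# up and can eventually raise ActiveRecord::ConnectionTimeoutError.
observed_connections > configured_pool            # => true
```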
We have a cron scheduler job that runs every Monday at 5 am that generates a report and sends it via email. It's quite an expensive operation that uses all available threads, and CPU usage easily reaches 100%.

Last week, while this job was running, the health check started to fail, killing the good_job instance. So it got stuck in a loop where a new instance was started, the job ran, and the health check failed. The only error we got was an `ActiveRecord::ConnectionTimeoutError`. I created a gist with the stack trace. We couldn't find anything useful in our logs or reproduce this locally. This only happens with this specific job.

The solution was to temporarily disable the health check, let the job run, and turn it back on again.

The only relevant change we could spot was `good_job` being bumped from 3.5.0 to 3.5.1 and then to 3.6.0.

We're using the `external` execution mode with a single execution thread pool and `GOOD_JOB_MAX_THREADS` set to 20. I calculated the pool size following the docs plus two extra connections: `GOOD_JOB_MAX_THREADS + 5`. We're using the `/` health check path.

Any ideas on why this is happening?
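As a concrete illustration of the pool sizing described above, the database entry might look something like the sketch below (it simply mirrors the `GOOD_JOB_MAX_THREADS + 5` formula from this issue; the adapter and database name are placeholders):

```yaml
# config/database.yml (production) — a sketch assuming the
# "GOOD_JOB_MAX_THREADS + 5" formula described in the issue.
production:
  adapter: postgresql
  database: myapp_production  # hypothetical database name
  pool: <%= ENV.fetch("GOOD_JOB_MAX_THREADS", 20).to_i + 5 %>
```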