High database load from job locking query #1549
We upgraded to Postgres 16, and these queries are now running a bit faster (more like ~35ms avg). However, this may also have been due to us reducing job retention to 1 day.
This is unfortunately known. The GoodJob locking query has to materialize the inner CTE query, which can be expensive.

You could try reducing the value of `queue_select_limit`. Another maintenance thing you might try is running a `REINDEX` on the table's indexes.

Also, I'm looking at the indexes and I'm wondering if there could be a more efficient index (the query has changed a little bit in GoodJob v4); see lines 100 to 103 at commit 81d824a.
Because I'm not seeing a good index specifically for the queue_name-limited situation. (Also, it's unfortunate that covering indexes are only supported in Rails 7.2+.)
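For reference, a minimal sketch of what lowering `queue_select_limit` looks like in an initializer (the value shown is purely illustrative; whether a lower limit helps depends on how deep the queue backlog gets):

```ruby
# config/initializers/good_job.rb
Rails.application.configure do
  # Caps how many candidate job rows the locking query's inner CTE considers per fetch.
  # Equivalent to the GOOD_JOB_QUEUE_SELECT_LIMIT environment variable.
  # 500 is an illustrative value, not a recommendation.
  config.good_job.queue_select_limit = 500
end
```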
FYI about this:

> The query will atomically fetch-and-lock the next runnable job that exists in the database.

So while it does do what you described, it does that entirely within a single database query using a materialized CTE.
That only affects the DSL to create them in migrations, though. Since GoodJob is Postgres-specific, there is no reason it couldn't create those indices directly, right? Some care may need to be taken with regard to how
Not a true blocker, though one of my constraints is not forcing
Ah, that's fair enough. I personally never use it. It may still warrant testing whether it improves query performance, and making it a suggestion for people to add themselves.
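As a concrete illustration of creating such an index directly with raw SQL (nothing GoodJob ships; the index name and the `INCLUDE` column list are guesses, so treat this as a sketch to benchmark rather than a recommendation):

```ruby
class AddCoveringCandidateIndexToGoodJobs < ActiveRecord::Migration[7.1]
  disable_ddl_transaction! # CONCURRENTLY cannot run inside a transaction

  def up
    # Partial covering index over unfinished jobs, matching the ORDER BY of the
    # locking query's inner CTE; INCLUDE (id) lets that CTE be served from the index.
    execute <<~SQL
      CREATE INDEX CONCURRENTLY IF NOT EXISTS index_good_jobs_unfinished_candidates
      ON good_jobs (priority ASC NULLS LAST, created_at ASC)
      INCLUDE (id)
      WHERE finished_at IS NULL;
    SQL
  end

  def down
    execute "DROP INDEX CONCURRENTLY IF EXISTS index_good_jobs_unfinished_candidates;"
  end
end
```

One caveat: indexes created with raw `execute` won't be captured in `schema.rb`, so an app relying on that format would need `structure.sql` or another way to recreate the index.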
Thanks for the replies @bensheldon.
I tried to investigate this a bit. I generated a query without so many variables, like so:

```ruby
GoodJob::Job.queue_string("latency_10s").unfinished.dequeueing_ordered(GoodJob::Job.queue_parser("latency_10s")).only_scheduled.limit(1).advisory_lock
```

This was based on reading how the code generates the query. It actually runs a query rather than printing it out, but I was able to use the logging output to grab it. I then formatted it and replaced a few variables to get this:

```sql
SELECT
"good_jobs"."id",
"good_jobs"."queue_name",
"good_jobs"."priority",
"good_jobs"."serialized_params",
"good_jobs"."scheduled_at",
"good_jobs"."performed_at",
"good_jobs"."finished_at",
"good_jobs"."error",
"good_jobs"."created_at",
"good_jobs"."updated_at",
"good_jobs"."active_job_id",
"good_jobs"."concurrency_key",
"good_jobs"."cron_key",
"good_jobs"."cron_at",
"good_jobs"."batch_id",
"good_jobs"."batch_callback_id",
"good_jobs"."executions_count",
"good_jobs"."job_class",
"good_jobs"."error_event",
"good_jobs"."labels",
"good_jobs"."locked_by_id",
"good_jobs"."locked_at"
FROM
"good_jobs"
WHERE
"good_jobs"."id" IN (
WITH "rows" AS MATERIALIZED (
SELECT
"good_jobs"."id"
FROM
"good_jobs"
WHERE
"good_jobs"."queue_name" = 'latency_10s'
AND "good_jobs"."finished_at" IS NULL
AND (
"good_jobs"."scheduled_at" <= now()
OR "good_jobs"."scheduled_at" IS NULL
)
ORDER BY
priority ASC NULLS LAST,
"good_jobs"."created_at" ASC
)
SELECT
"rows"."id"
FROM
"rows"
WHERE
pg_try_advisory_lock(
(
'x' || substr(
md5(
'good_jobs' || '-' || "rows"."id" :: text
),
1,
16
)
):: bit(64):: bigint
)
LIMIT
1
)
ORDER BY
priority ASC NULLS LAST,
"good_jobs"."created_at" ASC I then did `EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)`
Note that this is run against a replica of our production database, so it might not reflect the actual numbers in production (but the data is the same). The major time spent is in the index scan of
I don't think adding another index would help, as
We could try this, but we don't generally have many jobs sitting in the queue, so this query will already not return many rows. This seems like more of a worst-case optimisation for when the queues get backed up?
Happy to try it. The other thing we do plan to do is change our job retention to be very low, now that #1550 is merged - we'll retain discarded jobs a bit longer, but succeeded jobs can be pretty ephemeral. This should make the overall size of our good_jobs table much smaller.
Thanks for this explanation. So it is essentially looping over all runnable jobs to find the first one that can be grabbed. What happens if it doesn't get any jobs (i.e. the queue is empty or all jobs are locked)? Does the worker immediately try to re-run the query? Should it wait a bit or something? I wonder if this explains why the query gets run so many times?
I ran it against the primary, because IME replicas do have very different statistics:
Here is a visualisation based on a JSON version of the above: https://explain.dalibo.com/plan/12b14990491eb039
We tried this, and pghero is now showing average query time around the ~30ms mark, which I'm slightly confused about because both those query plans show ~10ms. (Note also that the second plan was done after the reindex.)
Just to report back about this: we have turned off GJ's job cleanup stuff and implemented our own maintenance job that basically does:

```ruby
GoodJob.cleanup_preserved_jobs(older_than: 1.minute, include_discarded: false)
GoodJob.cleanup_preserved_jobs(older_than: 5.days, include_discarded: true)
```

Ahead of the deploy, we manually deleted job records older than 1 minute. We have now seen that query drop to around 2-3ms average runtime, which is a significant improvement.

Unfortunately the job dashboard is still pretty slow to load, even though we now have only ~17k job records. (Most of them are future-scheduled jobs - we could probably implement a better design for the functionality that needs this in our app, but at the moment we are reliant on it, and that's what keeps the table a bit bigger.)

I think it would be worth considering making this sort of thing the default behaviour for retention in Good Job. Keeping fewer succeeded jobs around makes many problems go away. (Although I guess functionality like the 'Performance' tab in the dashboard would need a different way of storing the data - but that could be rolled-up metrics in a separate table.)
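For anyone wanting to try the same approach, a hedged sketch of wiring that cleanup into a recurring job via GoodJob's cron support (the class name and schedule are illustrative, not what the commenter actually runs):

```ruby
# app/jobs/good_job_maintenance_job.rb (illustrative name)
class GoodJobMaintenanceJob < ApplicationJob
  def perform
    # Drop succeeded jobs almost immediately; keep discarded jobs around longer for debugging.
    GoodJob.cleanup_preserved_jobs(older_than: 1.minute, include_discarded: false)
    GoodJob.cleanup_preserved_jobs(older_than: 5.days, include_discarded: true)
  end
end

# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    good_job_maintenance: {
      cron: "*/15 * * * *", # every 15 minutes; pick whatever cadence suits the table size
      class: "GoodJobMaintenanceJob",
    },
  }
end
```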
That's strange that the query is so slow. This is unfortunately one of those things about relational databases: there are a lot more knobs to twist for tuning.

I looked at the performance of this query on the GoodJob Demo, which usually has ~100k job rows and does about 30k of these queries an hour... and is on a fairly underpowered Heroku shared database (essential-0). It's averaging ~1ms on this query: https://goodjob-demo.herokuapp.com/pghero/queries/6923404554479764969?user=udne4cvup0tecn ...though that query is not using named queues.

Your sharing the EXPLAIN ANALYZE output is really helpful. I think the issue here is more that we don't have the right index when queue_name is used in the query. The Bitmap Index Scan / BitmapOr it's doing doesn't look very efficient (and a heap scan!). Would you be willing to try adding a new index?

```ruby
add_index :good_jobs, ["queue_name", "priority", "created_at"],
          name: "index_good_job_jobs_for_candidate_lookup_by_queue",
          where: "(finished_at IS NULL)"
```
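If that gets added to a large, busy table, a minimal migration sketch that builds it without blocking writes (assuming the usual `algorithm: :concurrently` approach; the migration class name is illustrative):

```ruby
class AddCandidateLookupByQueueIndexToGoodJobs < ActiveRecord::Migration[7.1]
  disable_ddl_transaction! # required for algorithm: :concurrently

  def change
    add_index :good_jobs, [:queue_name, :priority, :created_at],
              name: "index_good_job_jobs_for_candidate_lookup_by_queue",
              where: "(finished_at IS NULL)",
              algorithm: :concurrently,
              if_not_exists: true
  end
end
```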
Hey @bensheldon, we'll add that index and report back. Since we lowered our retention the numbers already look much better, but we can at least get a before/after query plan, and I'd be open to re-raising our retention temporarily to see how the index fares.
We added the index you suggested; here are before/after plans.

Before:

After (we also did ...):

It doesn't look like the new index is used in the plan, but bear in mind that our jobs table currently has only ~18k rows. If you think it would be useful, we could try increasing our job retention again so that the table gets bigger, and then see if the index is used.
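One way to keep an eye on whether the planner ever starts picking the new index as the table grows is to watch its scan counters; a hedged sketch from a Rails console:

```ruby
# Uses Postgres's standard pg_stat_user_indexes view; counters accumulate since the
# last stats reset, so check them after the workers have been running for a while.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT indexrelname, idx_scan, idx_tup_read
  FROM pg_stat_user_indexes
  WHERE relname = 'good_jobs'
  ORDER BY idx_scan DESC
SQL
rows.each { |row| puts row.values_at("indexrelname", "idx_scan", "idx_tup_read").join("\t") }
```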
We are seeing high database load from Good Job's locking query.
Here are the four queries, as reported by pghero, with calls / avg runtime / percentage runtime, over a 5 minute period:
Query one (1,881 calls, 45 ms avg, 24%)
Query two (1,424 calls, 46 ms avg, 18%)
Query three (1,302 calls, 42 ms avg, 15%)
Query four (1,050 calls, 42 ms avg, 12%)
The only difference between these queries is the `"good_jobs"."queue_name"` constraint.

Our queues are configured like so:
The four queries clearly correspond to each of those worker configurations.
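Purely as a hypothetical illustration of GoodJob's queue/thread syntax (only `latency_10s` is visible in the captured query; the other queue names and thread counts below are invented, not the actual settings):

```ruby
# Hypothetical shape only; NOT the reporter's real configuration.
Rails.application.configure do
  # "queue_name:threads" pools separated by semicolons
  config.good_job.queues = "latency_10s:5;latency_60s:5;mailers:5;default:5"
end
```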
We currently run four worker processes (so 4*5 = 20 total threads).
During a 5 minute period, we processed approximately 1,000 jobs. The sum of the query calls is 5,657, which works out to roughly 19 locking queries per second (about one per thread per second), versus roughly 3 jobs performed per second.
Without being super familiar with Good Job's implementation details, my understanding is that this query obtains a lock for the job it is about to perform. Obtaining the lock may fail, in which case a worker will need to try again.
Perhaps each worker is racing to obtain a lock for a new job as it gets pushed onto the queue? But only one worker can actually obtain that lock.
Is this expected? I wonder if there is a way to structure this so that it doesn't require quite so many queries?
In the meantime, we are going to try reducing our job retention time. This presumably won't reduce the number of queries, but perhaps it will make them run faster and take up less cumulative time. We currently retain jobs for 5 days, and we have ~1.7m jobs preserved over that period.
Note that we already have `GOOD_JOB_QUEUE_SELECT_LIMIT=1000` set.

We are using Postgres 14.