Proposal: convert from session-based Advisory Locks to locked_at/locked_by columns #831
-
Having followed along with some of your recent blog/reddit posts, I am happy to see this discussion! Not because I need these changes personally right now, but because scalability/performance is obviously top-of-mind for a lot of folks as they choose a background job processor, and that's where the discussion inevitably seems to go in any forum discussing Postgres/MySQL queues. Two non-technical opinions:
1. My opinion is no. I see this more as a natural evolution of GoodJob as it and its community grow, akin to your decision to move beyond supporting only core ActiveJob capabilities early on (which has been great for GoodJob).
2. Even though I am not in need of these changes and would love to see some of the new features you mentioned, I tend to think that supporting very large deployments would be extremely beneficial for GoodJob long term: social proof from larger, name-recognizable users will make it easier to win over prospective users and grow the community, and introducing GoodJob to increasingly sophisticated deployments will help flush out obscure/outlier bugs. I don't know if any of this is helpful, but hopefully it can help spark further discussion!
-
I appreciate your thoughtfulness as you're exploring this. I don't know that I have much of an opinion either direction. I'm content with the session-based locks at the moment, but that likely is an artifact of not running high enough job volumes for it to be a problem. Nevertheless, a couple of quick thoughts:
Not sure if this field is intended to be informational or used as part of tracking locked jobs, but some Docker/containerized deployments number processes starting from 1 in each container, so PIDs can easily repeat across containers (usually as PID 1). I believe most container runtimes do set the hostname according to the container name. Of course, PIDs can potentially repeat across hosts too. If the intent is just informational, hostname would be a nice-to-have. If you're tracking/locking/retrying based on the PID value (and not just its presence), then hostname might be a must.
You may have this documented elsewhere and I just haven't run across it, but what's the value of storing Executions at all? It would seem that DB rows could be reduced quite a bit (especially when retrying, whether due to concurrency, errors, or intentional retries) by storing just the Job and updating the existing record when retrying, as opposed to adding what feel on the surface like duplicate rows. Would storing only a single Job record help scaling?
-
I would love this; I've been thinking about it ever since I first discovered GoodJob: does it REALLY need advisory locks? To me it is a higher priority than most, possibly all, of the other things you list. It is the main thing I am aware of keeping me from adopting GoodJob in my production projects. I agree that the main challenge is the added complexity of a process heartbeat. (Is a process heartbeat the design Sidekiq uses? I think so?) Right now, I think advisory locks give you better behavior than most other ActiveJob queues with regard to killed/failed jobs, in that they'll be noticed very quickly. While we don't want killed jobs to be lost, and we want some way for them to be noticed and retried, it doesn't need to be extremely low latency or better than the alternatives; it just needs to... do a good job! I also wonder if removing advisory locks could help GoodJob arrive at something that didn't actually require Postgres and could work with MySQL too (or conceivably any ActiveRecord-supported RDBMS), which would definitely increase the audience further. Are there other pg-specific features still being used? If so, no big deal; eliminating the Postgres dependency is really a separate question, and I think it's fine to require pg if there's reason to... but it would be pretty sweet not to, if it can be swung.
As I think you discovered, building a reliable, polished, production-ready job backend is a lot harder than it seems. There are consequently very few real peers in the field, and those that do exist were not started recently (and have varying levels of maintenance; Resque is kind of on life support). As someone who would love something that is open source, uses an RDBMS backend, is polished and production-ready, and does not require session-based features... I suspect GoodJob is more likely than any other route to get us there, and I think it unlikely that any other from-scratch project is going to appear (let alone reach production-readiness) anytime soon. :)
-
Update
I've made a lot of progress down this path and wanted to summarize where things are. Some of this is redundant with what I previously wrote, but I figured it was easier to read (and easier to write) to say it again. In regards to versions:
Current locking strategy
Briefly, the current (v3) GoodJob takes an advisory lock on individual rows of the good_jobs table. In addition to ensuring jobs are appropriately locked (and not performed in duplicate), advisory locks are nice because they provide ideal, minimal latency in the event of a process SIGKILL, because the job is implicitly unlocked when the database connection is severed. I think that "what happens if the process unexpectedly exits?" is the interesting technical challenge here, and my intention is to avoid introducing operational constraints such as job maximum-execution timeouts.
Future locking strategy
The future locking strategy is a combination of two pieces:
- Row-level locking: dequeuing a job takes a row lock (FOR NO KEY UPDATE SKIP LOCKED) and records it in locked_by_id/locked_at columns (see "The future query" below; the release side is sketched right after this list).
- A process heartbeat, so that jobs locked by a process that has unexpectedly exited can be noticed and unlocked/retried.
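To complement the dequeue query shown further down, here is a rough sketch of the release side, assuming the locked_by_id/locked_at columns described above (the exact columns and $-placeholders are illustrative, not the final implementation):

```sql
-- Sketch: clearing the row-level lock when a job finishes.
-- $1 = job id, $2 = identifier of the process that locked it.
UPDATE good_jobs
SET locked_by_id = NULL,
    locked_at = NULL,
    finished_at = now()
WHERE id = $1
  AND locked_by_id = $2  -- only the process holding the lock releases it
RETURNING id;
```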
Also, there are a number of additional places that use advisory locks (e.g. calculations for Concurrency Control, or calculations around Batch callbacks). I do not expect to get rid of all advisory locks there, but I do expect it will not be necessary to use connection-based advisory locks; transaction-based ones can be used instead (currently I think connection-based advisory locks are used in those places, but that's hopefully unintentional and I'm not missing anything 😓). The difference is sketched below.
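For reference, a minimal illustration of a transaction-level advisory lock, which is released automatically at commit/rollback rather than being tied to the session (the key value 12345 is arbitrary):

```sql
-- Transaction-level advisory lock: no explicit unlock needed and no
-- session-level state, so it is friendlier to transaction-mode pooling.
BEGIN;
SELECT pg_try_advisory_xact_lock(12345);  -- true if this transaction now holds the lock
-- ... do the work that needs the lock ...
COMMIT;  -- the advisory lock is released here
```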
So in summary:
The goals
I see two reasons for making this change:
- Remove the requirement for session-level database connections, which would make GoodJob compatible with transaction-mode connection poolers like PgBouncer.
- Improve the performance of locking/dequeuing jobs at high job volumes.
The values
-
I may have found a Postgres DBA consultant. Sharing more details for their review:
The current query
This is the current locking query, which uses a connection-level advisory lock to dequeue a job:
```sql
SELECT
"good_jobs".*
FROM
"good_jobs"
WHERE
"good_jobs"."id" IN (
WITH "rows" AS MATERIALIZED (
SELECT
"good_jobs"."id",
"good_jobs"."active_job_id"
FROM
"good_jobs"
WHERE
"good_jobs"."finished_at" IS NULL
AND "good_jobs"."scheduled_at" <= '2023-02-15 20:55:33.668333'
--- and other conditions, like queue_name
ORDER BY
priority DESC NULLS LAST,
created_at ASC
)
SELECT
"rows"."id"
FROM
"rows"
WHERE
pg_try_advisory_lock(('x' || substr(md5('good_jobs' || '-' || "rows"."active_job_id"::text), 1, 16))::bit(64)::bigint)
LIMIT $1
)
ORDER BY
priority DESC NULLS LAST,
  created_at ASC
```
Just looking at that query, I realize that the
I have observed that, roughly every ~10,000 dequeues under heavy concurrency, a result will be returned that is not actually advisory locked. To work around that, there is a second query to verify that the row is advisory locked, and that query has been reported to be slow. Postgres doesn't have a function to check whether an advisory lock is held, so this is somewhat of a workaround that queries the pg_locks view:
```sql
SELECT
1 AS one
FROM "good_jobs"
LEFT JOIN
pg_locks ON
pg_locks.locktype = 'advisory'
AND pg_locks.objsubid = 1
AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || "good_jobs"."active_job_id"::text), 1, 16))::bit(32)::int
AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || "good_jobs"."active_job_id"::text), 1, 16))::bit(64) << 32)::bit(32)::int
WHERE
"good_jobs"."finished_at" IS NULL
AND ("pg_locks"."pid" = pg_backend_pid())
AND "good_jobs"."id" = $1
LIMIT 1
```
This is the index being used:
```ruby
t.index ["priority", "created_at"],
name: "index_good_jobs_jobs_on_priority_created_at_when_unfinished",
order: { priority: "DESC NULLS LAST" },
where: "(finished_at IS NULL)" The future queryThe goal is to replace the advisory lock query with a more typical row-level dequeue query: WITH "rows" AS MATERIALIZED (
SELECT
*
FROM "good_jobs"
WHERE
"good_jobs".locked_by_id IS NULL
AND "good_jobs"."finished_at" IS NULL
AND "good_jobs"."scheduled_at" <= '2023-02-15 20:55:33.668333'
--- and other conditions, like queue_name
ORDER BY
priority ASC,
created_at ASC
LIMIT 1
FOR NO KEY UPDATE SKIP LOCKED
)
UPDATE "good_jobs"
SET
"good_jobs"."locked_by_id" = $1,
"good_jobs"."locked_at" = $2
FROM rows
WHERE
"good_jobs".'id' = "rows"."id"
RETURNING
  *
```
This is adapted from QueueClassic's query: https://github.com/QueueClassic/queue_classic/blob/ac45544c1368176503b976dad42dd0aee0be5334/lib/queue_classic/queue.rb#L92
And from my testing, this index fits fairly well (doing an index-scan on the
```ruby
t.index ["priority", "created_at", "scheduled_at"],
name: "index_good_jobs_dequeue",
where: "((finished_at IS NULL) AND (locked_by_id IS NULL))" Note: the next version of GoodJob changes from Other questions:
Other questions:
The transitional query
What is a query that will lock records both by row and with an advisory lock, so that jobs are not double-performed while old (advisory-lock) and new (row-lock) processes run side by side during an upgrade? One possible shape is sketched below.
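Not a definitive answer, but one possible sketch that takes the row lock via SKIP LOCKED and only claims the row if the legacy advisory lock is also available (the $-placeholders stand for the process identifier and timestamp; untested):

```sql
WITH "rows" AS MATERIALIZED (
  SELECT "good_jobs"."id", "good_jobs"."active_job_id"
  FROM "good_jobs"
  WHERE "good_jobs"."locked_by_id" IS NULL
    AND "good_jobs"."finished_at" IS NULL
    AND "good_jobs"."scheduled_at" <= now()
  ORDER BY priority ASC, created_at ASC
  LIMIT 1
  FOR NO KEY UPDATE SKIP LOCKED
)
UPDATE "good_jobs"
SET "locked_by_id" = $1,
    "locked_at" = $2
FROM "rows"
WHERE "good_jobs"."id" = "rows"."id"
  -- only claim the row if the legacy advisory lock can also be taken;
  -- if this returns false the UPDATE matches no rows and the job is skipped this poll
  AND pg_try_advisory_lock(
        ('x' || substr(md5('good_jobs' || '-' || "rows"."active_job_id"::text), 1, 16))::bit(64)::bigint
      )
RETURNING "good_jobs".*;
```

One caveat: pg_try_advisory_lock is session-level, so the new-style process would still need to explicitly unlock (or rely on disconnect) when the job finishes, for as long as the transitional query is in use.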
-
I'm not committed to this, but I'm seriously considering it and wanted to make sure I'm approaching it comprehensively.
Currently
GoodJob uses session-based Advisory Locks to "lock" jobs and ensure the same job isn't executed in parallel by multiple threads/processes. These Advisory Locks are also used for Concurrency Control and Batches to avoid race conditions.
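For anyone less familiar with them, a minimal illustration of session-level advisory locks (the key value 12345 is arbitrary):

```sql
-- Session-level advisory locks are held by the database connection, not a row or
-- transaction: they persist until explicitly unlocked or until the session ends
-- (e.g. the worker process is SIGKILLed and its connection is severed).
SELECT pg_try_advisory_lock(12345);  -- true if this session now holds key 12345
-- ... perform the job ...
SELECT pg_advisory_unlock(12345);    -- explicit release when the job finishes
```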
Benefits of Session-Based Advisory Locks:
- The same job can't be executed in parallel by multiple threads/processes.
- Locks are released implicitly and immediately when the database connection is severed (e.g. on SIGKILL), with no lock timeouts to tune.
Downsides of Session-Based Advisory Locks:
- They require a session-level (dedicated) database connection, which is incompatible with transaction-mode connection poolers like PgBouncer.
- They make the dequeue query complex and harder to make performant, because it can't simply be an indexed query ordered and limited on priority, scheduled_at.
These are the two downsides I've seen given as reasons not to use GoodJob, though I'm also aware of some fairly large GoodJob deployments (10M+ jobs/day) that themselves haven't raised concerns. I also know people have had performance problems with queue_classic, which seems to be the best at performant job locking. So maybe PgBouncer compatibility is the more important use case.
Design Changes
I don't think a lot would have to change with GoodJob to move away from connection-based advisory locks.
- Locked jobs would be updated with a locked_at timestamp and a locked_by_process_id column to indicate that they were locked.
- Dequeues would need to account for abandoned locks via a timeout (e.g. where locked_at IS NULL or locked_at < NOW() - (timeout)); a fuller sketch of that condition is below.
- GoodJob would still use LISTEN/NOTIFY, which does require a dedicated database session. Though that would be 1 dedicated connection per process, versus the current requirement of 1 per thread.
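A rough sketch of what that dequeue condition could look like, assuming the locked_at/locked_by_process_id columns above and an illustrative 5-minute timeout:

```sql
-- Sketch: a job is available if it was never locked, or if its lock looks abandoned
-- because the locking process has not refreshed locked_at within the timeout.
SELECT "good_jobs"."id"
FROM "good_jobs"
WHERE "good_jobs"."finished_at" IS NULL
  AND "good_jobs"."scheduled_at" <= now()
  AND (
    "good_jobs"."locked_at" IS NULL
    OR "good_jobs"."locked_at" < now() - interval '5 minutes'
  )
ORDER BY priority ASC, created_at ASC
LIMIT 1
FOR NO KEY UPDATE SKIP LOCKED;
```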
Incremental value and de-risking
Here are some changes that could be made that would provide some value in the current system and would make the overall change easier:
- Store Jobs rather than Executions in the good_jobs table. Advisory Locks prevent parallel execution because, when a job execution triggers a retry, the newly inserted retry record is implicitly locked by the first execution's advisory lock (both execution records have the same active_job_id). The change would be to have the good_jobs table be unique on active_job_id and have a separate good_job_executions table that would only be inserted into when a job starts executing/retrying (a rough sketch below). This would probably be a spicy change to do with zero downtime.
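Purely illustrative, not a migration plan: one hypothetical shape of that split (table/column names and types are assumptions):

```sql
-- One row per job, unique on active_job_id...
CREATE UNIQUE INDEX index_good_jobs_on_active_job_id ON good_jobs (active_job_id);

-- ...plus an append-only record per execution/retry attempt.
CREATE TABLE good_job_executions (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- requires PostgreSQL 13+ or pgcrypto
  active_job_id uuid NOT NULL,                    -- which job this attempt belongs to
  created_at timestamptz NOT NULL DEFAULT now(),
  finished_at timestamptz,
  error text
);
```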
Opportunity costs
Doing this will take labor. The thoughts I have are:
Final thoughts
Having written this out, I don't think I'm closer to committing to the big change. But in the sense of "make the change easy, then make the easy change," I'll start trying to insert the incremental pieces into my work stream (alongside the other features/improvements I want to work on), and eventually it may be an easier decision to make.