
fix: increase workflow run pop timeout, fix broken concurrency query #1189

Merged
abelanger5 merged 5 commits into main from belanger/increase-pop-timeout on Jan 21, 2025

Conversation

abelanger5
Contributor

Description

Fixes:

  • Adds a limit to the PopWorkflowRuns... query so we don't accidentally update too many rows (we should potentially increase this further); see the sketch after this list
  • Increases the timeout allocated to this query
  • Fixes a random seqnum cross product issue which was making the query slow
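
To make the shape of the change concrete, here is a minimal sketch of a limited pop in PostgreSQL. The table, column, and status names are assumptions for illustration only, not the actual Hatchet query:

    -- Hypothetical sketch: pop up to 500 queued runs. Each run is ranked within its
    -- concurrency group by creation time, then the batch is taken oldest-rank-first
    -- so no single group can consume the whole limit.
    WITH ranked AS (
        SELECT
            "id",
            "createdAt",
            ROW_NUMBER() OVER (
                PARTITION BY "concurrencyGroupId"
                ORDER BY "createdAt"
            ) AS seqnum
        FROM workflow_runs          -- assumed table name
        WHERE status = 'QUEUED'     -- assumed status column and value
    )
    UPDATE workflow_runs wr
    SET status = 'RUNNING'
    FROM (
        SELECT "id"
        FROM ranked
        ORDER BY seqnum, "createdAt"
        LIMIT 500
    ) popped
    WHERE wr."id" = popped."id"
    RETURNING wr."id";

A real pop query would likely also need row locking (for example FOR UPDATE SKIP LOCKED) to be safe with concurrent callers; that is omitted here.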

Type of change

  • Bug fix (non-breaking change which fixes an issue)


abelanger5 requested a review from grutt on January 16, 2025, 03:51
@reillyse
Contributor

How often do we call this function in regular operation? I feel that 30 seconds is probably way too long for a DB timeout - if we start going that long something is definitely hosed - and we'll probably have a ton of other operations queued up behind us so it's not going to help us.
I'd rather error out earlier, and/or reduce the limit (although I feel like 500 should be manageable now that we don't have the cross product).

@reillyse
Contributor

I think there is a potential for starvation / unfairness here if we have a very large number of concurrency groups (which I think is the issue here, right?).

I think our limit needs to be at least
SELECT COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs

so that we don't ignore any concurrency groups that are at the end (say the 501st concurrency group if we had one for everything)

Also I think our limit needs to be a multiple of COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs because otherwise we'll unfairly schedule the first groups in the ordering.
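
For example, a hypothetical way to size the limit from the group count (the status filter and the 4x multiplier here are assumptions, just to show the shape):

    -- Hypothetical: derive the pop limit as a multiple of the distinct group count,
    -- with a floor of 500, so every group can get several runs per batch.
    SELECT GREATEST(500, 4 * COUNT(DISTINCT "concurrencyGroupId")) AS pop_limit
    FROM workflow_runs
    WHERE status = 'QUEUED';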

@reillyse
Contributor

Also, I feel like I'll need to add an obligatory nit about the GitHub Actions not passing, to save @grutt the hassle 😆

@abelanger5
Contributor Author

How often do we call this function in regular operation? I feel that 30 seconds is probably way too long for a DB timeout - if we start going that long something is definitely hosed - and we'll probably have a ton of other operations queued up behind us so it's not going to help us. I'd rather error out earlier, and/or reduce the limit (although I feel like 500 should be manageable now that we don't have the cross product).

This method is called very often with high usage of concurrency keys -- the problem with this query is that it scales linearly with the number of queued workflow runs and concurrency keys. So we'd likely expect to see > 5 seconds if we've enqueued > 1 million workflow runs that have very distinct concurrency keys. And the thing that concerns me about a low timeout is that we may be cancelling work after the database has spent a bunch of cycles nearly completing a query (and if this query times out, there's no recovering the queue).

and we'll probably have a ton of other operations queued up behind us so it's not going to help us

this should be written in a way where there aren't other operations queued up behind us, so I'm less concerned about this

I think our limit needs to be at least
SELECT COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs

so that we don't ignore any concurrency groups that are at the end (say the 501st concurrency group if we had one for everything)

I'm not sure I follow -- yes, the limit will only pop 500 workflow runs at a time, but it shouldn't be unfair because seqnum is ordered by the workflow run's createdAt/insertOrder, so the next time we call the query the 501st workflow run from the 501st concurrency group should be in the first slot, unless I'm missing something?
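
As a rough illustration of that argument (assuming seqnum is a window function recomputed over whatever is still queued; all names here are illustrative, not the actual query):

    -- Hypothetical: seqnum is re-derived on every call from the runs that remain
    -- queued, so a run skipped by the previous batch (e.g. the 501st group's only
    -- run) ranks at the front of the next batch rather than being starved.
    SELECT
        "id",
        "concurrencyGroupId",
        ROW_NUMBER() OVER (
            PARTITION BY "concurrencyGroupId"
            ORDER BY "createdAt"
        ) AS seqnum
    FROM workflow_runs
    WHERE status = 'QUEUED'
    ORDER BY seqnum, "createdAt"
    LIMIT 500;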

@reillyse
Contributor

Just leaving a review because I think I only commented before. I think with the changes we talked about on the call today, this should be good.

abelanger5 enabled auto-merge (squash) on January 21, 2025, 21:18
abelanger5 disabled auto-merge on January 21, 2025, 23:24
abelanger5 merged commit b691117 into main on Jan 21, 2025
30 checks passed
abelanger5 deleted the belanger/increase-pop-timeout branch on January 21, 2025, 23:24