
fix: increase workflow run pop timeout, fix broken concurrency query #1189

Merged
abelanger5 merged 5 commits into main from belanger/increase-pop-timeout on Jan 21, 2025

Conversation

abelanger5
Contributor

Description

Fixes:

  • Adds a limit to the PopWorkflowRuns... query so we don't accidentally update too many rows (we should potentially increase this further); see the sketch after this list
  • Increases the timeout allocated to this query
  • Fixes a random seqnum cross product issue which was making the query slow
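
To make the shape of the change concrete, here is a minimal sketch of a limited pop in PostgreSQL. The table, column, and status names are assumptions for illustration only, not the actual Hatchet query:

    -- Hypothetical sketch: pop up to 500 queued runs. Each run is ranked within its
    -- concurrency group by creation time, then the batch is taken oldest-rank-first
    -- so no single group can consume the whole limit.
    WITH ranked AS (
        SELECT
            "id",
            "createdAt",
            ROW_NUMBER() OVER (
                PARTITION BY "concurrencyGroupId"
                ORDER BY "createdAt"
            ) AS seqnum
        FROM workflow_runs          -- assumed table name
        WHERE status = 'QUEUED'     -- assumed status column and value
    )
    UPDATE workflow_runs wr
    SET status = 'RUNNING'
    FROM (
        SELECT "id"
        FROM ranked
        ORDER BY seqnum, "createdAt"
        LIMIT 500
    ) popped
    WHERE wr."id" = popped."id"
    RETURNING wr."id";

A real pop query would likely also need row locking (for example FOR UPDATE SKIP LOCKED) to be safe with concurrent callers; that is omitted here.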

Type of change

  • Bug fix (non-breaking change which fixes an issue)


abelanger5 requested a review from grutt on January 16, 2025, 03:51
@reillyse
Contributor

How often do we call this function in regular operation? I feel that 30 seconds is probably way too long for a DB timeout - if we start going that long something is definitely hosed - and we'll probably have a ton of other operations queued up behind us so it's not going to help us.
I'd rather error out earlier, and/or reduce the limit (although I feel like 500 should be manageable now that we don't have the cross product).

@reillyse
Contributor

I think there is a potential for starvation / unfairness here if we have a very large number of concurrency groups (which I think is the issue here, right?).

I think our limit needs to be at least
SELECT COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs

so that we don't ignore any concurrency groups that are at the end (say the 501st concurrency group if we had one for everything)

Also I think our limit needs to be a multiple of COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs because otherwise we'll unfairly schedule the first groups in the ordering.
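
For example, a hypothetical way to size the limit from the group count (the status filter and the 4x multiplier here are assumptions, just to show the shape):

    -- Hypothetical: derive the pop limit as a multiple of the distinct group count,
    -- with a floor of 500, so every group can get several runs per batch.
    SELECT GREATEST(500, 4 * COUNT(DISTINCT "concurrencyGroupId")) AS pop_limit
    FROM workflow_runs
    WHERE status = 'QUEUED';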

@reillyse
Contributor

Also, I feel like I'll need to add an obligatory nit about the GitHub Actions not passing, to save @grutt the hassle 😆

@abelanger5
Contributor Author

How often do we call this function in regular operation? I feel that 30 seconds is probably way too long for a DB timeout - if we start going that long something is definitely hosed - and we'll probably have a ton of other operations queued up behind us so it's not going to help us. I'd rather error out earlier, and/or reduce the limit (although I feel like 500 should be manageable now that we don't have the cross product).

This method is called very often with high usage of concurrency keys -- the problem with this query is that it scales linearly with the number of queued workflow runs and concurrency keys. So we'd likely expect to see > 5 seconds if we've enqueued > 1 million workflow runs that have very distinct concurrency keys. And the thing that concerns me about a low timeout is that we may be cancelling work after the database has spent a bunch of cycles nearly completing a query (and if this query times out, there's no recovering the queue).

and we'll probably have a ton of other operations queued up behind us so it's not going to help us

this should be written in a way where there aren't other operations queued up behind us, so I'm less concerned about this

I think our limit needs to be at least
SELECT COUNT(DISTINCT "concurrencyGroupId") FROM workflow_runs

so that we don't ignore any concurrency groups that are at the end (say the 501st concurrency group if we had one for everything)

I'm not sure I follow -- yes, the limit will only pop 500 workflow runs at a time, but it shouldn't be unfair because seqnum is ordered by the workflow run's createdAt/insertOrder, so the next time we call the query the 501st workflow run from the 501st concurrency group should be in the first slot, unless I'm missing something?
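
As a rough illustration of that argument (assuming seqnum is a window function recomputed over whatever is still queued; all names here are illustrative, not the actual query):

    -- Hypothetical: seqnum is re-derived on every call from the runs that remain
    -- queued, so a run skipped by the previous batch (e.g. the 501st group's only
    -- run) ranks at the front of the next batch rather than being starved.
    SELECT
        "id",
        "concurrencyGroupId",
        ROW_NUMBER() OVER (
            PARTITION BY "concurrencyGroupId"
            ORDER BY "createdAt"
        ) AS seqnum
    FROM workflow_runs
    WHERE status = 'QUEUED'
    ORDER BY seqnum, "createdAt"
    LIMIT 500;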

@reillyse
Contributor

Just leaving a review because I think I only commented before. I think with the changes we talked about on the call today, this should be good.

abelanger5 enabled auto-merge (squash) on January 21, 2025, 21:18
abelanger5 disabled auto-merge on January 21, 2025, 23:24
abelanger5 merged commit b691117 into main on Jan 21, 2025
30 checks passed
abelanger5 deleted the belanger/increase-pop-timeout branch on January 21, 2025, 23:24