Prefect Workers crash when server returns 500s #16977
Hey @ashtuchkin - thank you for the detailed bug report! First question: could you share a stack trace including the exact status codes you're seeing? That would help me pin down the error types when updating the base worker logic. The underlying client does have some amount of finite retry logic for certain status codes, and it can be augmented through the client retry settings.
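A minimal sketch of what that configuration could look like, assuming the Prefect 2.x client retry settings (the setting names and values below should be verified against your installed version):

```python
import os

# Assumed Prefect 2.x client retry settings -- verify names against your version.
# Allow more retries before a request to the API is given up on...
os.environ["PREFECT_CLIENT_MAX_RETRIES"] = "10"
# ...and also retry plain HTTP 500 responses, which are not retried by default.
os.environ["PREFECT_CLIENT_RETRY_EXTRA_CODES"] = "500"
```

These would need to be present in the worker's environment before the worker process starts, e.g. set on the Kubernetes Deployment rather than inside a flow.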
Sure! Here's an example exception after which the worker exits:
Thanks for the pointer to the client retry settings.
Bug summary
We run a pretty big self-hosted installation of Prefect 2.x (20k flow runs/day, 200k tasks) and noticed that when the self-hosted API server becomes overloaded, it starts returning HTTP 500s.
That's OK by itself, but it makes the Workers (we use Kubernetes) exit unexpectedly and then get restarted by K8s (issue 1).
Specifically, we see the following problematic stack traces, after which the worker exits:
Restarting also seems OK by itself; however, we noticed that if a different flow run had been marked PENDING but no K8s Job had been scheduled yet when the worker exited, it will be stuck in PENDING forever (issue 2). Here's the relevant code:
prefect/src/prefect/workers/base.py
Lines 972 to 977 in c4ac231
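To make the race concrete, here is a rough paraphrase of the submission flow referenced above (not the actual base.py code; the method names are illustrative):

```python
# Illustrative paraphrase of the worker's submission flow (hypothetical names).
async def _submit_run(self, flow_run):
    # Step 1: the flow run is marked PENDING on the API server.
    ready = await self._propose_pending_state(flow_run)
    if not ready:
        return

    # Step 2: only afterwards is the Kubernetes Job created.
    # If the worker crashes between step 1 and step 2, no Job is ever
    # submitted, and nothing moves the run out of PENDING (issue 2).
    await self._submit_run_and_capture_errors(flow_run)
```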
Issue 2 seems relatively hard to fully resolve, as it's impossible to atomically mark a flow run as PENDING and submit a Job to K8s. Maybe we could do something by storing the state locally, but that won't work if the pod is restarted on a different node.
Issue 1 looks more straightforward though. There's already a try/except around these places, but it only catches some exceptions, not all of them. Hopefully it'd be easy to resolve by broadening that handling; a rough sketch follows.
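A minimal sketch of that broader handling, assuming httpx exceptions are what currently escapes (illustrative only, not a tested patch; method names mirror the sketch above):

```python
import httpx

# Hypothetical variant of the submission flow with broader error handling.
async def _submit_run(self, flow_run):
    try:
        ready = await self._propose_pending_state(flow_run)
    except (httpx.HTTPStatusError, httpx.RequestError) as exc:
        # A transient 5xx from the API should not take down the whole worker:
        # log the failure and skip this run so it can be retried on a later poll.
        self._logger.exception(
            "Failed to mark flow run %s as PENDING: %s", flow_run.id, exc
        )
        return
    if not ready:
        return

    await self._submit_run_and_capture_errors(flow_run)
```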
Version info
Additional context
No response