Prefect Workers crash when server returns 500s #16977

Open
ashtuchkin opened this issue Feb 5, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@ashtuchkin
Contributor

Bug summary

We run a pretty big self-hosted installation of Prefect 2.x (20k flow runs/day, 200k tasks) and noticed that when the self-hosted API server becomes overloaded, it starts returning HTTP 500s.

That's OK by itself, but it makes the Workers (we use Kubernetes) exit unexpectedly and then get restarted by K8s (issue 1).
Specifically, we see the following problematic stack traces, after which the worker exits:

  • _submit_run -> _check_flow_run -> read_deployment
  • cancel_run -> _get_configuration -> read_deployment

Restarting also seems OK by itself; however, we noticed that if another flow run was marked PENDING but no K8s Job had been scheduled yet when the worker exited, it stays stuck in PENDING forever (issue 2). Here's the relevant code:

ready_to_submit = await self._propose_pending_state(flow_run)
self._logger.debug(f"Ready to submit {flow_run.id}: {ready_to_submit}")
if ready_to_submit:
    readiness_result = await self._runs_task_group.start(
        self._submit_run_and_capture_errors, flow_run
    )

Issue 2 seems relatively hard to fully resolve, as it's impossible to atomically mark a flow run as PENDING and submit a Job to K8s. Maybe we can do something by storing the state locally, but that won't work if the pod is restarted on a different node.

Issue 1 looks more straightforward, though. There's already a try/except around these places, but it only catches some exceptions, not all of them. Hopefully it'd be easy to resolve.
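For illustration, a guard roughly like the one below around the deployment lookup would keep a transient 500 from escaping the task group and killing the worker. The method name and call mirror prefect/workers/base.py, but this is only a sketch, not the actual worker code:

from prefect.exceptions import PrefectHTTPStatusError

async def _check_flow_run(self, flow_run) -> None:
    try:
        deployment = await self._client.read_deployment(flow_run.deployment_id)
    except PrefectHTTPStatusError:
        # A transient server error should fail this single submission,
        # not crash the whole worker process.
        self._logger.exception(f"Failed to read deployment for flow run {flow_run.id!r}")
        return
    ...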

Version info

Version:             2.20.14
API version:         0.8.4
Python version:      3.11.7
Git commit:          fb919c67
Built:               Mon, Nov 18, 2024 4:41 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.40.1

Additional context

No response

ashtuchkin added the bug (Something isn't working) label Feb 5, 2025
@cicdw
Member

cicdw commented Feb 5, 2025

Hey @ashtuchkin - thank you for the detailed bug report!

First question: could you share a stack trace, including the exact status codes you're seeing? That would help me pin down which error types to handle when updating the base worker logic.

The underlying client already has finite retry logic for certain status codes, which can be augmented through the PREFECT_CLIENT_RETRY_EXTRA_CODES setting. If your status code isn't one of the defaults, adding it there could be a temporary band-aid while we work on a more robust fix.

Example of setting:

PREFECT_CLIENT_RETRY_EXTRA_CODES='500,421'
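If it's easier to manage alongside your other settings, the same value can also be set on the active profile instead of as an environment variable, e.g.:

prefect config set PREFECT_CLIENT_RETRY_EXTRA_CODES='500'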

@ashtuchkin
Contributor Author

Sure! Here's an example exception after which the worker exits:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.11/site-packages/prefect/cli/_utilities.py", line 42, in wrapper
  |     return fn(*args, **kwargs)
  |            ^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 311, in coroutine_wrapper
  |     return call()
  |            ^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 432, in __call__
  |     return self.result()
  |            ^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 318, in result
  |     return self.future.result(timeout=timeout)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 179, in result
  |     return self.__get_result()
  |            ^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
  |     raise self._exception
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 389, in _run_async
  |     result = await coro
  |              ^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/cli/worker.py", line 169, in start
  |     async with worker_cls(
  |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 1143, in __aexit__
  |     await self.teardown(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 666, in teardown
  |     await super().teardown(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 541, in teardown
  |     await self._runs_task_group.__aexit__(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 871, in _submit_run
    |     await self._check_flow_run(flow_run)
    |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 854, in _check_flow_run
    |     deployment = await self._client.read_deployment(flow_run.deployment_id)
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/orchestration.py", line 1783, in read_deployment
    |     response = await self._client.get(f"/deployments/{deployment_id}")
    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1814, in get
    |     return await self.request(
    |            ^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1585, in request
    |     return await self.send(request, auth=auth, follow_redirects=follow_redirects)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 358, in send
    |     response.raise_for_status()
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 171, in raise_for_status
    |     raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
    | prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://prefect-server:4200/api/deployments/9f495603-99ef-43c9-9a6d-6e5f18660219'
    | Response: {'exception_message': 'Internal Server Error'}
    | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
    +------------------------------------

@ashtuchkin
Contributor Author

Thanks for the pointer to PREFECT_CLIENT_RETRY_EXTRA_CODES! We'll use that when we bump into this next time. For now, we've increased the server deployment size to hopefully avoid 500s altogether.
