Prefect Workers crash when server returns 500s #16977

Open
ashtuchkin opened this issue Feb 5, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@ashtuchkin
Contributor

Bug summary

We run a pretty big self-hosted installation of Prefect 2.x (20k flow runs/day, 200k tasks) and noticed that when the self-hosted API server becomes overloaded, it starts returning HTTP 500s.

That's OK by itself, but it makes the Workers (we use Kubernetes) exit unexpectedly and then get restarted by K8s (issue 1).
Specifically, we see the following problematic stack traces, after which the worker exits:

  • _submit_run -> _check_flow_run -> read_deployment
  • cancel_run -> _get_configuration -> read_deployment

Restarting also seems OK by itself; however, we noticed that if another flow run was marked PENDING but no K8s Job had been scheduled yet when the worker exited, it stays stuck in PENDING forever (issue 2). Here's the relevant code:

ready_to_submit = await self._propose_pending_state(flow_run)
self._logger.debug(f"Ready to submit {flow_run.id}: {ready_to_submit}")
if ready_to_submit:
    readiness_result = await self._runs_task_group.start(
        self._submit_run_and_capture_errors, flow_run
    )

Issue 2 seems relatively hard to fully resolve, as it's impossible to atomically mark a flow run as PENDING and submit a Job to K8s. Maybe we can do something by storing the state locally, but that won't work if the pod is restarted on a different node.

Issue 1 looks more straightforward, though. There's already a try/except around these places, but it only catches some exceptions, not all of them. Hopefully it'd be easy to resolve.
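For illustration, a guard roughly like the one below around the deployment lookup would keep a transient 500 from escaping the task group and killing the worker. The method name and call mirror prefect/workers/base.py, but this is only a sketch, not the actual worker code:

from prefect.exceptions import PrefectHTTPStatusError

async def _check_flow_run(self, flow_run) -> None:
    try:
        deployment = await self._client.read_deployment(flow_run.deployment_id)
    except PrefectHTTPStatusError:
        # A transient server error should fail this single submission,
        # not crash the whole worker process.
        self._logger.exception(f"Failed to read deployment for flow run {flow_run.id!r}")
        return
    ...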

Version info

Version:             2.20.14
API version:         0.8.4
Python version:      3.11.7
Git commit:          fb919c67
Built:               Mon, Nov 18, 2024 4:41 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.40.1

Additional context

No response

ashtuchkin added the bug (Something isn't working) label Feb 5, 2025
@cicdw
Member

cicdw commented Feb 5, 2025

Hey @ashtuchkin - thank you for the detailed bug report!

First question: could you share a stack trace, including the exact status codes you're seeing? That would help me pin down which error types to handle when updating the base worker logic.

The underlying client already has finite retry logic for certain status codes, which can be augmented through the PREFECT_CLIENT_RETRY_EXTRA_CODES setting. If your status code isn't one of the defaults, adding it there could be a temporary band-aid while we work on a more robust fix.

Example of setting:

PREFECT_CLIENT_RETRY_EXTRA_CODES='500,421'
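If it's easier to manage alongside your other settings, the same value can also be set on the active profile instead of as an environment variable, e.g.:

prefect config set PREFECT_CLIENT_RETRY_EXTRA_CODES='500'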

@ashtuchkin
Contributor Author

Sure! Here's an example exception after which the worker exits:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.11/site-packages/prefect/cli/_utilities.py", line 42, in wrapper
  |     return fn(*args, **kwargs)
  |            ^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 311, in coroutine_wrapper
  |     return call()
  |            ^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 432, in __call__
  |     return self.result()
  |            ^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 318, in result
  |     return self.future.result(timeout=timeout)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 179, in result
  |     return self.__get_result()
  |            ^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
  |     raise self._exception
  |   File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 389, in _run_async
  |     result = await coro
  |              ^^^^^^^^^^
  |   File "/usr/local/lib/python3.11/site-packages/prefect/cli/worker.py", line 169, in start
  |     async with worker_cls(
  |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 1143, in __aexit__
  |     await self.teardown(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 666, in teardown
  |     await super().teardown(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 541, in teardown
  |     await self._runs_task_group.__aexit__(*exc_info)
  |   File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 871, in _submit_run
    |     await self._check_flow_run(flow_run)
    |   File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 854, in _check_flow_run
    |     deployment = await self._client.read_deployment(flow_run.deployment_id)
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/orchestration.py", line 1783, in read_deployment
    |     response = await self._client.get(f"/deployments/{deployment_id}")
    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1814, in get
    |     return await self.request(
    |            ^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1585, in request
    |     return await self.send(request, auth=auth, follow_redirects=follow_redirects)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 358, in send
    |     response.raise_for_status()
    |   File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 171, in raise_for_status
    |     raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
    | prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://prefect-server:4200/api/deployments/9f495603-99ef-43c9-9a6d-6e5f18660219'
    | Response: {'exception_message': 'Internal Server Error'}
    | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
    +------------------------------------

@ashtuchkin
Contributor Author

Thanks for the pointer to PREFECT_CLIENT_RETRY_EXTRA_CODES! We'll use that when we bump into this next time. For now, we've increased the server deployment size to hopefully avoid 500s altogether.
