
[jobs] make status updates robust when controller dies #4602

Open
wants to merge 6 commits into base: master

Conversation

@cg505 (Collaborator) commented Jan 21, 2025

  1. Jobs cannot get stuck in CANCELLING - it's no longer "terminal".
  2. We use schedule_state rather than job status to determine whether a controller has exited cleanly. This allows us to reliably see if the controller crashed and simplifies some of the checking logic.
  3. Even if jobs are in a terminal status (including SUCCEEDED), we can still set them to FAILED_CONTROLLER if the controller died abnormally, e.g. during cleanup.

In combination with #4552 and #4562, the internal state machine for job status and schedule state should be much more robust and likely to eventually get to a consistent state, even under high load.
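
To make the intended state machine concrete, here is a minimal sketch of the reconciliation idea in points 2 and 3. This is not the PR's code: the `ManagedJobStatus` members mirror names from the description, while the `ScheduleState` enum, the dict-based job record, `reconcile()`, and `DONE` standing for a clean controller exit are hypothetical stand-ins.

```python
import enum


class ManagedJobStatus(enum.Enum):
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    CANCELLING = 'CANCELLING'  # no longer a terminal status (point 1)
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'


class ScheduleState(enum.Enum):
    ALIVE = 'ALIVE'  # hypothetical: controller still scheduled/running
    DONE = 'DONE'    # assumption: controller exited cleanly


def reconcile(job: dict, controller_alive: bool) -> None:
    """Force FAILED_CONTROLLER if the controller died before reaching DONE.

    Point 2: whether the controller exited cleanly is judged by schedule
    state, not job status. Point 3: this applies even if the job status is
    already terminal (e.g. SUCCEEDED).
    """
    if job['schedule_state'] != ScheduleState.DONE and not controller_alive:
        job['status'] = ManagedJobStatus.FAILED_CONTROLLER


# Example: the controller crashed during cleanup after the job SUCCEEDED.
job = {'status': ManagedJobStatus.SUCCEEDED,
       'schedule_state': ScheduleState.ALIVE}
reconcile(job, controller_alive=False)
assert job['status'] == ManagedJobStatus.FAILED_CONTROLLER
```

Because the check keys off the schedule state rather than the job status, a crash during cleanup is caught even after the job has already reported SUCCEEDED.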

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual load test on AWS with r6i.24xlarge controller and ~1400 jobs cancelled.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cg505 added 2 commits January 18, 2025 17:10
This discrepancy caused issues, such as jobs getting stuck as
CANCELLING when the job controller process crashes during cleanup.
@cg505 requested a review from Michaelvll January 21, 2025 22:07
@cg505 (Collaborator, Author) commented Jan 21, 2025

/quicktest-core

@cg505 (Collaborator, Author) commented Jan 21, 2025

/quicktest-core

@Michaelvll (Collaborator) left a comment


Thanks @cg505! This PR looks mostly good to me.

{set_str}
WHERE spot_job_id=(?)""", (now, *fields_to_set.values(), job_id))
if callback_func:
callback_func('FAILED')

Suggested change:
- callback_func('FAILED')
+ callback_func('FAILED_CONTROLLER')

cursor.execute(
f"""\
UPDATE spot SET
end_at = COALESCE(end_at, ?),

It seems the only change is here. Should we just incorporate the change into set_failed, or create a function for the code shared with set_failed?
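
One possible shape for that refactor, sketched against the UPDATE shown above. This is hypothetical, not the PR's code: the helper name and the way `set_str` is built from the dict keys are assumptions.

```python
# Hypothetical shared helper; both set_failed and the controller-failure path
# would call it, each passing its own status (FAILED vs FAILED_CONTROLLER)
# inside fields_to_set.
def _update_terminal_fields(cursor, job_id: int, fields_to_set: dict,
                            end_time: float) -> None:
    # Assumption: set_str is derived from the keys of fields_to_set.
    set_str = ', '.join(f'{key} = (?)' for key in fields_to_set)
    cursor.execute(
        f"""\
        UPDATE spot SET
        end_at = COALESCE(end_at, ?),
        {set_str}
        WHERE spot_job_id=(?)""",
        (end_time, *fields_to_set.values(), job_id))
```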

@@ -677,6 +723,47 @@ def get_schedule_live_jobs(job_id: Optional[int]) -> List[Dict[str, Any]]:
return jobs


def get_jobs_to_check(job_id: Optional[int] = None) -> List[int]:

Suggested change:
- def get_jobs_to_check(job_id: Optional[int] = None) -> List[int]:
+ def get_jobs_to_check_status(job_id: Optional[int] = None) -> List[int]:

Comment on lines 740 to 742
field_values = [
status.value for status in ManagedJobStatus.terminal_statuses()
]

Suggested change:
- field_values = [
-     status.value for status in ManagedJobStatus.terminal_statuses()
- ]
+ terminal_status_values = [
+     status.value for status in ManagedJobStatus.terminal_statuses()
+ ]

Comment on lines 212 to 214
f'Legacy controller process for {job_id} exited '
f'abnormally, and cleanup failed: {cleanup_error}. For '
f'more details, run: sky jobs logs --controller {job_id}')

Suggested change:
- f'Legacy controller process for {job_id} exited '
- f'abnormally, and cleanup failed: {cleanup_error}. For '
- f'more details, run: sky jobs logs --controller {job_id}')
+ f'Legacy controller process has exited abnormally, and cleanup '
+ f'failed: {cleanup_error}. For more details, run: '
+ f'sky jobs logs --controller {job_id}')

Comment on lines 248 to 249
if (schedule_state is None or schedule_state is
managed_job_state.ManagedJobScheduleState.INVALID):

Why can this be either None or INVALID? I thought we had made all legacy states INVALID?

@@ -230,12 +230,12 @@ class ManagedJobStatus(enum.Enum):
# RECOVERING: The cluster is preempted, and the controller process is
# recovering the cluster (relaunching/failover).
RECOVERING = 'RECOVERING'
# Terminal statuses
# SUCCEEDED: The job is finished successfully.
SUCCEEDED = 'SUCCEEDED'
# CANCELLING: The job is requested to be cancelled by the user, and the
# controller is cleaning up the cluster.
CANCELLING = 'CANCELLING'

For jobs.utils.stream_logs_by_id(), we should stop streaming once the job enters the CANCELLING state. :)

@cg505 (Collaborator, Author): added fixes

@cg505 requested a review from Michaelvll January 23, 2025 02:16
@Michaelvll (Collaborator) left a comment


Thanks @cg505! It looks good to me.

WHERE spot_job_id=(?) {task_query_str}""",
(end_time, *list(fields_to_set.values()), job_id, *task_value))
else:
# Only set if end_at is null.

Suggested change:
- # Only set if end_at is null.
+ # Only set if end_at is null, i.e. the previous state is not terminal.

Comment on lines +707 to +708
Args:
job_id: Optional job ID to check. If None, checks all jobs.

minor: The Args section normally goes before Returns.

terminate_cluster(cluster_name)
except Exception as e: # pylint: disable=broad-except
error_msg = (f'Failed to terminate cluster {cluster_name}: '
f'{str(e)}')

The exception type can be important information to log.

Suggested change:
- f'{str(e)}')
+ f'{common_utils.format_exception(e, use_bracket=True)}')

# If we see CANCELLING, just exit - we could miss some job logs but the
# job will be terminated momentarily anyway so we don't really care.
return (not status.is_terminal() and
status is not managed_job_state.ManagedJobStatus.CANCELLING)

For enums, use != instead of is not.
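
Applied to the hunk above (and assuming, as the later hunk suggests, that this return statement lives in should_keep_logging), the check would read:

```python
def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:
    # If we see CANCELLING, just exit - we could miss some job logs but the
    # job will be terminated momentarily anyway so we don't really care.
    return (not status.is_terminal() and
            status != managed_job_state.ManagedJobStatus.CANCELLING)
```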

continue
assert managed_job_status is not None
assert (managed_job_status is

Enums should be compared with ==.

@@ -623,14 +669,16 @@ def is_managed_job_status_updated(
# managed job state is updated.
time.sleep(3 * JOB_STATUS_CHECK_GAP_SECONDS)
managed_job_status = managed_job_state.get_status(job_id)
assert managed_job_status is not None, (job_id, managed_job_status)
should_keep_logging(managed_job_status)

Why do we call this?

'Waiting for controller process to be RUNNING') + '{status_str}'
status_display = rich_utils.safe_status(status_msg.format(status_str=''))

def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:

minor: Should we allow this function to take Optional[managed_job_state.ManagedJobStatus], so that we don't have to assert before every invocation of this function?
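
A rough sketch of that signature change (how a None status should be treated is an assumption here, not something stated in the review):

```python
from typing import Optional


def should_keep_logging(
        status: Optional[managed_job_state.ManagedJobStatus]) -> bool:
    # Assumption: an unknown status means "keep streaming", so callers no
    # longer need to assert that the status is not None before calling.
    if status is None:
        return True
    return (not status.is_terminal() and
            status != managed_job_state.ManagedJobStatus.CANCELLING)
```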

@Michaelvll added this to the v0.8.0 milestone Jan 23, 2025