[jobs] make status updates robust when controller dies #4602
This discrepancy caused issues, such as jobs getting stuck as CANCELLING when the job controller process crashes during cleanup.
/quicktest-core
Thanks @cg505! This PR looks mostly good to me.
{set_str}
WHERE spot_job_id=(?)""", (now, *fields_to_set.values(), job_id))
if callback_func:
    callback_func('FAILED')
- callback_func('FAILED')
+ callback_func('FAILED_CONTROLLER')
sky/jobs/state.py
Outdated
cursor.execute(
    f"""\
    UPDATE spot SET
    end_at = COALESCE(end_at, ?),
It seems the only change is here. Should we just incorporate the change into `set_failed`, or create a function for the code shared with `set_failed`?
sky/jobs/state.py
Outdated
@@ -677,6 +723,47 @@ def get_schedule_live_jobs(job_id: Optional[int]) -> List[Dict[str, Any]]:
    return jobs


def get_jobs_to_check(job_id: Optional[int] = None) -> List[int]:
- def get_jobs_to_check(job_id: Optional[int] = None) -> List[int]:
+ def get_jobs_to_check_status(job_id: Optional[int] = None) -> List[int]:
sky/jobs/state.py
Outdated
field_values = [
    status.value for status in ManagedJobStatus.terminal_statuses()
]
- field_values = [
-     status.value for status in ManagedJobStatus.terminal_statuses()
- ]
+ terminal_status_values = [
+     status.value for status in ManagedJobStatus.terminal_statuses()
+ ]
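For context on why the more specific name helps: a list like this typically ends up as the parameter list of an SQL `IN (...)` clause, where it is easy to confuse with other bound values. A minimal sketch of that pattern, assuming a hypothetical `jobs` table and a stand-in `JobStatus` enum (neither is the actual schema or enum in `sky/jobs/state.py`):

```python
import enum
import sqlite3


class JobStatus(enum.Enum):
    """Hypothetical stand-in for ManagedJobStatus."""
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'

    @classmethod
    def terminal_statuses(cls):
        return [cls.SUCCEEDED, cls.FAILED]


def non_terminal_job_ids(conn):
    """Return ids of jobs whose status is not terminal."""
    terminal_status_values = [s.value for s in JobStatus.terminal_statuses()]
    # One '?' per value, so the values are bound rather than interpolated.
    placeholders = ', '.join('?' * len(terminal_status_values))
    rows = conn.execute(
        f'SELECT job_id FROM jobs WHERE status NOT IN ({placeholders}) '
        'ORDER BY job_id', terminal_status_values).fetchall()
    return [row[0] for row in rows]


# Hypothetical table, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, status TEXT)')
conn.executemany('INSERT INTO jobs VALUES (?, ?)',
                 [(1, 'RUNNING'), (2, 'SUCCEEDED'), (3, 'FAILED'),
                  (4, 'RUNNING')])
```

Only the placeholder string is formatted into the query; the status values themselves are passed as parameters.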
sky/jobs/utils.py
Outdated
f'Legacy controller process for {job_id} exited '
f'abnormally, and cleanup failed: {cleanup_error}. For '
f'more details, run: sky jobs logs --controller {job_id}')
- f'Legacy controller process for {job_id} exited '
- f'abnormally, and cleanup failed: {cleanup_error}. For '
- f'more details, run: sky jobs logs --controller {job_id}')
+ f'Legacy controller process has exited abnormally, and cleanup '
+ f'failed: {cleanup_error}. For more details, run: '
+ f'sky jobs logs --controller {job_id}')
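One thing to watch when rewrapping a message like this: with implicit string concatenation, each literal needs its own `f` prefix, otherwise its placeholders are emitted verbatim. A small standalone illustration (the message text and values here are made up for the example):

```python
job_id = 7
cleanup_error = 'cluster not found'

# Only the first literal is an f-string; the second is a plain string,
# so '{cleanup_error}' is kept as literal text.
broken = (f'Cleanup for job {job_id} failed: '
          '{cleanup_error}.')

# Every literal that contains a placeholder carries the f prefix.
fixed = (f'Cleanup for job {job_id} failed: '
         f'{cleanup_error}.')
```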
sky/jobs/utils.py
Outdated
if (schedule_state is None or schedule_state is
        managed_job_state.ManagedJobScheduleState.INVALID):
Why can it be either `None` or `INVALID`? I thought we had made all legacy states `INVALID`?
@@ -230,12 +230,12 @@ class ManagedJobStatus(enum.Enum):
    # RECOVERING: The cluster is preempted, and the controller process is
    # recovering the cluster (relaunching/failover).
    RECOVERING = 'RECOVERING'
    # Terminal statuses
    # SUCCEEDED: The job is finished successfully.
    SUCCEEDED = 'SUCCEEDED'
    # CANCELLING: The job is requested to be cancelled by the user, and the
    # controller is cleaning up the cluster.
    CANCELLING = 'CANCELLING'
For `jobs.utils.stream_logs_by_id()`, we should stop streaming once the job enters the `CANCELLING` state. :)
added fixes
Thanks @cg505! It looks good to me.
WHERE spot_job_id=(?) {task_query_str}""",
(end_time, *list(fields_to_set.values()), job_id, *task_value))
else:
    # Only set if end_at is null.
- # Only set if end_at is null.
+ # Only set if end_at is null, i.e. the previous state is not terminal.
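To make the two behaviours in this thread concrete, here is a minimal sketch against an in-memory table. The column names `spot_job_id` and `end_at` come from the diff; the rest of the schema and both helper functions are hypothetical and are not the code in `sky/jobs/state.py`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Simplified, hypothetical version of the spot table.
conn.execute(
    'CREATE TABLE spot (spot_job_id INTEGER, status TEXT, end_at REAL)')
conn.execute("INSERT INTO spot VALUES (1, 'CANCELLING', NULL)")
conn.execute("INSERT INTO spot VALUES (2, 'SUCCEEDED', 100.0)")


def set_status_if_not_ended(conn, job_id, status, now):
    """Update only rows with no end_at yet; return the rows changed."""
    cursor = conn.execute(
        'UPDATE spot SET status = ?, end_at = ? '
        'WHERE spot_job_id = ? AND end_at IS NULL', (status, now, job_id))
    return cursor.rowcount


def force_status_keep_end_at(conn, job_id, status, now):
    """Always overwrite status, but keep an end_at that is already set."""
    conn.execute(
        'UPDATE spot SET status = ?, end_at = COALESCE(end_at, ?) '
        'WHERE spot_job_id = ?', (status, now, job_id))


def get_row(conn, job_id):
    return conn.execute(
        'SELECT status, end_at FROM spot WHERE spot_job_id = ?',
        (job_id,)).fetchone()
```

The guarded variant leaves a job that already ended untouched, while the `COALESCE` variant changes the status and preserves the original end time.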
Args:
    job_id: Optional job ID to check. If None, checks all jobs.
minor: `Args` normally goes before `Returns`.
    terminate_cluster(cluster_name)
except Exception as e:  # pylint: disable=broad-except
    error_msg = (f'Failed to terminate cluster {cluster_name}: '
                 f'{str(e)}')
The exception type can be important information to log.
- f'{str(e)}')
+ f'{common_utils.format_exception(e, use_bracket=True)}')
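To illustrate why the type matters: for some exceptions `str(e)` alone is nearly meaningless. The helper below is a hypothetical stand-in written for this example; it is not `common_utils.format_exception`, and its output format is an assumption rather than what that function produces.

```python
def describe_exception(e: Exception) -> str:
    """Hypothetical helper: prefix the message with the exception type."""
    return f'[{type(e).__name__}] {e}'


try:
    {}['cluster']  # Raises KeyError('cluster').
except Exception as e:  # pylint: disable=broad-except
    bare = str(e)
    detailed = describe_exception(e)
```

Here `bare` holds only the quoted key, which gives a reader of the log no hint that a lookup failed, while `detailed` also names the exception type.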
# If we see CANCELLING, just exit - we could miss some job logs but the
# job will be terminated momentarily anyway so we don't really care.
return (not status.is_terminal() and
        status is not managed_job_state.ManagedJobStatus.CANCELLING)
For enums, use `!=` instead of `is not`.
continue
assert managed_job_status is not None
assert (managed_job_status is
Enums should be compared with `==`.
@@ -623,14 +669,16 @@ def is_managed_job_status_updated(
# managed job state is updated.
time.sleep(3 * JOB_STATUS_CHECK_GAP_SECONDS)
managed_job_status = managed_job_state.get_status(job_id)
assert managed_job_status is not None, (job_id, managed_job_status)
should_keep_logging(managed_job_status)
why do we call this?
'Waiting for controller process to be RUNNING') + '{status_str}'
status_display = rich_utils.safe_status(status_msg.format(status_str=''))


def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:
minor: Should we allow this function to take `Optional[managed_job_state.ManagedJobStatus]`, so that we don't have to assert before every invocation of this function?
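A rough sketch of what that could look like, using a stand-in `JobStatus` enum rather than the real `ManagedJobStatus`. Treating `None` as "stop logging" is an assumption made for this example; the PR itself asserts that the status is not `None`.

```python
import enum
from typing import Optional


class JobStatus(enum.Enum):
    """Hypothetical stand-in for managed_job_state.ManagedJobStatus."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    SUCCEEDED = 'SUCCEEDED'

    def is_terminal(self) -> bool:
        return self in (JobStatus.SUCCEEDED,)


def should_keep_logging(status: Optional[JobStatus]) -> bool:
    # Assumption: an unknown (None) status means there is nothing to stream.
    if status is None:
        return False
    # CANCELLING is not terminal, but the job is about to be torn down,
    # so streaming stops there as well.
    return not status.is_terminal() and status != JobStatus.CANCELLING
```

With this signature, call sites no longer need an `assert status is not None` before each call, at the cost of deciding once what a missing status means.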
In combination with #4552 and #4562, the internal state machine for job status and schedule state should be much more robust and likely to eventually get to a consistent state, even under high load.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh