
[jobs] revamp scheduling for managed jobs #4485

Open · wants to merge 8 commits into base: master
Conversation

@cg505 (Collaborator) commented Dec 19, 2024

Detaches the job controller from the ray worker and the ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs sqlite database on the controller.

See the comments in sky/jobs/scheduler.py for more info.
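To make the mechanism concrete, here is a minimal sketch of sqlite-backed scheduling state, assuming a `job_schedule` table with a `schedule_state` column; the table name, schema, and helper are illustrative assumptions, not the actual code in sky/jobs/state.py or sky/jobs/scheduler.py.

```python
# Minimal sketch of sqlite-backed scheduling state (illustrative only; the
# actual schema and logic live in sky/jobs/state.py and sky/jobs/scheduler.py
# and may differ).
import sqlite3

conn = sqlite3.connect('managed_jobs.db')  # hypothetical database path
conn.execute("""
    CREATE TABLE IF NOT EXISTS job_schedule (
        job_id INTEGER PRIMARY KEY,
        schedule_state TEXT NOT NULL DEFAULT 'WAITING'  -- e.g. WAITING, LAUNCHING
    )""")


def schedule_waiting_jobs(free_launch_slots: int) -> list:
    """Move up to `free_launch_slots` WAITING jobs to LAUNCHING in one transaction."""
    scheduled = []
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            'SELECT job_id FROM job_schedule WHERE schedule_state = ? LIMIT ?',
            ('WAITING', free_launch_slots)).fetchall()
        for (job_id,) in rows:
            conn.execute(
                'UPDATE job_schedule SET schedule_state = ? '
                'WHERE job_id = ? AND schedule_state = ?',
                ('LAUNCHING', job_id, 'WAITING'))
            scheduled.append(job_id)
            # The real scheduler would spawn the controller process for job_id here.
    return scheduled
```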

Previously, the number of simultaneous jobs was limited to 4x the CPU count by our per-job ray placement group request:

CONTROLLER_PROCESS_CPU_DEMAND = 0.25

After this PR, there are two parallelism limits (sketched below):
  • 4 * cpu_count jobs can be launching at the same time.
  • memory / 350MB jobs can be running at the same time.
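A rough sketch of how these two limits could be derived from the controller's CPU count and memory; the constant values mirror the numbers stated above, but the helper names are hypothetical and psutil is only an assumption for this sketch.

```python
# Rough sketch of the two parallelism limits described above (helper names are
# hypothetical; the real constants live in the sky.jobs code and may differ).
import os

import psutil  # assumption: psutil is available on the controller

LAUNCHES_PER_CPU = 4                     # "4 * cpu_count jobs can be launching"
MEMORY_PER_RUNNING_JOB = 350 * 1024**2   # ~350 MB per running job's controller process


def launch_parallelism() -> int:
    """How many jobs may be in the launching phase simultaneously."""
    return (os.cpu_count() or 1) * LAUNCHES_PER_CPU


def run_parallelism() -> int:
    """How many jobs may be running simultaneously, bounded by controller memory."""
    return max(1, psutil.virtual_memory().total // MEMORY_PER_RUNNING_JOB)
```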

Common and max instance sizes and their parallelism limits:

| instance type | vCPUs | memory (GB) | old job parallelism | (new) launch parallelism | (new) run parallelism |
|---|---|---|---|---|---|
| m6i.large / Standard_D2s_v5 / n2-standard-2 | 2 | 8 | 8 launching/running at once | 8 launches at once | 22 running at once |
| r6i.large / Standard_E2s_v5 / n2-highmem-2 | 2 | 16 | 8 launching/running at once | 8 launches at once | 44 running at once |
| m6i.2xlarge / Standard_D8s_v2 / n2-standard-8 | 8 | 32 | 32 launching/running at once | 32 launches at once | 90 running at once |
| Standard_E96s_v5 | 96 | 672 | 384 launching/running at once | 384 launches at once | ~1930 running at once |
| n2-highmem-128 | 128 | 864 | 512 launching/running at once | 512 launches at once | ~2480 running at once |
| r6i.32xlarge | 128 | 1024 | 512 launching/running at once | 512 launches at once | ~2950 running at once |

Run parallelism varies slightly between clouds, since instances listed with the same amount of memory do not actually have exactly the same number of bytes.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.

(Resolved review threads on sky/jobs/scheduler.py and sky/jobs/state.py.)
os.makedirs(logs_dir, exist_ok=True)
log_path = os.path.join(logs_dir, f'{managed_job_id}.log')

pid = subprocess_utils.launch_new_process_tree(
@cg505 (Collaborator, Author) commented:
If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck, since the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.

A collaborator replied:
We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.

@cg505 (Collaborator, Author) replied:
We are already using the existing managed job skylet event for that, but the problem is that if it dies right here, there's no way to know whether the scheduler is just about to start the process or has already died. We need a way to check whether the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
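One possible direction (a sketch only, not something this PR implements) is to record a timestamp when a job transitions WAITING -> LAUNCHING and have the skylet event treat jobs that have sat in LAUNCHING too long without a live controller process as needing recovery. The column names, timeout, and helper below are hypothetical.

```python
# Hypothetical recovery check for the case discussed above: if a job has been
# in LAUNCHING for too long and no controller process exists for it, assume
# the scheduler died between submitting the job and spawning the process.
import time
from typing import Optional

import psutil  # assumption: used to check whether the controller pid is alive

LAUNCHING_GRACE_PERIOD_SECONDS = 300  # arbitrary value for this sketch


def launch_appears_stuck(schedule_state: str, launching_started_at: float,
                         controller_pid: Optional[int]) -> bool:
    """Return True if the WAITING -> LAUNCHING handoff seems to have died."""
    if schedule_state != 'LAUNCHING':
        return False
    if controller_pid is not None and psutil.pid_exists(controller_pid):
        return False  # the controller process did start; nothing to recover
    return time.time() - launching_started_at > LAUNCHING_GRACE_PERIOD_SECONDS
```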

@Michaelvll (Collaborator) left a comment:
Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.

(Resolved review threads on sky/backends/cloud_vm_ray_backend.py, sky/jobs/constants.py, and sky/jobs/scheduler.py.)

(Resolved review threads on sky/jobs/scheduler.py.)
@cg505 cg505 marked this pull request as ready for review December 20, 2024 05:34
@cg505 cg505 requested a review from Michaelvll December 20, 2024 05:34
@cg505 cg505 changed the title to [jobs] revamp scheduling for managed jobs Dec 20, 2024
@cg505 (Collaborator, Author) commented Dec 20, 2024:

/quicktest-core

@Michaelvll (Collaborator) left a comment:

Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing of managed jobs, especially:

  1. scheduling speed for jobs
  2. special cases that may get the scheduling stuck
  3. many jobs
  4. cancellation of jobs
  5. scheduling many jobs in parallel

@@ -191,6 +190,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
f'Submitted managed job {self._job_id} (task: {task_id}, name: '
f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}')

scheduler.wait_until_launch_okay(self._job_id)
A collaborator commented:
The new API looks much better than before. Maybe we can turn this into a context manager, so as to combine the wait and the finish.

@cg505 (Collaborator, Author) replied:
self._strategy_executor.launch() may itself call scheduler.launch_finished and scheduler.wait_until_launch_okay in the recovery case, so I feel like a context manager wouldn't really be accurate.
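For reference, the context-manager idea would look roughly like the sketch below; as noted above, it would not map cleanly onto the recovery path where launch() re-enters the scheduler itself, so this is illustrative only.

```python
# Illustrative only: a context manager pairing wait_until_launch_okay with
# launch_finished. This is not what the PR does, and it would not cover the
# recovery case where self._strategy_executor.launch() calls these functions
# itself.
import contextlib

from sky.jobs import scheduler  # assumes the module layout introduced in this PR


@contextlib.contextmanager
def launch_slot(job_id: int):
    """Hold a launch slot for the duration of the block."""
    scheduler.wait_until_launch_okay(job_id)
    try:
        yield
    finally:
        scheduler.launch_finished(job_id)


# Usage sketch:
#     with launch_slot(self._job_id):
#         self._strategy_executor.launch()
```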

(Review threads on sky/jobs/state.py and sky/jobs/scheduler.py.)
]
if show_all:
-    columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE']
+    columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE', 'SCHED. STATE']
A collaborator commented:
nit: I would prefer not to have the SCHED. STATE column; instead, we may want to do something similar to kubectl describe pod, which shows a detailed description of what the pod is working on within the same state. For example, we could rename the FAILURE column to DESCRIPTION.

@cg505 (Collaborator, Author) replied:
I don't want to spend too much time on this, but I'll take a look.

The collaborator replied:
Yep, it doesn't need to be a large change. Just add the state as a description in the FAILURE column (which should now be renamed to DESCRIPTION).
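A minimal sketch of how the scheduler state could be folded into a renamed DESCRIPTION column instead of a separate SCHED. STATE column; the helper name and column handling below are hypothetical.

```python
# Hypothetical helper: fold the scheduler state into a single DESCRIPTION
# column (formerly FAILURE) rather than adding a separate SCHED. STATE column.
from typing import Optional


def format_job_description(schedule_state: str,
                           failure_reason: Optional[str]) -> str:
    """Combine scheduler state and failure details, kubectl-describe style."""
    if failure_reason:
        return f'{schedule_state}: {failure_reason}'
    return schedule_state
```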

(Resolved review threads on sky/jobs/utils.py.)
@cg505 cg505 requested a review from Michaelvll January 7, 2025 21:32
@Michaelvll (Collaborator) commented:

/smoke-test managed_jobs

@zpoint (Collaborator) commented Jan 9, 2025:

We need to merge this PR to get the smoke-test comment to work.
I have resolved the comments; could you help take a look again? @Michaelvll
