[jobs] revamp scheduling for managed jobs #4485
base: master
Conversation
sky/jobs/scheduler.py (outdated)

        os.makedirs(logs_dir, exist_ok=True)
        log_path = os.path.join(logs_dir, f'{managed_job_id}.log')

        pid = subprocess_utils.launch_new_process_tree(
If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck, since the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.
We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.
We are already using the existing managed job skylet event for that, but the problem is that if it dies right here, there's no way to know whether the scheduler is just about to start the process or has already died. We need a way to check if the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
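To make the timestamp idea concrete, here is a minimal sketch of such a periodic check, not this PR's actual code; the table and column names (spot, schedule_state, launching_started_at, controller_pid) and the grace period are hypothetical.

```python
# Hypothetical sketch: a periodic skylet-style check for jobs that entered
# LAUNCHING but never got a controller process. Schema names are illustrative.
import sqlite3
import time

STUCK_GRACE_SECONDS = 300  # assumed grace period before declaring a job stuck


def find_stuck_launches(db_path: str) -> list:
    """Return job ids stuck between scheduling and controller start."""
    cutoff = time.time() - STUCK_GRACE_SECONDS
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            'SELECT job_id FROM spot '
            'WHERE schedule_state = ? '
            'AND launching_started_at < ? '
            'AND controller_pid IS NULL',
            ('LAUNCHING', cutoff)).fetchall()
    return [job_id for (job_id,) in rows]
```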
Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.
/quicktest-core
Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:
- scheduling speed for jobs
- special cases that may get the scheduling stuck
- many jobs
- cancellation of jobs
- scheduling jobs in parallel
@@ -191,6 +190,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
            f'Submitted managed job {self._job_id} (task: {task_id}, name: '
            f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}')

+           scheduler.wait_until_launch_okay(self._job_id)
The new API looks much better than before. Maybe we can turn this into a context manager, so as to combine the wait and finish calls.
self._strategy_executor.launch() may call scheduler.launch_finished and scheduler.wait_until_launch_okay in the recovery case, so I feel like the context manager wouldn't really be accurate.
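For illustration, the suggested context manager might look like the sketch below; launch_slot is a hypothetical name, and, as noted above, it would not cover the recovery path where launch() releases and re-acquires the slot itself.

```python
import contextlib

from sky.jobs import scheduler  # assumed import path for this PR's scheduler module


@contextlib.contextmanager
def launch_slot(job_id: int):
    """Hypothetical wrapper: hold a launch slot for the duration of a launch."""
    scheduler.wait_until_launch_okay(job_id)
    try:
        yield
    finally:
        scheduler.launch_finished(job_id)
```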
sky/jobs/utils.py (outdated)

        ]
        if show_all:
-           columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE']
+           columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE', 'SCHED. STATE']
nit: I would prefer not to have the SCHED. STATE column. Instead, we may want to do something similar to kubectl describe pod, which shows a detailed description of what the pod is working on within the same state. For example, we could rename the FAILURE column to DESCRIPTION.
Don't want to spend too much time on this but I'll take a look.
Yep, it doesn't need to be a large change. Just adding the state as a description in the FAILURE column (which should now be renamed to DESCRIPTION) should be enough.
/smoke-test managed_jobs
Need to merge this PR to get …
Detaches the job controller from the Ray worker and the Ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs sqlite database on the controller.
See the comments in sky/jobs/scheduler.py for more info.
Previously, the number of simultaneous jobs was limited to 4x the CPU count by our per-job Ray placement group request (skypilot/sky/skylet/constants.py, line 293 at 1578108).
After this PR, there are two parallelism limits:
- 4 * cpu_count jobs can be launching at the same time.
- memory / 350MB jobs can be running at the same time.

Common and max instance sizes and their parallelism limits:
Run parallelism varies slightly between clouds, as instances listed with the same amount of memory do not actually have exactly the same number of bytes.
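For illustration, the two limits could be computed roughly as sketched below; the function names and the use of psutil are assumptions, not this PR's actual code.

```python
# Rough sketch of the two parallelism limits described above.
import os

import psutil

JOB_MEMORY_MB = 350  # per-job memory budget from the description above


def max_launching_jobs() -> int:
    # Up to 4 * cpu_count jobs may be in the launching phase at once.
    return 4 * (os.cpu_count() or 1)


def max_running_jobs() -> int:
    # Up to total_memory / 350MB jobs may be running at once.
    total_mb = psutil.virtual_memory().total // (1024 * 1024)
    return max(1, total_mb // JOB_MEMORY_MB)
```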
Tested (run the relevant ones):
- bash format.sh
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name