-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Add Azure ML Compute Instance Support #3905
base: master
Are you sure you want to change the base?
Conversation
Added AZ ML Catalog. This is ready for a review! @Michaelvll |
@Michaelvll bump for this |
bump for review @Michaelvll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support of Azure ML cluster @cblmemo! It seems we are adding some specific logic for Azure ml cluster in Resources
. Can we avoid those?
# Check whether Azure cluster uses Azure ML API | ||
if launched_resources.cloud.is_same_cloud(clouds.Azure()): | ||
task_use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'), | ||
False) | ||
cluster_use_az_ml = launched_resources.use_az_ml | ||
if cluster_use_az_ml != task_use_az_ml: | ||
task_str = 'uses' if task_use_az_ml else 'does not use' | ||
cluster_str = 'uses' if cluster_use_az_ml else 'does not use' | ||
with ux_utils.print_exception_no_traceback(): | ||
raise exceptions.ResourcesMismatchError( | ||
f'Task requirements {task_str} Azure ML API, but the ' | ||
f'specified cluster {cluster_name} {cluster_str} it. ' | ||
f'Please set azure.use_az_ml to {cluster_use_az_ml} in ' | ||
'~/.sky/config.yaml.') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the existing cluster's setting even if the config has changed to use ML API. This is to be consistent with other part of the code, e.g., if a user specify GCP DWS, and having a existing cluster not using DWS, we should still allow task to be submitted to that cluster.
if handle.launched_resources.cloud.is_same_cloud(clouds.Azure()): | ||
task_use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'), | ||
False) | ||
cluster_use_az_ml = handle.launched_resources.use_az_ml | ||
if cluster_use_az_ml != task_use_az_ml: | ||
task_str = 'uses' if task_use_az_ml else 'does not use' | ||
cluster_str = 'uses' if cluster_use_az_ml else 'does not use' | ||
with ux_utils.print_exception_no_traceback(): | ||
raise exceptions.ResourcesMismatchError( | ||
f'Current setup {task_str} Azure ML API, but the ' | ||
f'specified cluster {cluster_name} to terminate ' | ||
f'{cluster_str} it. Please set azure.use_az_ml ' | ||
f'to {cluster_use_az_ml} in ~/.sky/config.yaml.') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, we should just allow the teardown using the non-ML API if the existing cluster was launched using non-ML API
def _use_az_ml() -> bool: | ||
return skypilot_config.get_nested(('azure', 'use_az_ml'), False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking the config in skypilot config, it would be better to keep the config in cluster yaml (e.g., provider_config
section) and create instances purely based on the cluster yaml. In this case, it is easier to allow old clusters to still work when user change skypiot config.
self._use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'), | ||
False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a cloud-specific setting in Resources
is not clean. Can we get rid of it?
def _try_validate_az_ml(self) -> None: | ||
if not self.use_az_ml: | ||
return | ||
if self.ports is not None: | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError( | ||
'Open ports are not supported for Azure Machine Learning.') | ||
if (self.disk_tier is not None or | ||
self.disk_size != _DEFAULT_DISK_SIZE_GB): | ||
# Azure ML does not support custom disk size and disk tier. | ||
# Reference: https://stackoverflow.com/questions/66923216/change-disk-type-azure-ml # pylint: disable=line-too-long | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError('Custom disk size and disk tier are not ' | ||
'supported for Azure Machine Learning.') | ||
if self.image_id is not None and self.extract_docker_image() is None: | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError( | ||
'Custom image is not supported for Azure Machine Learning.') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not clean. Why can't we add the check to unsupported_features of a specific cloud?
if version < 20: | ||
self._use_az_ml = state.pop('_use_az_ml', False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid this.
# Already checked in resources | ||
assert ssh_proxy_command is None, ( | ||
'ssh_proxy_command is not supported when using docker.') | ||
self.ip = 'localhost' | ||
self.ssh_user = docker_user | ||
self.port = constants.DEFAULT_DOCKER_PORT | ||
self._docker_ssh_proxy_command = lambda ssh: ' '.join( | ||
ssh + ssh_options_list(ssh_private_key, None | ||
ssh + ssh_options_list(ssh_private_key, None, port=port | ||
) + ['-W', '%h:%p', f'{ssh_user}@{ip}']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this an Azure ML cluster-specific change? Should we have another PR for this?
Add Support for Azure ML Compute Instance.
Related: skypilot-org/skypilot-catalog#84
Tested (run the relevant ones):
bash format.sh
sky launch
,sky stop
,sky exec
,sky start
works well for Azure MLpytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh