Commit

Merge branch 'master' into dev/zeping/support_filter_flag_for_buildkite
zpoint committed Jan 6, 2025
2 parents 493d9e3 + 38a822a commit 6786855
Showing 19 changed files with 521 additions and 97 deletions.
42 changes: 41 additions & 1 deletion docs/source/examples/managed-jobs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -152,6 +152,7 @@ The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting <sky-storage>` ensures
Note that the application code should save program checkpoints periodically and reload those states when the job is restarted.
This is typically achieved by reloading the latest checkpoint at the beginning of your program.
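A minimal sketch of this save-and-resume pattern (toy state and a hypothetical checkpoint directory; on a managed job the directory would typically live on a MOUNT-mode bucket path such as :code:`/checkpoint` so it survives preemptions):

```python
import pathlib

# Hypothetical checkpoint directory; replace with your bucket mount point.
CKPT_DIR = pathlib.Path('/tmp/ckpt-demo')
CKPT_DIR.mkdir(parents=True, exist_ok=True)

def save_checkpoint(step: int) -> None:
    # Periodically persist training state.
    (CKPT_DIR / f'step-{step}.txt').write_text(str(step))

def load_latest_step() -> int:
    # On (re)start, resume from the most recent checkpoint, else from 0.
    ckpts = sorted(CKPT_DIR.glob('step-*.txt'),
                   key=lambda p: int(p.stem.split('-')[1]))
    return int(ckpts[-1].read_text()) if ckpts else 0

start = load_latest_step()
for step in range(start + 1, start + 4):
    save_checkpoint(step)
```

If the job is preempted and restarted, the loop resumes from the last saved step instead of step 0.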


.. _spot-jobs-end-to-end:

An End-to-End Example
Expand Down Expand Up @@ -455,6 +456,46 @@ especially useful when there are many in-progress jobs to monitor, which the
terminal-based CLI may need more than one page to display.


.. _intermediate-bucket:

Intermediate storage for files
------------------------------

For managed jobs, SkyPilot requires an intermediate bucket to store files used in the task, such as local file mounts, temporary files, and the workdir.
If you do not configure a bucket, SkyPilot will automatically create a temporary bucket named :code:`skypilot-filemounts-{username}-{run_id}` for each job launch. SkyPilot automatically deletes the bucket after the job completes.
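As a sketch of that naming scheme (the username and run ID below are hypothetical; SkyPilot derives the real values at launch time):

```python
# Hypothetical values illustrating the auto-generated bucket name.
username = 'alice'
run_id = '4f2c9abc'
bucket_name = f'skypilot-filemounts-{username}-{run_id}'
```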

Alternatively, you can pre-provision a bucket and use it as an intermediate store for files by setting :code:`jobs.bucket` in :code:`~/.sky/config.yaml`:

.. code-block:: yaml

    # ~/.sky/config.yaml
    jobs:
      bucket: s3://my-bucket  # Supports s3://, gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>

If you choose to specify a bucket, ensure that the bucket already exists and that you have the necessary permissions.

When using a pre-provisioned intermediate bucket with :code:`jobs.bucket`, SkyPilot creates job-specific directories under the bucket root to store files. They are organized in the following structure:

.. code-block:: text

    # Cloud bucket, s3://my-bucket/ for example
    my-bucket/
    ├── job-15891b25/            # Job-specific directory
    │   ├── local-file-mounts/   # Files from local file mounts
    │   ├── tmp-files/           # Temporary files
    │   └── workdir/             # Files from workdir
    └── job-cae228be/            # Another job's directory
        ├── local-file-mounts/
        ├── tmp-files/
        └── workdir/

When using a custom bucket (:code:`jobs.bucket`), the job-specific directories (e.g., :code:`job-15891b25/`) created by SkyPilot are removed when the job completes.

.. tip::
Multiple users can share the same intermediate bucket. Each user's jobs will have their own unique job-specific directories, ensuring that files are kept separate and organized.


Concept: Jobs Controller
------------------------

Expand Down Expand Up @@ -505,4 +546,3 @@ The :code:`resources` field has the same spec as a normal SkyPilot job; see `her
These settings will not take effect if you have an existing controller (either
stopped or live). For them to take effect, tear down the existing controller
first, which requires all in-progress jobs to finish or be canceled.

93 changes: 61 additions & 32 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -258,6 +258,67 @@ After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can

To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports <ports>`.

Customizing SkyPilot pods
-------------------------

You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`.
The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API <https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#pod-v1-core>`_. This will apply to all pods created by SkyPilot.

For example, to set custom environment variables and use GPUDirect RDMA, you can add the following to your :code:`~/.sky/config.yaml` file:

.. code-block:: yaml

    # ~/.sky/config.yaml
    kubernetes:
      pod_config:
        spec:
          containers:
            - env:        # Custom environment variables to set in pod
                - name: MY_ENV_VAR
                  value: MY_ENV_VALUE
              resources:  # Custom resources for GPUDirect RDMA
                requests:
                  rdma/rdma_shared_device_a: 1
                limits:
                  rdma/rdma_shared_device_a: 1

Similarly, you can attach `Kubernetes volumes <https://kubernetes.io/docs/concepts/storage/volumes/>`_ (e.g., an `NFS volume <https://kubernetes.io/docs/concepts/storage/volumes/#nfs>`_) directly to your SkyPilot pods:

.. code-block:: yaml

    # ~/.sky/config.yaml
    kubernetes:
      pod_config:
        spec:
          containers:
            - volumeMounts:  # Custom volume mounts for the pod
                - mountPath: /data
                  name: nfs-volume
          volumes:
            - name: nfs-volume
              nfs:  # Alternatively, use hostPath if your NFS is directly attached to the nodes
                server: nfs.example.com
                path: /nfs

.. tip::

As an alternative to setting ``pod_config`` globally, you can also set it on a per-task basis directly in your task YAML with the ``config_overrides`` :ref:`field <task-yaml-experimental>`.

   .. code-block:: yaml

      # task.yaml
      run: |
        python myscript.py

      # Set pod_config for this task
      experimental:
        config_overrides:
          pod_config:
            ...

FAQs
----

Expand Down Expand Up @@ -293,38 +354,6 @@ FAQs

You can use your existing observability tools to filter resources with the label :code:`parent=skypilot` (:code:`kubectl get pods -l 'parent=skypilot'`). As an example, follow the instructions :ref:`here <kubernetes-observability>` to deploy the Kubernetes Dashboard on your cluster.

* **How can I specify custom configuration for the pods created by SkyPilot?**

You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`.
The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API <https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#pod-v1-core>`_.

For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file:

  .. code-block:: yaml

      kubernetes:
        pod_config:
          spec:
            containers:
              - env:
                  - name: MY_ENV_VAR
                    value: MY_ENV_VALUE
                volumeMounts:  # Custom volume mounts for the pod
                  - mountPath: /foo
                    name: example-volume
                resources:     # Custom resource requests and limits
                  requests:
                    rdma/rdma_shared_device_a: 1
                  limits:
                    rdma/rdma_shared_device_a: 1
            volumes:
              - name: example-volume
                hostPath:
                  path: /tmp
                  type: Directory

  For more details, refer to :ref:`config-yaml`.

* **I am using a custom image. How can I speed up the pod startup time?**

You can pre-install SkyPilot dependencies in your custom image to speed up the pod startup time. Simply add these lines at the end of your Dockerfile:
Expand Down
27 changes: 23 additions & 4 deletions docs/source/reference/yaml-spec.rst
Original file line number Diff line number Diff line change
Expand Up @@ -176,9 +176,9 @@ Available fields:
# tpu_vm: True # True to use TPU VM (the default); False to use TPU node.
# Custom image id (optional, advanced). The image id used to boot the
# instances. Only supported for AWS and GCP (for non-docker image). If not
# specified, SkyPilot will use the default debian-based image suitable for
# machine learning tasks.
# instances. Only supported for AWS, GCP, OCI and IBM (for non-docker image).
# If not specified, SkyPilot will use the default debian-based image
# suitable for machine learning tasks.
#
# Docker support
# You can specify docker image to use by setting the image_id to
Expand All @@ -204,7 +204,7 @@ Available fields:
# image_id:
# us-east-1: ami-0729d913a335efca7
# us-west-2: ami-050814f384259894c
image_id: ami-0868a20f5a3bf9702
#
# GCP
# To find GCP images: https://cloud.google.com/compute/docs/images
# image_id: projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310
Expand All @@ -215,6 +215,24 @@ Available fields:
# To find Azure images: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
# image_id: microsoft-dsvm:ubuntu-2004:2004:21.11.04
#
# OCI
# To find OCI images: https://docs.oracle.com/en-us/iaas/images
# You can choose the image with OS version from the following image tags
# provided by SkyPilot:
# image_id: skypilot:gpu-ubuntu-2204
# image_id: skypilot:gpu-ubuntu-2004
# image_id: skypilot:gpu-oraclelinux9
# image_id: skypilot:gpu-oraclelinux8
# image_id: skypilot:cpu-ubuntu-2204
# image_id: skypilot:cpu-ubuntu-2004
# image_id: skypilot:cpu-oraclelinux9
# image_id: skypilot:cpu-oraclelinux8
#
# It is also possible to specify your custom image's OCID with OS type,
# for example:
# image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaaywwfvy67wwe7f24juvjwhyjn3u7g7s3wzkhduxcbewzaeki2nt5q:oraclelinux
# image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaa5tnuiqevhoyfnaa5pqeiwjv6w5vf6w4q2hpj3atyvu3yd6rhlhyq:ubuntu
#
# IBM
# Create a private VPC image and paste its ID in the following format:
# image_id: <unique_image_id>
Expand All @@ -224,6 +242,7 @@ Available fields:
# https://www.ibm.com/cloud/blog/use-ibm-packer-plugin-to-create-custom-images-on-ibm-cloud-vpc-infrastructure
# To use a more limited but easier to manage tool:
# https://github.com/IBM/vpc-img-inst
image_id: ami-0868a20f5a3bf9702
# Labels to apply to the instances (optional).
#
Expand Down
33 changes: 33 additions & 0 deletions examples/oci/gpu-oraclelinux9.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
name: gpu-task

resources:
# Optional; if left out, automatically pick the cheapest cloud.
cloud: oci

accelerators: A10:1

disk_size: 1024

disk_tier: high

image_id: skypilot:gpu-oraclelinux9


# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

num_nodes: 1

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
echo "*** Running setup. ***"
# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
echo "*** Running the task on OCI ***"
echo "hello, world"
nvidia-smi
echo "The task is completed."
33 changes: 33 additions & 0 deletions examples/oci/gpu-ubuntu-2204.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
name: gpu-task

resources:
# Optional; if left out, automatically pick the cheapest cloud.
cloud: oci

accelerators: A10:1

disk_size: 1024

disk_tier: high

image_id: skypilot:gpu-ubuntu-2204


# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

num_nodes: 1

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
echo "*** Running setup. ***"
# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
echo "*** Running the task on OCI ***"
echo "hello, world"
nvidia-smi
echo "The task is completed."
43 changes: 43 additions & 0 deletions sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -650,6 +650,42 @@ def _restore_block(new_block: Dict[str, Any], old_block: Dict[str, Any]):
return common_utils.dump_yaml_str(new_config)


def get_expirable_clouds(
        enabled_clouds: Sequence[clouds.Cloud]) -> List[clouds.Cloud]:
    """Returns a list of clouds that use local credentials and whose credentials can expire.

    This function checks each cloud in the provided sequence to determine if
    it uses local credentials and if its credentials can expire. If both
    conditions are met, the cloud is added to the list of expirable clouds.

    Args:
        enabled_clouds (Sequence[clouds.Cloud]): A sequence of cloud objects to check.

    Returns:
        list[clouds.Cloud]: A list of cloud objects that use local credentials
            and whose credentials can expire.
    """
    expirable_clouds = []
    local_credentials_value = schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value
    for cloud in enabled_clouds:
        remote_identities = skypilot_config.get_nested(
            (str(cloud).lower(), 'remote_identity'), None)
        if remote_identities is None:
            remote_identities = schemas.get_default_remote_identity(
                str(cloud).lower())

        local_credential_expiring = cloud.can_credential_expire()
        if isinstance(remote_identities, str):
            if remote_identities == local_credentials_value and local_credential_expiring:
                expirable_clouds.append(cloud)
        elif isinstance(remote_identities, list):
            for profile in remote_identities:
                if list(profile.values(
                ))[0] == local_credentials_value and local_credential_expiring:
                    expirable_clouds.append(cloud)
                    break
    return expirable_clouds


# TODO: too many things happening here - leaky abstraction. Refactor.
@timeline.event
def write_cluster_config(
Expand Down Expand Up @@ -926,6 +962,13 @@ def write_cluster_config(
tmp_yaml_path,
cluster_config_overrides=to_provision.cluster_config_overrides)
kubernetes_utils.combine_metadata_fields(tmp_yaml_path)
yaml_obj = common_utils.read_yaml(tmp_yaml_path)
pod_config = yaml_obj['available_node_types']['ray_head_default'][
'node_config']
valid, message = kubernetes_utils.check_pod_config(pod_config)
if not valid:
raise exceptions.InvalidCloudConfigs(
f'Invalid pod_config. Details: {message}')

if dryrun:
# If dryrun, return the unfinished tmp yaml path.
Expand Down
17 changes: 17 additions & 0 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@

import sky
from sky import backends
from sky import check as sky_check
from sky import cloud_stores
from sky import clouds
from sky import exceptions
Expand Down Expand Up @@ -1996,6 +1997,22 @@ def provision_with_retries(
skip_unnecessary_provisioning else None)

failover_history: List[Exception] = list()
# If the user is using local credentials which may expire, the
# controller may leak resources if the credentials expire while a job
# is running. Here we check the enabled clouds and expiring credentials
# and raise a warning to the user.
if task.is_controller_task():
enabled_clouds = sky_check.get_cached_enabled_clouds_or_refresh()
expirable_clouds = backend_utils.get_expirable_clouds(
enabled_clouds)

if len(expirable_clouds) > 0:
warnings = (f'\033[93mWarning: Credentials used for '
f'{expirable_clouds} may expire. Clusters may be '
f'leaked if the credentials expire while jobs '
f'are running. It is recommended to use credentials'
f' that never expire or a service account.\033[0m')
logger.warning(warnings)

# Retrying launchable resources.
while True:
Expand Down
5 changes: 4 additions & 1 deletion sky/backends/wheel_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -153,7 +153,10 @@ def _get_latest_modification_time(path: pathlib.Path) -> float:
if not path.exists():
return -1.
try:
return max(os.path.getmtime(root) for root, _, _ in os.walk(path))
return max(
os.path.getmtime(os.path.join(root, f))
for root, dirs, files in os.walk(path)
for f in (*dirs, *files))
except ValueError:
return -1.
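The change matters because a directory's mtime reflects only entries being created, renamed, or deleted in it, not edits to file contents, so scanning directory mtimes alone can miss a modified source file. A small self-contained sketch (hypothetical paths) contrasting the old and new scans:

```python
import os
import tempfile

def latest_mtime_dirs_only(path):
    # Old behavior: only directory mtimes are inspected.
    return max(os.path.getmtime(root) for root, _, _ in os.walk(path))

def latest_mtime_all_entries(path):
    # Fixed behavior: every file and subdirectory is considered.
    return max(
        os.path.getmtime(os.path.join(root, f))
        for root, dirs, files in os.walk(path)
        for f in (*dirs, *files))

with tempfile.TemporaryDirectory() as d:
    f = os.path.join(d, 'wheel.txt')
    with open(f, 'w') as fh:
        fh.write('v1')
    # Simulate the file being modified later, without touching the directory.
    future = int(os.path.getmtime(d)) + 1000
    os.utime(f, (future, future))
    assert latest_mtime_dirs_only(d) < future      # Misses the file change.
    assert latest_mtime_all_entries(d) >= future   # Catches it.
```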

Expand Down