Merge branch 'master' into dev/zeping/fix_smoke_tests
zpoint committed Jan 8, 2025
2 parents 1844001 + 2fa37ec commit 758c31a
Showing 29 changed files with 804 additions and 264 deletions.
63 changes: 0 additions & 63 deletions .github/workflows/test-poetry-build.yml

This file was deleted.

42 changes: 41 additions & 1 deletion docs/source/examples/managed-jobs.rst
@@ -152,6 +152,7 @@ The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting <sky-storage>` ensures
Note that the application code should save program checkpoints periodically and reload those states when the job is restarted.
This is typically achieved by reloading the latest checkpoint at the beginning of your program.


.. _spot-jobs-end-to-end:

An End-to-End Example
@@ -455,6 +456,46 @@ especially useful when there are many in-progress jobs to monitor, which the
terminal-based CLI may need more than one page to display.


.. _intermediate-bucket:

Intermediate storage for files
------------------------------

For managed jobs, SkyPilot requires an intermediate bucket to store files used in the task, such as local file mounts, temporary files, and the workdir.
If you do not configure a bucket, SkyPilot will automatically create a temporary bucket named :code:`skypilot-filemounts-{username}-{run_id}` for each job launch. SkyPilot automatically deletes the bucket after the job completes.

Alternatively, you can pre-provision a bucket and use it as intermediate storage for files by setting :code:`jobs.bucket` in :code:`~/.sky/config.yaml`:

.. code-block:: yaml

  # ~/.sky/config.yaml
  jobs:
    bucket: s3://my-bucket  # Supports s3://, gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>

If you choose to specify a bucket, ensure that the bucket already exists and that you have the necessary permissions.
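
The URL forms listed in the config comment above can be recognized with a small pre-flight helper before the value is written to :code:`jobs.bucket`. This is a minimal sketch and not part of SkyPilot: the function name ``classify_bucket_url`` and its return labels are hypothetical, and it only inspects the URL string. It does not verify that the bucket exists or that you have permissions on it.

```python
from urllib.parse import urlsplit

# URL schemes named in the config comment above. The returned labels are
# illustrative only and are not SkyPilot identifiers.
_PLAIN_SCHEMES = ("s3", "gs", "r2", "cos")
_AZURE_HOST_SUFFIX = ".blob.core.windows.net"


def classify_bucket_url(url: str) -> str:
    """Return a label for a supported bucket URL, or raise ValueError."""
    parts = urlsplit(url)
    # s3://bucket, gs://bucket, r2://bucket, cos://<region>/<bucket>
    if parts.scheme in _PLAIN_SCHEMES and parts.netloc:
        return parts.scheme
    # https://<azure_storage_account>.blob.core.windows.net/<container>
    if (parts.scheme == "https"
            and parts.netloc.endswith(_AZURE_HOST_SUFFIX)
            and parts.path.strip("/")):
        return "azure"
    raise ValueError(f"unsupported jobs.bucket URL: {url!r}")


if __name__ == "__main__":
    print(classify_bucket_url("s3://my-bucket"))
```

A check like this catches a mistyped scheme early; existence and permission checks still have to be done with the cloud provider's own tooling.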

When using a pre-provisioned intermediate bucket with :code:`jobs.bucket`, SkyPilot creates job-specific directories under the bucket root to store files. They are organized in the following structure:

.. code-block:: text

  # cloud bucket, s3://my-bucket/ for example
  my-bucket/
  ├── job-15891b25/            # Job-specific directory
  │   ├── local-file-mounts/   # Files from local file mounts
  │   ├── tmp-files/           # Temporary files
  │   └── workdir/             # Files from workdir
  └── job-cae228be/            # Another job's directory
      ├── local-file-mounts/
      ├── tmp-files/
      └── workdir/

When using a custom bucket (:code:`jobs.bucket`), the job-specific directories (e.g., :code:`job-15891b25/`) created by SkyPilot are removed when the job completes.

.. tip::

  Multiple users can share the same intermediate bucket. Each user's jobs will have their own unique job-specific directories, ensuring that files are kept separate and organized.
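
The per-job directory layout described in this section can be sketched as a small path-building helper, for example to locate a job's files while debugging. This is an illustration under stated assumptions, not SkyPilot's implementation: the function name ``job_prefixes`` is hypothetical, and the run ID (such as ``15891b25``) is taken as an input rather than derived the way SkyPilot derives it.

```python
from typing import Dict

# Subdirectory names taken from the layout shown in this section.
SUBDIRS = ("local-file-mounts", "tmp-files", "workdir")


def job_prefixes(bucket_url: str, run_id: str) -> Dict[str, str]:
    """Map each subdirectory name to its prefix under job-<run_id>/."""
    # Hypothetical helper: joins the bucket root with the job directory.
    root = f"{bucket_url.rstrip('/')}/job-{run_id}"
    return {name: f"{root}/{name}/" for name in SUBDIRS}


if __name__ == "__main__":
    for name, prefix in job_prefixes("s3://my-bucket", "15891b25").items():
        print(f"{name}: {prefix}")
```

Because every job writes under its own ``job-<run_id>/`` prefix, two users sharing one bucket do not collide as long as their run IDs differ.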


Concept: Jobs Controller
------------------------

@@ -505,4 +546,3 @@ The :code:`resources` field has the same spec as a normal SkyPilot job; see `her
These settings will not take effect if you have an existing controller (either
stopped or live). For them to take effect, tear down the existing controller
first, which requires all in-progress jobs to finish or be canceled.

2 changes: 1 addition & 1 deletion docs/source/getting-started/installation.rst
@@ -304,7 +304,7 @@ RunPod

.. code-block:: shell
pip install "runpod>=1.5.1"
pip install "runpod>=1.6.1"
runpod config
28 changes: 19 additions & 9 deletions docs/source/reference/config.rst
@@ -628,20 +628,30 @@ Available fields and semantics:
# Advanced OCI configurations (optional).
oci:
# A dict mapping region names to region-specific configurations, or
# `default` for the default configuration.
# `default` for the default/global configuration.
default:
# The OCID of the profile to use for launching instances (optional).
oci_config_profile: DEFAULT
# The OCID of the compartment to use for launching instances (optional).
# The profile name in ~/.oci/config to use for launching instances. If not
# set, the one named DEFAULT will be used (optional).
oci_config_profile: SKY_PROVISION_PROFILE
# The OCID of the compartment to use for launching instances. If not set,
# the root compartment will be used (optional).
compartment_ocid: ocid1.compartment.oc1..aaaaaaaahr7aicqtodxmcfor6pbqn3hvsngpftozyxzqw36gj4kh3w3kkj4q
# The image tag to use for launching general instances (optional).
image_tag_general: skypilot:cpu-ubuntu-2004
# The image tag to use for launching GPU instances (optional).
image_tag_gpu: skypilot:gpu-ubuntu-2004
# The default image tag to use for launching general instances (CPU) if the
# image_id parameter is not specified. If not set, the default is
# skypilot:cpu-ubuntu-2204 (optional).
image_tag_general: skypilot:cpu-oraclelinux8
# The default image tag to use for launching GPU instances if the image_id
# parameter is not specified. If not set, the default is
# skypilot:gpu-ubuntu-2204 (optional).
image_tag_gpu: skypilot:gpu-oraclelinux8
# Region-specific configurations
ap-seoul-1:
# The OCID of the VCN to use for instances (optional).
vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq
# The OCID of the subnet to use for instances (optional).
vcn_subnet: ocid1.subnet.oc1.ap-seoul-1.aaaaaaaa5c6wndifsij6yfyfehmi3tazn6mvhhiewqmajzcrlryurnl7nuja
us-ashburn-1:
vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq
vcn_subnet: ocid1.subnet.oc1.iad.aaaaaaaafbj7i3aqc4ofjaapa5edakde6g4ea2yaslcsay32cthp7qo55pxa
93 changes: 61 additions & 32 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -258,6 +258,67 @@ After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can

To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports <ports>`.

Customizing SkyPilot pods
-------------------------

You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`.
The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API <https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#pod-v1-core>`_. This will apply to all pods created by SkyPilot.

For example, to set custom environment variables and use GPUDirect RDMA, you can add the following to your :code:`~/.sky/config.yaml` file:

.. code-block:: yaml

  # ~/.sky/config.yaml
  kubernetes:
    pod_config:
      spec:
        containers:
          - env:                # Custom environment variables to set in pod
            - name: MY_ENV_VAR
              value: MY_ENV_VALUE
            resources:          # Custom resources for GPUDirect RDMA
              requests:
                rdma/rdma_shared_device_a: 1
              limits:
                rdma/rdma_shared_device_a: 1

Similarly, you can attach `Kubernetes volumes <https://kubernetes.io/docs/concepts/storage/volumes/>`_ (e.g., an `NFS volume <https://kubernetes.io/docs/concepts/storage/volumes/#nfs>`_) directly to your SkyPilot pods:

.. code-block:: yaml

  # ~/.sky/config.yaml
  kubernetes:
    pod_config:
      spec:
        containers:
          - volumeMounts:       # Custom volume mounts for the pod
              - mountPath: /data
                name: nfs-volume
        volumes:
          - name: nfs-volume
            nfs:                # Alternatively, use hostPath if your NFS is directly attached to the nodes
              server: nfs.example.com
              path: /nfs

.. tip::

  As an alternative to setting ``pod_config`` globally, you can also set it on a per-task basis directly in your task YAML with the ``config_overrides`` :ref:`field <task-yaml-experimental>`.

  .. code-block:: yaml

    # task.yaml
    run: |
      python myscript.py

    # Set pod_config for this task
    experimental:
      config_overrides:
        pod_config:
          ...

FAQs
----

@@ -293,38 +354,6 @@ FAQs

You can use your existing observability tools to filter resources with the label :code:`parent=skypilot` (:code:`kubectl get pods -l 'parent=skypilot'`). As an example, follow the instructions :ref:`here <kubernetes-observability>` to deploy the Kubernetes Dashboard on your cluster.

* **How can I specify custom configuration for the pods created by SkyPilot?**

You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`.
The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API <https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#pod-v1-core>`_.

For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file:

.. code-block:: yaml

  kubernetes:
    pod_config:
      spec:
        containers:
          - env:
            - name: MY_ENV_VAR
              value: MY_ENV_VALUE
            volumeMounts:       # Custom volume mounts for the pod
              - mountPath: /foo
                name: example-volume
            resources:          # Custom resource requests and limits
              requests:
                rdma/rdma_shared_device_a: 1
              limits:
                rdma/rdma_shared_device_a: 1
        volumes:
          - name: example-volume
            hostPath:
              path: /tmp
              type: Directory

For more details refer to :ref:`config-yaml`.

* **I am using a custom image. How can I speed up the pod startup time?**

You can pre-install SkyPilot dependencies in your custom image to speed up the pod startup time. Simply add these lines at the end of your Dockerfile:
27 changes: 23 additions & 4 deletions docs/source/reference/yaml-spec.rst
@@ -176,9 +176,9 @@ Available fields:
# tpu_vm: True # True to use TPU VM (the default); False to use TPU node.
# Custom image id (optional, advanced). The image id used to boot the
# instances. Only supported for AWS and GCP (for non-docker image). If not
# specified, SkyPilot will use the default debian-based image suitable for
# machine learning tasks.
# instances. Only supported for AWS, GCP, OCI and IBM (for non-docker image).
# If not specified, SkyPilot will use the default debian-based image
# suitable for machine learning tasks.
#
# Docker support
# You can specify docker image to use by setting the image_id to
@@ -204,7 +204,7 @@ Available fields:
# image_id:
# us-east-1: ami-0729d913a335efca7
# us-west-2: ami-050814f384259894c
image_id: ami-0868a20f5a3bf9702
#
# GCP
# To find GCP images: https://cloud.google.com/compute/docs/images
# image_id: projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310
@@ -215,6 +215,24 @@ Available fields:
# To find Azure images: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
# image_id: microsoft-dsvm:ubuntu-2004:2004:21.11.04
#
# OCI
# To find OCI images: https://docs.oracle.com/en-us/iaas/images
# You can choose the image with OS version from the following image tags
# provided by SkyPilot:
# image_id: skypilot:gpu-ubuntu-2204
# image_id: skypilot:gpu-ubuntu-2004
# image_id: skypilot:gpu-oraclelinux9
# image_id: skypilot:gpu-oraclelinux8
# image_id: skypilot:cpu-ubuntu-2204
# image_id: skypilot:cpu-ubuntu-2004
# image_id: skypilot:cpu-oraclelinux9
# image_id: skypilot:cpu-oraclelinux8
#
# It is also possible to specify your custom image's OCID with OS type,
# for example:
# image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaaywwfvy67wwe7f24juvjwhyjn3u7g7s3wzkhduxcbewzaeki2nt5q:oraclelinux
# image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaa5tnuiqevhoyfnaa5pqeiwjv6w5vf6w4q2hpj3atyvu3yd6rhlhyq:ubuntu
#
# IBM
# Create a private VPC image and paste its ID in the following format:
# image_id: <unique_image_id>
@@ -224,6 +242,7 @@ Available fields:
# https://www.ibm.com/cloud/blog/use-ibm-packer-plugin-to-create-custom-images-on-ibm-cloud-vpc-infrastructure
# To use a more limited but easier to manage tool:
# https://github.com/IBM/vpc-img-inst
image_id: ami-0868a20f5a3bf9702
# Labels to apply to the instances (optional).
#
33 changes: 33 additions & 0 deletions examples/oci/gpu-oraclelinux9.yaml
@@ -0,0 +1,33 @@
name: gpu-task

resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: oci

  accelerators: A10:1

  disk_size: 1024

  disk_tier: high

  image_id: skypilot:gpu-oraclelinux9


# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

num_nodes: 1

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "*** Running setup. ***"

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "*** Running the task on OCI ***"
  echo "hello, world"
  nvidia-smi
  echo "The task is completed."