Skip to content

Commit

Permalink
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
Browse files Browse the repository at this point in the history
  • Loading branch information
Michaelvll committed Aug 23, 2024
2 parents b65011a + 8a0b1a1 commit 57260e2
Show file tree
Hide file tree
Showing 115 changed files with 3,276 additions and 1,054 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,6 @@ sky/clouds/service_catalog/data_fetchers/*.csv
.vscode/
.idea/
.env

# For editor files
*.swp
43 changes: 24 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,13 @@

</p>


<h3 align="center">
Run LLMs and AI on Any Cloud
Run AI on Any Infra — Unified, Faster, Cheaper
</h3>

----
:fire: *News* :fire:
- [Jul, 2024] [Finetune](./llm/llama-3_1-finetuning/) and [serve](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/)
- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Apr, 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/)
Expand All @@ -37,7 +37,6 @@
- [Nov, 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)

<details>
<summary>Archived</summary>
Expand All @@ -47,39 +46,42 @@
- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
- [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command!

</details>

----

SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.

SkyPilot **abstracts away cloud infra burdens**:
- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
SkyPilot **abstracts away infra burdens**:
- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
- Easy job management: queue, run, and auto-recover many jobs

SkyPilot **maximizes GPU availability for your jobs**:
* Provision in all zones/regions/clouds you have access to ([the _Sky_](https://arxiv.org/abs/2205.07147)), with automatic failover
SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry

SkyPilot **cuts your cloud costs**:
* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html): 3-6x cost savings using spot VMs, with auto-recovery from preemptions
* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): hands-free cleanup of idle clusters
SkyPilot **cuts your cloud costs & maximizes GPU availability**:
* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Install with pip (we recommend the nightly build for the latest features or [from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)):
Install with pip:
```bash
pip install "skypilot-nightly[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]" # choose your clouds
# Choose your clouds:
pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```
To get the last release, use:
To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
```bash
pip install -U "skypilot[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]" # choose your clouds
# Choose your clouds:
pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```

Current supported providers (AWS, Azure, GCP, OCI, Lambda Cloud, RunPod, Fluidstack, Paperspace, Cudo, IBM, Samsung, Cloudflare, any Kubernetes cluster):
[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png">
Expand Down Expand Up @@ -154,6 +156,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
<!-- Keep this section in sync with index.rst in SkyPilot Docs -->
Runnable examples:
- LLMs on SkyPilot
- [Llama 3.1 finetuning](./llm/llama-3_1-finetuning/) and [serving](./llm/llama-3_1/)
- [GPT-2 via `llm.c`](./llm/gpt-2/)
- [Llama 3](./llm/llama-3/)
- [Qwen](./llm/qwen/)
Expand All @@ -176,6 +179,8 @@ Runnable examples:
- Add yours here & see more in [`llm/`](./llm)!
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2) and [many more (`examples/`)](./examples).

Case Studies and Integrations: [Community Spotlights](https://blog.skypilot.co/community/)

Follow updates:
- [Twitter](https://twitter.com/skypilot_org)
- [Slack](http://slack.skypilot.co)
Expand Down
1 change: 1 addition & 0 deletions docs/source/_gallery_original/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ Contents
DBRX (Databricks) <llms/dbrx>
Llama-2 (Meta) <llms/llama-2>
Llama-3 (Meta) <llms/llama-3>
Llama-3.1 (Meta) <llms/llama-3_1>
Qwen (Alibaba) <llms/qwen>
CodeLlama (Meta) <llms/codellama>
Gemma (Google) <llms/gemma>
Expand Down
1 change: 1 addition & 0 deletions docs/source/_gallery_original/llms/llama-3_1.md
4 changes: 1 addition & 3 deletions docs/source/_static/custom.js
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.caption-text', text: 'SkyServe: Model Serving' },
{ selector: '.toctree-l1 > a', text: 'Managed Jobs' },
{ selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
{ selector: '.toctree-l1 > a', text: 'Ollama' },
{ selector: '.toctree-l1 > a', text: 'Llama-3 (Meta)' },
{ selector: '.toctree-l1 > a', text: 'Qwen (Alibaba)' },
{ selector: '.toctree-l1 > a', text: 'Llama-3.1 (Meta)' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
Expand Down
2 changes: 1 addition & 1 deletion docs/source/cloud-setup/cloud-permissions/aws.rst
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@ AWS accounts can be attached with a policy that limits the permissions of the ac
:align: center
:alt: AWS Add Policy

8. **Optional**: If you would like to have your users access S3 buckets: You can additionally attach S3 access, such as the "AmazonS3FullAccess" policy.
8. **Optional**: If you would like to have your users access S3 buckets: You can additionally attach S3 access, such as the "AmazonS3FullAccess" policy. Note that enabling S3 access is required to use :ref:`managed-jobs` with `workdir` or `file_mounts` for now.

.. image:: ../../images/screenshots/aws/aws-s3-policy.png
:width: 80%
Expand Down
71 changes: 48 additions & 23 deletions docs/source/docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,52 +2,53 @@ Welcome to SkyPilot!
====================

.. image:: /_static/SkyPilot_wide_dark.svg
:width: 65%
:width: 50%
:align: center
:alt: SkyPilot
:class: no-scaled-link, only-dark
.. image:: /_static/SkyPilot_wide_light.svg
:width: 60%
:width: 50%
:align: center
:alt: SkyPilot
:class: no-scaled-link, only-light


.. raw:: html

<p></p>
<p style="text-align:center">
<strong>Run AI on Any Infra</strong> — Unified, Faster, Cheaper
</p>
<p style="text-align:center">
<a class="github-button" href="https://github.com/skypilot-org/skypilot" data-show-count="true" data-size="large" aria-label="Star skypilot-org/skypilot on GitHub">Star</a>
<a class="github-button" href="https://github.com/skypilot-org/skypilot/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch skypilot-org/skypilot on GitHub">Watch</a>
<a class="github-button" href="https://github.com/skypilot-org/skypilot/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork skypilot-org/skypilot on GitHub">Fork</a>
<br>
<a class="reference external image-reference" style="vertical-align:9.5px" href="http://slack.skypilot.co"><img src="https://img.shields.io/badge/SkyPilot-Join%20Slack-blue?logo=slack" style="height:27px"></a>
<script async defer src="https://buttons.github.io/buttons.js"></script>
</p>

<p style="text-align:center">
<strong>Run LLMs and AI on Any Cloud</strong>
</p>

SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.

SkyPilot **abstracts away cloud infra burdens**:
SkyPilot **abstracts away infra burdens**:

- Launch jobs & clusters on any cloud
- Easy scale-out: queue and run many jobs, automatically managed
- Easy access to object stores (S3, GCS, R2)
- Launch :ref:`dev clusters <dev-cluster>`, :ref:`jobs <managed-jobs>`, and :ref:`serving <sky-serve>` on any infra
- Easy job management: queue, run, and auto-recover many jobs

SkyPilot **maximizes GPU availability for your jobs**:
SkyPilot **supports multiple clusters, clouds, and hardware** (`the Sky <https://arxiv.org/abs/2205.07147>`_):

* Provision in all zones/regions/clouds you have access to (`the Sky <https://arxiv.org/abs/2205.07147>`_), with automatic failover
- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
- :ref:`Flexible provisioning <auto-failover>` of GPUs, TPUs, CPUs, with auto-retry

SkyPilot **cuts your cloud costs**:
SkyPilot **cuts your cloud costs & maximizes GPU availability**:

* `Managed Spot <https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html>`_: 3-6x cost savings using spot VMs, with auto-recovery from preemptions
* Optimizer: 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
* `Autostop <https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html>`_: hands-free cleanup of idle clusters
* :ref:`Autostop <auto-stop>`: automatic cleanup of idle resources
* :ref:`Managed Spot <managed-jobs>`: 3-6x cost savings using spot instances, with preemption auto-recovery
* :ref:`Optimizer <auto-failover>`: 2x cost savings by auto-picking the cheapest & most available infra

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Current supported providers (AWS, GCP, Azure, OCI, Lambda Cloud, RunPod, Fluidstack, Cudo, IBM, Samsung, Cloudflare, VMware vSphere, any Kubernetes cluster):
:ref:`Current supported infra <installation>` (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):

.. raw:: html

Expand All @@ -58,17 +59,28 @@ Current supported providers (AWS, GCP, Azure, OCI, Lambda Cloud, RunPod, Fluidst
</picture>
</p>

More Information
--------------------------
Ready to get started?
----------------------

Tutorials: `SkyPilot Tutorials <https://github.com/skypilot-org/skypilot-tutorial>`_
:ref:`Install SkyPilot <installation>` in ~1 minute. Then, launch your first dev cluster in ~5 minutes in :ref:`Quickstart <quickstart>`.

Everything is launched within your cloud accounts, VPCs, and cluster(s).

Contact the SkyPilot team
---------------------------------

You can chat with the SkyPilot team and community on the `SkyPilot Slack <http://slack.skypilot.co>`_.

Learn more
--------------------------

Runnable examples:

.. Keep this section in sync with README.md in SkyPilot repo
* **LLMs on SkyPilot**

* `Llama 3.1 finetuning <https://github.com/skypilot-org/skypilot/tree/master/llm/llama-3_1-finetuning>`_ and `serving <https://github.com/skypilot-org/skypilot/tree/master/llm/llama-3_1>`_
* `GPT-2 via llm.c <https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2>`_
* `Llama 3 <https://github.com/skypilot-org/skypilot/tree/master/llm/llama-3>`_
* `Qwen <https://github.com/skypilot-org/skypilot/tree/master/llm/qwen>`_
Expand All @@ -92,6 +104,10 @@ Runnable examples:

* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `DeepSpeed <https://github.com/skypilot-org/skypilot/blob/master/examples/deepspeed-multinode/sky.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `NeMo <https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo_gpt_train.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, `Cog <https://github.com/skypilot-org/skypilot/blob/master/examples/cog/>`_, `Unsloth <https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml>`_, `Ollama <https://github.com/skypilot-org/skypilot/blob/master/llm/ollama>`_, `llm.c <https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2>`__ and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.

Case Studies and Integrations: `Community Spotlights <https://blog.skypilot.co/community/>`_

Tutorials: `SkyPilot Tutorials <https://github.com/skypilot-org/skypilot-tutorial>`_

Follow updates:

* `Twitter <https://twitter.com/skypilot_org>`_
Expand All @@ -105,10 +121,9 @@ Read the research:
* `Sky Computing vision paper <https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s02-stoica.pdf>`_ (HotOS 2021)


Contents
--------

.. toctree::
:hidden:
:maxdepth: 1
:caption: Getting Started

Expand All @@ -119,6 +134,7 @@ Contents


.. toctree::
:hidden:
:maxdepth: 1
:caption: Running Jobs

Expand All @@ -129,6 +145,7 @@ Contents
../running-jobs/distributed-jobs

.. toctree::
:hidden:
:maxdepth: 1
:caption: SkyServe: Model Serving

Expand All @@ -137,6 +154,7 @@ Contents
../serving/service-yaml-spec

.. toctree::
:hidden:
:maxdepth: 1
:caption: Cutting Cloud Costs

Expand All @@ -145,13 +163,15 @@ Contents
../reference/benchmark/index

.. toctree::
:hidden:
:maxdepth: 1
:caption: Using Data

../examples/syncing-code-artifacts
../reference/storage

.. toctree::
:hidden:
:maxdepth: 1
:caption: User Guides

Expand All @@ -161,16 +181,20 @@ Contents
../reference/tpu
../reference/logging
../reference/faq
SkyPilot vs. Other Systems <../reference/comparison>



.. toctree::
:hidden:
:maxdepth: 1
:caption: Developer Guides

../developers/CONTRIBUTING
Guide: Adding a New Cloud <https://docs.google.com/document/d/1oWox3qb3Kz3wXXSGg9ZJWwijoa99a3PIQUHBR8UgEGs/edit?usp=sharing>

.. toctree::
:hidden:
:maxdepth: 1
:caption: Cloud Admin and Usage

Expand All @@ -179,6 +203,7 @@ Contents
../cloud-setup/quota

.. toctree::
:hidden:
:maxdepth: 1
:caption: References

Expand Down
3 changes: 2 additions & 1 deletion docs/source/examples/docker-containers.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@ SkyPilot can run a container either as a task, or as the runtime environment of

.. note::

Running docker containers is `not supported on RunPod <https://docs.runpod.io/references/faq#can-i-run-my-own-docker-daemon-on-runpod>`_. To use RunPod, use ``setup`` and ``run`` to configure your environment. See `GitHub issue <https://github.com/skypilot-org/skypilot/issues/3096#issuecomment-2150559797>`_ for more.
Running docker containers is `not supported on RunPod <https://docs.runpod.io/references/faq#can-i-run-my-own-docker-daemon-on-runpod>`_. To use RunPod, either use your docker image (the username should be ``root`` for RunPod) :ref:`as a runtime environment <docker-containers-as-runtime-environments>` or use ``setup`` and ``run`` to configure your environment. See `GitHub issue <https://github.com/skypilot-org/skypilot/issues/3096#issuecomment-2150559797>`_ for more.


.. _docker-containers-as-tasks:

Expand Down
Loading

0 comments on commit 57260e2

Please sign in to comment.