[Docs] Key features and managed konduktor specs (#57)
* fix selectors for memory gauge
* include nv_peermem errors from dmesg
* lint
* format
* patch space in regex
* fix mem, add exporter ip to node mapping
* update docs
Showing 6 changed files with 366 additions and 21 deletions.
@@ -0,0 +1,100 @@

.. _architecture:

============
Architecture
============

.. figure:: ../images/architecture.png
   :width: 80%
   :align: center
   :alt: Trainy

Konduktor was built with the following objectives in mind:

#. ML engineers who can already train on multiple GPUs should be able to scale across nodes with little to no code changes and bring their favorite frameworks (PyTorch, Lightning, HuggingFace, DeepSpeed, etc.)
#. Support multi-tenancy and resource sharing via quotas
#. Observability and auto-healing to gracefully handle GPU/hardware errors

These goals led us to build on `Kubernetes <https://kubernetes.io/>`_ and to integrate with the following tools:

#. `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_ - supports easy scale-out over nodes with declarative resource requests

   .. code-block:: yaml
      :emphasize-lines: 4-4

      resources:
        accelerators: H100:8
      num_nodes: 100
      run: |
        num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
        master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
        python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
          --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
          --master_port=8008 resnet_ddp.py --num_epochs 20

#. `Kueue <https://kueue.sigs.k8s.io/>`_ - declarative resource quotas, sharing, and job pre-emption via workload queues/priorities

   - ML engineers only have to specify which queue they want to submit to.

     .. code-block:: yaml
        :emphasize-lines: 4-5

        resources:
          accelerators: H100:8
          labels:
            kueue.x-k8s.io/queue-name: user-queue # the queue assigned by your admin
            kueue.x-k8s.io/priority-class: high-priority # specify a high-priority workload
        num_nodes: 100
        run: |
          num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
          master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
          python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
            --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
            --master_port=8008 resnet_ddp.py --num_epochs 20

   - Cluster administrators can set GPU quotas by team via resource flavors and queues.

     .. code-block:: yaml
        :emphasize-lines: 27-28

        apiVersion: kueue.x-k8s.io/v1beta1
        kind: ResourceFlavor
        metadata:
          name: "default-flavor"
        ---
        apiVersion: kueue.x-k8s.io/v1beta1
        kind: ClusterQueue
        metadata:
          name: "cluster-queue"
        spec:
          preemption:
            reclaimWithinCohort: Any
            borrowWithinCohort:
              policy: LowerPriority
              maxPriorityThreshold: 100
            withinClusterQueue: LowerPriority
          namespaceSelector: {} # match all.
          resourceGroups:
          - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
            flavors:
            - name: "default-flavor"
              resources:
              - name: "cpu"
                nominalQuota: 10000
              - name: "memory"
                nominalQuota: 10000Gi
              - name: "nvidia.com/gpu"
                nominalQuota: 8 # REPLACE THIS
        ---
        apiVersion: kueue.x-k8s.io/v1beta1
        kind: LocalQueue
        metadata:
          name: "user-queue"
        spec:
          clusterQueue: "cluster-queue"

#. `Prometheus <https://prometheus.io/>`_, `Grafana <https://grafana.com/oss/grafana/>`_, `OpenTelemetry <https://opentelemetry.io/>`_, `Loki <https://grafana.com/oss/loki/>`_ - exporters for GPU metrics and kernel logs, plus metrics and logging backends for longer-term storage and querying. Health controllers run in a loop to detect and cordon faulty nodes, preventing jobs from repeatedly landing on bad hardware and ensuring continuous delivery, while cluster admins get immediate feedback on where and what is triggering an alert from a single location.
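
The cordoning step performed by the health controllers is roughly equivalent to marking a node unschedulable in Kubernetes. A minimal sketch of what that looks like by hand (the node name here is hypothetical):

.. code-block:: bash

   # mark the faulty node unschedulable so no new jobs land on it
   $ kubectl cordon gpu-node-42
   # confirm the node now reports SchedulingDisabled
   $ kubectl get node gpu-node-42
   # once the hardware is repaired, return it to the pool
   $ kubectl uncordon gpu-node-42
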
@@ -0,0 +1,61 @@

.. _getting_started:

===============
Getting Started
===============

To get access to your Trainy managed Konduktor clusters, work with your account manager to add all devices that
will require cluster access as client configurations in :code:`~/.sky/config.yaml`.
Trainy provides isolated access to clusters via `Tailscale <https://tailscale.com/>`_, which you
will need to `install <https://tailscale.com/kb/1347/installation>`_ on your development machine.
You can see and connect to the clusters you have access to with:

.. code-block:: bash

   # list clusters
   $ tailscale status
   100.85.126.7    awesomecorp-laptop  awesomecorp-laptop.taila1933c.ts.net  macOS   -
   100.95.60.42    awesomecorp-gke1    tagged-devices                        linux   idle, tx 39656 rx 1038824
   100.90.169.2    awesomecorp-gke2    tagged-devices                        linux   -

   # configure connection to a cluster
   $ tailscale configure kubeconfig awesomecorp-gke1

   # check that k8s credentials work
   $ sky check
   Checking credentials to enable clouds for SkyPilot.
     Kubernetes: enabled

Once you are connected, you can start `running jobs <usage/quickstart.html>`_ on your cluster.

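As a quick sanity check that scheduling works end to end, you can launch a throwaway job (a sketch; the cluster name :code:`hello` is arbitrary):

.. code-block:: bash

   # run a trivial command on the cluster, then tear the cluster down
   $ sky launch -y -c hello --cloud kubernetes "echo hello konduktor"
   $ sky down -y hello
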
===================
Node Specifications
===================

Trainy managed Konduktor clusters come preconfigured and validated with the right drivers and software
for running workloads on GPUs with high-performance networking enabled, so you can start training
at scale without having to configure, autoscale, or upgrade GPU infrastructure. The following clouds support
autoscaling:

- **GCP (a3-ultragpu), H100-80GB-MEGA:8, 1.6Tbps, 192vCPUs, 1TB RAM, 2TB disk** - **Available** ✅
- AWS on-demand/spot support, H100:8, 3.2Tbps - In progress 🚧
- Azure on-demand/spot support, H100:8, 3.2Tbps - In progress 🚧

On our autoscaling clusters, we currently only support :code:`H100:8` or :code:`H100-80GB-MEGA:8` instances, which
can be requested as follows:

.. code-block:: yaml

   num_nodes: 2 # scale up number of nodes
   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3 # specify your image
     accelerators: H100-80GB-MEGA:8 # specify the right gpu type
     cpus: 192+ # 192 CPUs
     memory: 1000+ # 1TB of RAM
     cloud: kubernetes
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority
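
Assuming you save the YAML above as :code:`task.yaml`, launching and monitoring it looks roughly like this (the cluster name is illustrative):

.. code-block:: bash

   # submit the task; -d returns immediately instead of streaming logs
   $ sky launch -y -d -c pretrain task.yaml
   # tail the job's logs once it has been admitted and scheduled
   $ sky logs pretrain
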
@@ -30,34 +30,78 @@ Welcome to Konduktor's documentation!

        <strong>Batch Jobs and Cluster Management for GPUs on Kubernetes</strong>
      </p>

Konduktor is a platform designed for running ML batch jobs and managing GPU clusters. Konduktor uses existing open source tools to build a platform that empowers ML engineers by abstracting away the details of resource scheduling so they can focus on modeling. Cluster administrators can set resource quotas and share capacity between projects, and use the built-in monitoring to track cluster-wide resource utilization and pending jobs, adjust quotas according to organizational priorities, reduce resource idling, and observe cluster GPU and fabric health.

- Easy scale-out, job queueing, and multi-node scheduling
- Share resources with quotas across projects via namespaces
- Track active and pending jobs, utilization, power usage, etc.
- Node-level metrics for monitoring cluster health

This documentation is targeted towards:

- ML engineers/researchers trying to launch training jobs on Konduktor, either managed by `Trainy <https://trainy.ai/>`_ or self-hosted
- GPU cluster administrators trying to self-host Konduktor

.. figure:: ./images/architecture.png
   :width: 80%
   :align: center
   :alt: Trainy

If you are interested in our managed offering, please contact us at [email protected]

------------
Key Features
------------

- 🚀 Easy scale-out, job queueing, and multi-node scheduling

  .. code-block:: shell

     # create a request
     $ sky launch -c dev task.yaml --num-nodes 100

- ☁ Multi-cloud access

  .. code-block:: shell

     # toggle cluster via region
     $ sky launch -c dev task.yaml --region gke-cluster

- Custom container support

  .. code-block:: yaml

     # task.yaml
     resources:
       image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
     run: |
       python train.py

- `Track active and pending jobs and utilization, power usage, etc. <admin/observability.html>`_
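
Once jobs are submitted, a quick way to see what is running and what is still queued (a sketch; the cluster name :code:`dev` matches the examples above):

.. code-block:: shell

   # list your clusters and their state
   $ sky status
   # list queued and running jobs on a specific cluster
   $ sky queue dev
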
----------------------------
Managed Features and Roadmap
----------------------------

- On-prem/reserved support - **Available** ✅
- GCP on-demand/spot support - **Available** ✅
- AWS on-demand/spot support - In progress 🚧
- Azure on-demand/spot support - In progress 🚧
- Multi-cluster submission - In progress 🚧

Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Managed Konduktor

   cloud/getting_started

.. toctree::
   :maxdepth: 1
   :caption: Job Scheduling

   usage/quickstart
   usage/priorities

.. toctree::
   :maxdepth: 1
   :caption: Self-hosted Cluster Administration

   admin/installation
   admin/observability
   admin/controller
   admin/architecture

External Links
@@ -0,0 +1,98 @@

.. _priorities:

==============================
Job Priorities and Pre-emption
==============================

Job priorities allow teams to enqueue development workloads while letting users preempt
lower priority workloads to free up resources for mission-critical, high priority
work. This page explains how to use job priorities with `Kueue <https://kueue.sigs.k8s.io/>`_ and `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_.

This tutorial requires that you install:

- Trainy SkyPilot: :code:`pip install trainy-skypilot-nightly[kubernetes]`
- `kubectl <https://kubernetes.io/docs/reference/kubectl/>`_

---------------------------------------------
Example: Using SkyPilot with Kueue Priorities
---------------------------------------------

Assuming your cluster administrator has provisioned GPU instances and given you
quota within your cluster, you can request GPUs by specifying:

- Workload queue: :code:`kueue.x-k8s.io/queue-name: user-queue`
- Workload priority: :code:`kueue.x-k8s.io/priority-class: low-priority`

Let's define a request for a single :code:`T4:4` instance.

.. code-block:: yaml
   :emphasize-lines: 5-6

   # low.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority # specify a low priority workload
   run: |
     echo "hi i'm a low priority job"
     sleep 1000000

Now you can launch the request:

.. code-block:: console

   # launch a low priority task
   $ sky launch -y -d -c low low.yaml
   # list workloads in kueue
   $ kubectl get workloads
   NAME         QUEUE        RESERVED IN   ADMITTED   FINISHED   AGE
   low-3ce1     user-queue                                       5m

While this workload is running, we can enqueue another, higher-priority task. If preempting lower
priority jobs frees enough room in the cluster to fulfill the higher priority workload, Kueue will delete
the lower priority workloads and launch the higher priority ones instead.

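If you want to confirm how Kueue resolved the priority class into a numeric priority before enqueueing the higher priority job, you can inspect the workload object (a sketch; the workload name comes from the listing above):

.. code-block:: console

   # show the priority class and the numeric priority Kueue assigned
   $ kubectl get workload low-3ce1 -o jsonpath='{.spec.priorityClassName} {.spec.priority}'
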
.. code-block:: yaml
   :emphasize-lines: 5-6

   # high.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is the same queue as above
       kueue.x-k8s.io/priority-class: high-priority # specify a high-priority workload
   run: |
     echo "hi i'm a high priority job"
     sleep 1000000

Now you can launch the same request, but this time with high priority:

.. code-block:: console

   # launch a high priority task
   $ sky launch -y -d -c high high.yaml
   # list workloads in kueue
   $ kubectl get workloads
   NAME         QUEUE        RESERVED IN   ADMITTED   FINISHED   AGE
   high-3ce1    user-queue                                       2m

.. tip::

   Pre-empted tasks are not requeued by default if you use :code:`sky launch`. To have your jobs retried,
   we recommend using :code:`sky jobs launch` instead, so that when a task is pre-empted, the SkyPilot
   jobs controller automatically resubmits your task to Kueue without manual intervention.
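
The :code:`low-priority` and :code:`high-priority` classes used above are Kueue :code:`WorkloadPriorityClass` objects that your cluster administrator defines. A minimal sketch of what they might look like (the names and values here are illustrative, not a prescribed configuration):

.. code-block:: yaml

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority
   value: 100        # lower value = first to be preempted
   description: "development and best-effort jobs"
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority
   value: 1000       # higher value wins admission and survives preemption
   description: "mission-critical jobs"
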
References
----------

- `The original guide <https://github.com/skypilot-org/skypilot/tree/k8s_kueue_example/examples/kueue>`_ by Romil Bhardwaj.
- `Kueue priority docs <https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/>`_