[Docs] Key features and managed konduktor specs (#57)
* fix selectors for memory gauge

* include nv_peermem errors from dmesg

* lint

* format

* patch space in regex

* fix mem, add exporter ip to node mapping

* update docs
asaiacai authored Sep 17, 2024
1 parent 56dc90e commit ffecd93
Showing 6 changed files with 366 additions and 21 deletions.
100 changes: 100 additions & 0 deletions docs/source/admin/architecture.rst
@@ -0,0 +1,100 @@
.. _architecture:

============
Architecture
============

.. figure:: ../images/architecture.png
:width: 80%
:align: center
:alt: Trainy

Konduktor was built with the following objectives in mind.

#. ML Engineers who can already train on multiple GPUs should be able to scale across nodes with little to no code changes, bringing their favorite frameworks (PyTorch, Lightning, HuggingFace, DeepSpeed, etc.)
#. Support multi-tenancy and resource sharing via quotas
#. Observability and Auto-healing to gracefully handle GPU/hardware errors

which led us to build on `Kubernetes <https://kubernetes.io/>`_ and to integrate with the following tools:

#. `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_ - supports easy scaleout over nodes with declarative resource requests

.. code-block:: yaml
   :emphasize-lines: 4-4

   resources:
     accelerators: H100:8

   num_nodes: 100
   run: |
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
       --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20
#. `Kueue <https://kueue.sigs.k8s.io/>`_ - declarative resource quotas, sharing, and job pre-emption via workload queues/priorities

- ML Engineers only have to specify which queues they want to submit to
.. code-block:: yaml
   :emphasize-lines: 4-5

   resources:
     accelerators: H100:8
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is the same as the queue above
       kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload

   num_nodes: 100
   run: |
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
       --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20
- Cluster administrators can set GPU quotas by team via resource flavors and queues.
.. code-block:: yaml
   :emphasize-lines: 27-28

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: "default-flavor"
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: "cluster-queue"
   spec:
     preemption:
       reclaimWithinCohort: Any
       borrowWithinCohort:
         policy: LowerPriority
         maxPriorityThreshold: 100
       withinClusterQueue: LowerPriority
     namespaceSelector: {} # match all.
     resourceGroups:
     - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
       flavors:
       - name: "default-flavor"
         resources:
         - name: "cpu"
           nominalQuota: 10000
         - name: "memory"
           nominalQuota: 10000Gi
         - name: "nvidia.com/gpu"
           nominalQuota: 8 # REPLACE THIS
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: "user-queue"
   spec:
     clusterQueue: "cluster-queue"
#. `Prometheus <https://prometheus.io/>`_, `Grafana <https://grafana.com/oss/grafana/>`_, `OpenTelemetry <https://opentelemetry.io/>`_, `Loki <https://grafana.com/oss/loki/>`_: exporters for GPU metrics and kernel logs, together with metrics and logging backends for longer-term storage and querying. Health controllers run in a loop to detect and cordon faulty nodes, preventing jobs from repeatedly landing on bad hardware and ensuring continuous delivery, while cluster admins get immediate feedback from one location on where and what is causing an alert.
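
As a concrete example of how these pieces fit together, below is a minimal sketch of a Prometheus alerting rule over the :code:`DCGM_FI_DEV_XID_ERRORS` metric exposed by the DCGM exporter; the rule name, threshold, and severity are illustrative assumptions rather than the exact rules Konduktor ships with.

.. code-block:: yaml

   # Hypothetical alerting rule: fire when a GPU reports an XID error so that
   # a health controller (or an admin) can cordon the affected node.
   groups:
     - name: gpu-health
       rules:
         - alert: GpuXidError
           expr: DCGM_FI_DEV_XID_ERRORS > 0
           for: 1m
           labels:
             severity: critical
           annotations:
             summary: "XID error on {{ $labels.Hostname }} (GPU {{ $labels.gpu }})"
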
8 changes: 8 additions & 0 deletions docs/source/admin/installation.rst
@@ -12,6 +12,8 @@ This section is for k8s admins who are first deploying the necessary resources o
- `OpenTelemetry <https://opentelemetry.io/>`_ - Log publishing
- `Kueue <https://kueue.sigs.k8s.io/>`_ - workload scheduling and resource quotas/sharing

For a more thorough explanation of the Konduktor stack, see :doc:`architecture`.

Prerequisites
=============

@@ -65,6 +67,12 @@ Installing the DCGM exporter is best handled using NVIDIA's `gpu-operator <https
nvidia-driver-daemonset-fvx9z 1/1 Running 0 9d
nvidia-operator-validator-62dhx 1/1 Running 0 14d
.. tip::

This guide currently works for on-prem bare metal deployments.
We are still validating how to deploy :code:`nvidia-dcgm-exporter`
on managed k8s solutions like AWS's **EKS** and Google's **GKE**. Stay tuned for updates!

Prometheus-Grafana Stack
------------------------

61 changes: 61 additions & 0 deletions docs/source/cloud/getting_started.rst
@@ -0,0 +1,61 @@
.. _getting_started:

===============
Getting Started
===============

To get access to your Trainy: Konduktor clusters, work with your account manager to add all devices that
will require cluster access as client configurations in :code:`~/.sky/config.yaml`.
Trainy provides isolated access to clusters via `Tailscale <https://tailscale.com/>`_, which you
will need to `install <https://tailscale.com/kb/1347/installation>`_ on your development machine.
You can see and connect to the clusters you have access to with:

.. code-block:: bash
# list clusters
$ tailscale status
100.85.126.7 awesomecorp-laptop awesomecorp-laptop.taila1933c.ts.net macOS -
100.95.60.42 awesomecorp-gke1 tagged-devices linux idle, tx 39656 rx 1038824
100.90.169.2 awesomecorp-gke2 tagged-devices linux -
# configure connection to a cluster
$ tailscale configure kubeconfig awesomecorp-gke1
# check that k8s credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
Once you are connected, you can start `running jobs <usage/quickstart.html>`_ on your cluster.

===================
Node Specifications
===================

Trainy managed Konduktor comes with clusters preconfigured and validated with the right drivers and software
for running GPU workloads with high-performance networking enabled, so you can start training
at scale without having to configure, autoscale, or upgrade GPU infrastructure yourself. The following clouds support
autoscaling:

- **GCP (a3-ultragpu), H100-80GB-MEGA:8, 1.6Tbps, 192vCPUs, 1TB RAM, 2TB disk** - **Available** ✅
- AWS on-demand/spot support, H100:8 3.2Tbps - In progress 🚧
- Azure on-demand/spot support, H100:8 3.2Tbps - In progress 🚧

On our autoscaling clusters, we currently only support :code:`H100:8` or :code:`H100-80GB-MEGA:8` instances, which
can be requested as follows:

.. code-block:: yaml

   num_nodes: 2 # scale up number of nodes
   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3 # specify your image
     accelerators: H100-80GB-MEGA:8 # specify the right gpu type
     cpus: 192+ # 192 CPUs
     memory: 1000+ # 1TB of RAM
     cloud: kubernetes
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority
72 changes: 58 additions & 14 deletions docs/source/index.rst
@@ -30,34 +30,78 @@ Welcome to Konduktor's documentation!
<strong>Batch Jobs and Cluster Management for GPUs on Kubernetes</strong>
</p>

Konduktor is a platform designed for running ML batch jobs and managing GPU clusters. Konduktor uses existing open source tools to build a platform that empowers ML engineers by abstracting away the details of resource scheduling so they can focus on modeling. Cluster administrators will enjoy setting resource quotas and sharing between projects, as well as built-in monitoring to track cluster-wide resource utilization and pending jobs, so they can adjust quotas according to organizational priorities, reduce resource idling, and observe cluster GPU and fabric health.
Konduktor is a platform designed for running ML batch jobs and managing GPU clusters.
This documentation is targeted towards:

- Easy scale out and job queueing and multi-node scheduling
- Share resources with quotas across projects via namespaces
- Track active and pending jobs and utilization, power usage, etc.
- Node level metrics for monitoring cluster health
- ML Engineers/researchers trying to launch training jobs on Konduktor, either managed by `Trainy <https://trainy.ai/>`_ or self-hosted
- GPU cluster administrators trying to self-host Konduktor

.. figure:: ./images/architecture.png
:width: 80%
:align: center
:alt: Trainy
If you are interested in our managed offering, please contact us at [email protected].

------------
Key Features
------------

- 🚀 Easy scale-out, job queueing, and multi-node scheduling

.. code-block:: shell
# create a request
$ sky launch -c dev task.yaml --num-nodes 100
- ☁ Multi-cloud access

.. code-block:: shell
# toggle cluster via region
$ sky launch -c dev task.yaml --region gke-cluster
- Custom container support

.. code-block:: yaml

   # task.yaml
   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
   run: |
     python train.py
- `Track active and pending jobs and utilization, power usage, etc. <admin/observability.html>`_
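
The utilization and power numbers come from the DCGM exporter, and pending-job counts from Kueue. As a rough sketch (assuming the stock :code:`kueue_pending_workloads`, :code:`DCGM_FI_DEV_GPU_UTIL`, and :code:`DCGM_FI_DEV_POWER_USAGE` metrics; the rule names are hypothetical), cluster-wide rollups can be precomputed with Prometheus recording rules:

.. code-block:: yaml

   # Illustrative recording rules aggregating per-queue and per-GPU metrics
   groups:
     - name: konduktor-usage
       rules:
         - record: cluster:pending_workloads:sum
           expr: sum(kueue_pending_workloads)
         - record: cluster:gpu_utilization:avg
           expr: avg(DCGM_FI_DEV_GPU_UTIL)
         - record: cluster:gpu_power_watts:sum
           expr: sum(DCGM_FI_DEV_POWER_USAGE)
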

----------------------------
Managed Features and Roadmap
----------------------------
- On-prem/reserved support - **Available** ✅
- GCP on-demand/spot support - **Available** ✅
- AWS on-demand/spot support - In progress 🚧
- Azure on-demand/spot support - In progress 🚧
- Multi-cluster submission - In progress 🚧

Documentation
-------------

.. toctree::
:maxdepth: 1
:caption: Cluster Administration

admin/installation
admin/observability
admin/controller
:caption: Managed Konduktor

cloud/getting_started

.. toctree::
:maxdepth: 1
:caption: Job Scheduling

usage/quickstart
usage/priorities

.. toctree::
:maxdepth: 1
:caption: Self-hosted Cluster Administration

admin/installation
admin/observability
admin/controller
admin/architecture


External Links
98 changes: 98 additions & 0 deletions docs/source/usage/priorities.rst
@@ -0,0 +1,98 @@
.. _priorities:

==============================
Job Priorities and Pre-emption
==============================

Job priority allows teams to enqueue development workloads while letting users preempt lower priority
jobs to free up resources for mission-critical, high-priority workloads. This page explains how to use
job priorities with `Kueue <https://kueue.sigs.k8s.io/>`_ and `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_.

This tutorial requires that you install:

- Trainy skypilot: :code:`pip install trainy-skypilot-nightly[kubernetes]`
- `kubectl <https://kubernetes.io/docs/reference/kubectl/>`_

---------------------------------------------
Example: Using Skypilot with Kueue Priorities
---------------------------------------------

Assuming your cluster administrator has provisioned GPU instances and given you
quota within your cluster, you can request GPUs by specifying:

- Workload queue: :code:`kueue.x-k8s.io/queue-name: user-queue`
- Workload priority: :code:`kueue.x-k8s.io/priority-class: low-priority`
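
The priority class names themselves are defined by your cluster administrator as Kueue :code:`WorkloadPriorityClass` objects. As a minimal sketch (the names and values here are illustrative and may differ on your cluster):

.. code-block:: yaml

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority
   value: 100                 # higher value wins during preemption
   description: "high priority workloads"
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority
   value: 10
   description: "preemptible development workloads"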

Let's define a request for a single :code:`T4:4` instance.

.. code-block:: yaml
   :emphasize-lines: 5-6

   # low.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority # specify low priority workload
   run: |
     echo "hi i'm a low priority job"
     sleep 1000000
Now you can launch the request:

.. code-block:: console
# launch a low priority task
$ sky launch -y -d -c low low.yaml
# list workloads in kueue
$ kubectl get workloads
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
low-3ce1 user-queue 5m
While this workload is running, we can enqueue another, higher-priority task. If there is room in the cluster
to fulfill the higher priority workload by preempting lower priority jobs, Kueue will delete the lower
priority workloads and launch the higher priority ones instead.

.. code-block:: yaml
   :emphasize-lines: 5-6

   # high.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is the same as the queue above
       kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload
   run: |
     echo "hi i'm a high priority job"
     sleep 1000000
Now you can launch the request again, this time with high priority:

.. code-block:: console
# launch a high priority task
$ sky launch -y -d -c high high.yaml
# list workloads in kueue
$ kubectl get workloads
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
high-3ce1 user-queue 2m
.. tip::

Pre-empted tasks are not requeued by default if you use :code:`sky launch`. To have your jobs retried,
we recommend using :code:`sky jobs launch` instead, so that when a task is pre-empted, the SkyPilot
job controller automatically resubmits your task to Kueue without manual intervention.

References
----------

- `The original guide <https://github.com/skypilot-org/skypilot/tree/k8s_kueue_example/examples/kueue>`_ by Romil Bhardwaj.
- `Kueue priority docs <https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/>`_
