[Docs] Key features and managed konduktor specs (#57)
* fix selectors for memory gauge

* include nv_peermem errors from dmesg

* lint

* format

* patch space in regex

* fix mem, add exporter ip to node mapping

* update docs
asaiacai authored Sep 17, 2024
1 parent 56dc90e commit ffecd93
Showing 6 changed files with 366 additions and 21 deletions.
100 changes: 100 additions & 0 deletions docs/source/admin/architecture.rst
@@ -0,0 +1,100 @@
.. _architecture:

============
Architecture
============

.. figure:: ../images/architecture.png
:width: 80%
:align: center
:alt: Trainy

Konduktor was built with the following objectives in mind.

#. ML Engineers who can already train on multiple GPUs should be able to scale across nodes with little to no code changes, bringing their favorite frameworks (PyTorch, Lightning, HuggingFace, DeepSpeed, etc.)
#. Support multi-tenancy and resource sharing via quotas
#. Observability and Auto-healing to gracefully handle GPU/hardware errors

which led us to build on `Kubernetes <https://kubernetes.io/>`_ and to integrate with the following tools:

#. `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_ - supports easy scaleout over nodes with declarative resource requests

.. code-block:: yaml
   :emphasize-lines: 4-4

   resources:
     accelerators: H100:8

   num_nodes: 100
   run: |
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
       --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20
#. `Kueue <https://kueue.sigs.k8s.io/>`_ - declarative resource quotas, sharing, and job pre-emption via workload queues/priorities

- ML Engineers only have to specify which queues they want to submit to
.. code-block:: yaml
   :emphasize-lines: 4-5

   resources:
     accelerators: H100:8
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is the same as the queue above
       kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload

   num_nodes: 100
   run: |
     num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
     master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
     python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
       --nnodes=$num_nodes --node_rank=$SKYPILOT_NODE_RANK --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20
- Cluster administrators can set GPU quotas by team via resource flavors and queues.
.. code-block:: yaml
   :emphasize-lines: 27-28

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: "default-flavor"
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: "cluster-queue"
   spec:
     preemption:
       reclaimWithinCohort: Any
       borrowWithinCohort:
         policy: LowerPriority
         maxPriorityThreshold: 100
       withinClusterQueue: LowerPriority
     namespaceSelector: {} # match all.
     resourceGroups:
     - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
       flavors:
       - name: "default-flavor"
         resources:
         - name: "cpu"
           nominalQuota: 10000
         - name: "memory"
           nominalQuota: 10000Gi
         - name: "nvidia.com/gpu"
           nominalQuota: 8 # REPLACE THIS
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: "user-queue"
   spec:
     clusterQueue: "cluster-queue"
#. `Prometheus <https://prometheus.io/>`_, `Grafana <https://grafana.com/oss/grafana/>`_, `OpenTelemetry <https://opentelemetry.io/>`_, `Loki <https://grafana.com/oss/loki/>`_: exporters for GPU metrics and kernel logs, together with metrics and logging backends for longer-term storage and querying. Health controllers run in a loop to detect and cordon faulty nodes, preventing jobs from repeatedly landing on bad hardware and ensuring continuous delivery, while cluster admins get immediate feedback from one location on where and what is causing an alert.
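
As a concrete example of how these pieces fit together, below is a minimal sketch of a Prometheus alerting rule over the :code:`DCGM_FI_DEV_XID_ERRORS` metric exposed by the DCGM exporter; the rule name, threshold, and severity are illustrative assumptions rather than the exact rules Konduktor ships with.

.. code-block:: yaml

   # Hypothetical alerting rule: fire when a GPU reports an XID error so that
   # a health controller (or an admin) can cordon the affected node.
   groups:
     - name: gpu-health
       rules:
         - alert: GpuXidError
           expr: DCGM_FI_DEV_XID_ERRORS > 0
           for: 1m
           labels:
             severity: critical
           annotations:
             summary: "XID error on {{ $labels.Hostname }} (GPU {{ $labels.gpu }})"
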
8 changes: 8 additions & 0 deletions docs/source/admin/installation.rst
@@ -12,6 +12,8 @@ This section is for k8s admins who are first deploying the necessary resources o
- `OpenTelemetry <https://opentelemetry.io/>`_ - Log publishing
- `Kueue <https://kueue.sigs.k8s.io/>`_ - workload scheduling and resource quotas/sharing

For a more thorough explanation of the Konduktor stack, see :doc:`architecture`.

Prerequisites
=============

@@ -65,6 +67,12 @@ Installing the DCGM exporter is best handled using NVIDIA's `gpu-operator <https
nvidia-driver-daemonset-fvx9z 1/1 Running 0 9d
nvidia-operator-validator-62dhx 1/1 Running 0 14d
.. tip::

This guide currently works for on-prem bare metal deployments.
We are still validating how to deploy :code:`nvidia-dcgm-exporter`
on managed k8s solutions like AWS's **EKS** and Google's **GKE**. Stay tuned for updates!

Prometheus-Grafana Stack
------------------------

61 changes: 61 additions & 0 deletions docs/source/cloud/getting_started.rst
@@ -0,0 +1,61 @@
.. _getting_started:

===============
Getting Started
===============

To get access to your Trainy: Konduktor clusters, work with your account manager to add all devices that
will require cluster access as client configurations in :code:`~/.sky/config.yaml`.
Trainy provides isolated access to clusters via `Tailscale <https://tailscale.com/>`_, which you
will need to `install <https://tailscale.com/kb/1347/installation>`_ on your development machine.
You can see and connect to the clusters you have access to with:

.. code-block:: bash
# list clusters
$ tailscale status
100.85.126.7 awesomecorp-laptop awesomecorp-laptop.taila1933c.ts.net macOS -
100.95.60.42 awesomecorp-gke1 tagged-devices linux idle, tx 39656 rx 1038824
100.90.169.2 awesomecorp-gke2 tagged-devices linux -
# configure connection to a cluster
$ tailscale configure kubeconfig awesomecorp-gke1
# check that k8s credentials work
$ sky check
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
Once you are connected, you can start `running jobs <usage/quickstart.html>`_ on your cluster.

===================
Node Specifications
===================

Trainy managed Konduktor comes with clusters preconfigured and validated with the right drivers and software
for running GPU workloads with high-performance networking enabled, so you can start training
at scale without having to configure, autoscale, or upgrade GPU infrastructure yourself. The following clouds support
autoscaling:

- **GCP (a3-ultragpu), H100-80GB-MEGA:8, 1.6Tbps, 192vCPUs, 1TB RAM, 2TB disk** - **Available** ✅
- AWS on-demand/spot support, H100:8 3.2Tbps - In progress 🚧
- Azure on-demand/spot support, H100:8 3.2Tbps - In progress 🚧

On our autoscaling clusters, we currently only support :code:`H100:8` or :code:`H100-80GB-MEGA:8` instances, which
can be requested as follows:

.. code-block:: yaml

   num_nodes: 2 # scale up number of nodes
   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3 # specify your image
     accelerators: H100-80GB-MEGA:8 # specify the right gpu type
     cpus: 192+ # 192 CPUs
     memory: 1000+ # 1TB of RAM
     cloud: kubernetes
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority
72 changes: 58 additions & 14 deletions docs/source/index.rst
@@ -30,34 +30,78 @@ Welcome to Konduktor's documentation!
<strong>Batch Jobs and Cluster Management for GPUs on Kubernetes</strong>
</p>

Konduktor is a platform designed for running ML batch jobs and managing GPU clusters. Konduktor uses existing open source tools to build a platform that empowers ML engineers by abstracting away the details of resource scheduling so they can focus on modeling. Cluster administrators will enjoy setting resource quotas and sharing between projects, as well as built-in monitoring to track cluster-wide resource utilization and pending jobs, so they can adjust quotas according to organizational priorities, reduce resource idling, and observe cluster GPU and fabric health.
Konduktor is a platform designed for running ML batch jobs and managing GPU clusters.
This documentation is targeted towards:

- Easy scale out and job queueing and multi-node scheduling
- Share resources with quotas across projects via namespaces
- Track active and pending jobs and utilization, power usage, etc.
- Node level metrics for monitoring cluster health
- ML Engineers/researchers trying to launch training jobs on Konduktor, either managed by `Trainy <https://trainy.ai/>`_ or self-hosted
- GPU cluster administrators trying to self-host Konduktor

.. figure:: ./images/architecture.png
:width: 80%
:align: center
:alt: Trainy
If you are interested in our managed offering, please contact us at [email protected].

------------
Key Features
------------

- 🚀 Easy scale-out, job queueing, and multi-node scheduling

.. code-block:: shell
# create a request
$ sky launch -c dev task.yaml --num-nodes 100
- ☁ Multi-cloud access

.. code-block:: shell
# toggle cluster via region
$ sky launch -c dev task.yaml --region gke-cluster
- Custom container support

.. code-block:: yaml

   # task.yaml
   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
   run: |
     python train.py
- `Track active and pending jobs and utilization, power usage, etc. <admin/observability.html>`_
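
The utilization and power numbers come from the DCGM exporter, and pending-job counts from Kueue. As a rough sketch (assuming the stock :code:`kueue_pending_workloads`, :code:`DCGM_FI_DEV_GPU_UTIL`, and :code:`DCGM_FI_DEV_POWER_USAGE` metrics; the rule names are hypothetical), cluster-wide rollups can be precomputed with Prometheus recording rules:

.. code-block:: yaml

   # Illustrative recording rules aggregating per-queue and per-GPU metrics
   groups:
     - name: konduktor-usage
       rules:
         - record: cluster:pending_workloads:sum
           expr: sum(kueue_pending_workloads)
         - record: cluster:gpu_utilization:avg
           expr: avg(DCGM_FI_DEV_GPU_UTIL)
         - record: cluster:gpu_power_watts:sum
           expr: sum(DCGM_FI_DEV_POWER_USAGE)
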

----------------------------
Managed Features and Roadmap
----------------------------
- On-prem/reserved support - **Available** ✅
- GCP on-demand/spot support - **Available** ✅
- AWS on-demand/spot support - In progress 🚧
- Azure on-demand/spot support - In progress 🚧
- Multi-cluster submission - In progress 🚧

Documentation
-------------

.. toctree::
:maxdepth: 1
:caption: Cluster Administration

admin/installation
admin/observability
admin/controller
:caption: Managed Konduktor

cloud/getting_started

.. toctree::
:maxdepth: 1
:caption: Job Scheduling

usage/quickstart
usage/priorities

.. toctree::
:maxdepth: 1
:caption: Self-hosted Cluster Administration

admin/installation
admin/observability
admin/controller
admin/architecture


External Links
98 changes: 98 additions & 0 deletions docs/source/usage/priorities.rst
@@ -0,0 +1,98 @@
.. _priorities:

==============================
Job Priorities and Pre-emption
==============================

Job priority allows teams to enqueue development workloads while letting users preempt lower priority
jobs to free up resources for mission-critical, high-priority workloads. This page explains how to use
job priorities with `Kueue <https://kueue.sigs.k8s.io/>`_ and `SkyPilot <https://skypilot.readthedocs.io/en/latest/>`_.

This tutorial requires that you install:

- Trainy skypilot: :code:`pip install trainy-skypilot-nightly[kubernetes]`
- `kubectl <https://kubernetes.io/docs/reference/kubectl/>`_

---------------------------------------------
Example: Using Skypilot with Kueue Priorities
---------------------------------------------

Assuming your cluster administrator has provisioned GPU instances and given you
quota within your cluster, you can request GPUs by specifying:

- Workload queue: :code:`kueue.x-k8s.io/queue-name: user-queue`
- Workload priority: :code:`kueue.x-k8s.io/priority-class: low-priority`
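
The priority class names themselves are defined by your cluster administrator as Kueue :code:`WorkloadPriorityClass` objects. As a minimal sketch (the names and values here are illustrative and may differ on your cluster):

.. code-block:: yaml

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority
   value: 100                 # higher value wins during preemption
   description: "high priority workloads"
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority
   value: 10
   description: "preemptible development workloads"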

Let's define a request for a single :code:`T4:4` instance.

.. code-block:: yaml
   :emphasize-lines: 5-6

   # low.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is assigned by your admin
       kueue.x-k8s.io/priority-class: low-priority # specify low priority workload
   run: |
     echo "hi i'm a low priority job"
     sleep 1000000
Now you can launch the request:

.. code-block:: console
# launch a low priority task
$ sky launch -y -d -c low low.yaml
# list workloads in kueue
$ kubectl get workloads
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
low-3ce1 user-queue 5m
While this workload is running, we can enqueue another, higher-priority task. If there is room in the cluster
to fulfill the higher priority workload by preempting lower priority jobs, Kueue will delete the lower
priority workloads and launch the higher priority ones instead.

.. code-block:: yaml
   :emphasize-lines: 5-6

   # high.yaml
   resources:
     accelerators: T4:4
     labels:
       kueue.x-k8s.io/queue-name: user-queue # this is the same as the queue above
       kueue.x-k8s.io/priority-class: high-priority # specify high-priority workload
   run: |
     echo "hi i'm a high priority job"
     sleep 1000000
Now you can launch the request again, this time with high priority:

.. code-block:: console
# launch a high priority task
$ sky launch -y -d -c high high.yaml
# list workloads in kueue
$ kubectl get workloads
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
high-3ce1 user-queue 2m
.. tip::

Pre-empted tasks are not requeued by default if you use :code:`sky launch`. To have your jobs retried,
we recommend using :code:`sky jobs launch` instead, so that when a task is pre-empted, the SkyPilot
job controller automatically resubmits your task to Kueue without manual intervention.

References
----------

- `The original guide <https://github.com/skypilot-org/skypilot/tree/k8s_kueue_example/examples/kueue>`_ by Romil Bhardwaj.
- `Kueue priority docs <https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/>`_
