
Commit 413cfee

emphasize important code lines, remove sections for manually configuring loki and grafana
asaiacai committed Jul 16, 2024
1 parent c0a7302 commit 413cfee
Showing 4 changed files with 42 additions and 36 deletions.
15 changes: 8 additions & 7 deletions docs/source/admin/controller.rst
@@ -69,13 +69,14 @@ while the controller is running:
dmesg-2x225 1/1 Running 0 10h
$ kubectl exec -it -n dmesg-logging dmesg-2x225 -- bash
$ echo "[1235733.431527] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." > /dev/kmsg
$ echo "NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." > /dev/kmsg
After which you should see the following in your controller logs:

.. code-block:: console
+   :emphasize-lines: 1-1,4-4
-[I 07-09 05:37:45 parse.py:128] node `gke-a3-cluster-gpu-pool-2d164072-zz64` has dmesg error: [538441.007373] [1235733.431527] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
+[I 07-09 05:37:45 parse.py:128] node `gke-a3-cluster-gpu-pool-2d164072-zz64` has dmesg error: [538441.007373] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[W 07-09 05:37:45 kube_client.py:27] incluster config failed to load, attempting to use kubeconfig.
[I 07-09 05:37:45 kube_client.py:31] KUBECONFIG loaded
[I 07-09 05:37:45 node.py:98] Node gke-a3-cluster-gpu-pool-2d164072-zz64 tainted.
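To confirm the controller's action, you can inspect the node's taints (the node name follows the log above; the taint key and value shown are placeholders, since the exact taint depends on the controller's configuration):

.. code-block:: console

   $ kubectl describe node gke-a3-cluster-gpu-pool-2d164072-zz64 | grep -A1 Taints
   Taints:  <konduktor-taint-key>=<value>:NoSchedule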
@@ -99,8 +100,8 @@ You can remove all the taints in the cluster with :code:`konduktor reset`
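For example, after the underlying issue is resolved (a minimal sketch, assuming :code:`konduktor` is installed and your kubeconfig points at the cluster):

.. code-block:: console

   # clear all konduktor-applied taints so nodes can accept work again
   $ konduktor reset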
Features and Roadmap
====================
-- :code:`dmesg` error detection - **Available**
-- In-cluster deployment of controller - In progress
-- Pod log error detection - In progress
-- Health Checks (Taint Removal) - In progress
-- Node Resolution Hooks (Reboot, Power Cycle) - In progress
+- :code:`dmesg` error detection - **Available** ✅
+- In-cluster deployment of controller - **Available** ✅
+- Pod log error detection - In progress 🚧
+- Health Checks (Taint Removal) - In progress 🚧
+- Node Resolution Hooks (Reboot, Power Cycle) - In progress 🚧
39 changes: 24 additions & 15 deletions docs/source/admin/installation.rst
@@ -80,24 +80,26 @@ To setup the monitoring stack, we're maintaining our own `default values to get
https://prometheus-community.github.io/helm-charts
# install prometheus stack
-$ helm install prometheus-community/kube-prometheus-stack --create-namespace \
-    --create-namespace \
+$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
+    --create-namespace \
     --namespace prometheus \
     --values kube-prometheus-stack.values
# check prometheus stack is up
$ kubectl get pods -n prometheus
-NAME                                                              READY   STATUS    RESTARTS   AGE
-alertmanager-kube-prometheus-stack-1717-alertmanager-0            2/2     Running   0          3d15h
-kube-prometheus-stack-1717-operator-6d9487489d-2vx8l              1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-grafana-6d48845b9-qf5qr          3/3     Running   0          3d15h
-kube-prometheus-stack-1717404158-kube-state-metrics-7c97ffbfxzt   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-2vh6j   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-68ldt   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-frd65   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-mxhpb   1/1     Running   0          3d15h
-prometheus-kube-prometheus-stack-1717-prometheus-0                2/2     Running   0          3d15h
+NAME                                                      READY   STATUS    RESTARTS   AGE
+alertmanager-kube-prometheus-stack-alertmanager-0         2/2     Running   0          53s
+kube-prometheus-stack-grafana-79f9ccf77-wccpt             3/3     Running   0          56s
+kube-prometheus-stack-kube-state-metrics-b7b54458-klcb4   1/1     Running   0          56s
+kube-prometheus-stack-operator-74774b4dbd-bdzsr           1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-74245      1/1     Running   0          57s
+kube-prometheus-stack-prometheus-node-exporter-8t5ct      1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-bp8cb      1/1     Running   0          57s
+kube-prometheus-stack-prometheus-node-exporter-ttj5b      1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-z8rzn      1/1     Running   0          57s
+prometheus-kube-prometheus-stack-prometheus-0             2/2     Running   0          53s
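Optionally, spot-check the Prometheus UI itself with a port-forward (the service name below matches the release installed above, but verify it with :code:`kubectl get svc -n prometheus`):

.. code-block:: console

   # then browse http://localhost:9090/targets to confirm scrape targets are healthy
   $ kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090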
OpenTelemetry-Loki Logging Stack
--------------------------------

@@ -114,8 +116,14 @@ the stack via Helm. We also deploy a daemonset to stream dmesg logs from each no
$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
$ helm repo update
-$ helm install --values loki.values loki --namespace=loki grafana/loki --create-namespace
-$ helm install --values otel.values otel-collector --namespace=otel-collector open-telemetry/opentelemetry-collector --create-namespace
+$ helm install loki grafana/loki \
+    --create-namespace \
+    --namespace=loki \
+    --values loki.values
+$ helm install otel-collector open-telemetry/opentelemetry-collector \
+    --create-namespace \
+    --namespace=otel-collector \
+    --values otel.values
$ kubectl apply -f https://raw.githubusercontent.com/Trainy-ai/konduktor/main/konduktor/manifests/dmesg_daemonset.yaml
$ kubectl get pods -n loki
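Once the pods are running, a quick probe of Loki's readiness endpoint confirms the backend is answering (this assumes the chart's default service name :code:`loki` on port 3100):

.. code-block:: console

   $ kubectl -n loki port-forward svc/loki 3100:3100 &
   $ curl http://localhost:3100/ready
   ready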
@@ -171,6 +179,7 @@ Resource quotas are defined via ClusterQueues and LocalQueues which are assigned
Within :code:`single-clusterqueue-setup.yaml`, be sure to replace :code:`<num-GPUs-in-cluster>` with the total number of GPUs in your cluster.

.. code-block:: yaml
+   :emphasize-lines: 28-28
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
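The manifest is truncated above; for orientation, here is a hedged single-ClusterQueue sketch (queue and flavor names are illustrative, so defer to :code:`single-clusterqueue-setup.yaml` in the repo for the authoritative version):

.. code-block:: yaml

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}  # admit workloads from any namespace
     resourceGroups:
     - coveredResources: ["nvidia.com/gpu"]
       flavors:
       - name: default-flavor
         resources:
         - name: "nvidia.com/gpu"
           nominalQuota: 8  # replace with <num-GPUs-in-cluster>
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     namespace: default
     name: user-queue
   spec:
     clusterQueue: cluster-queue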
23 changes: 9 additions & 14 deletions docs/source/admin/observability.rst
@@ -23,27 +23,22 @@ Access Grafana
A local Grafana instance is deployed as part of the observability stack.
The dashboard shows an overview of the available GPUs, pending/active workloads, and overall cluster utilization.

.. code-block:: console
# get the service name
$ kubectl get svc -n prometheus | grep grafana
kube-prometheus-stack-grafana ClusterIP 10.122.81.251 <none> 80/TCP 4d2h
We can use :code:`kubectl port-forward` to access the Grafana service from our laptop. For the example above,

.. code-block:: console
$ kubectl port-forward -n prometheus svc/kube-prometheus-stack-grafana 3000:80
-In the example above, we can enter :code:`https://localhost:3000` into a browser window where it will prompt for a password.
+In the example above, we can enter :code:`http://localhost:3000/` into a browser window where it will prompt for a password.
The default username is :code:`admin` with the password being set by :code:`kube-prometheus-stack.values` in :doc:`/admin/installation`.
**Administrators should secure this endpoint as well as changing the authentication login.**
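As a hedged sketch, the login can be pinned in :code:`kube-prometheus-stack.values` (key paths follow the upstream chart's Grafana subchart; for production, prefer pointing :code:`admin.existingSecret` at a Kubernetes secret):

.. code-block:: yaml

   grafana:
     adminUser: admin
     adminPassword: <choose-a-strong-password>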

Afterwards, navigate to **Dashboards -> Konduktor** to access our provided dashboard.

Metrics Dashboard
-----------------

-After logging in, you can `import <https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/>`_ our default dashboard by either using the `JSON definition from the repo <https://github.com/Trainy-ai/konduktor/tree/main/grafana>`_ under :code:`grafana/default_grafana_dashboard.json`
-or by downloading from `our Grafana published dashboard <https://grafana.com/grafana/dashboards/21231-konduktor/>`_.
+Our metrics dashboard is included in the :code:`kube-prometheus-stack` installation, using the `JSON definition from the repo <https://github.com/Trainy-ai/konduktor/tree/main/grafana>`_ under :code:`grafana/default_grafana_dashboard.json`.
An interactive sample dashboard can be found `here <https://snapshots.raintank.io/dashboard/snapshot/qJUzCCb4nLspDAJfGKd4EexUKJEmvEvu>`_.

To monitor cluster GPU utilization, useful metrics to track include:
@@ -70,12 +65,12 @@ Node level stats include:
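As a hedged illustration of querying these metrics (assuming the NVIDIA DCGM exporter is scraping GPU stats; metric and label names vary across exporter versions):

.. code-block:: promql

   # cluster-wide average GPU utilization
   avg(DCGM_FI_DEV_GPU_UTIL)

   # per-node breakdown; the grouping label depends on your scrape config
   avg by (node) (DCGM_FI_DEV_GPU_UTIL)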
Reading Logs
------------

Grafana provides views for querying and filtering logs from pods and nodes.
-First `add Loki as a data source <https://grafana.com/docs/loki/latest/visualize/grafana/>`_,
-setting the URL to be :code:`http://loki.loki.svc.cluster.local:3100` and create a new dashboard
-with your newly created Loki datasource and begin querying your logs by node, namespace, etc.
+Included in the installation is a Loki logging backend and datasource.

Our default dashboard includes a panel listing error logs from pods in the :code:`default` namespace,
as well as (S)Xid errors surfaced by following :code:`dmesg` on each node. You can also run arbitrary
`LogQL <https://grafana.com/docs/loki/latest/query/>`_ queries by visiting the **Explore** tab.
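For instance, a minimal LogQL sketch (the label name here is an assumption; label mappings depend on the OpenTelemetry collector configuration, so verify them in the **Explore** label browser):

.. code-block:: text

   # error logs from pods in the default namespace
   {k8s_namespace_name="default"} |= "error"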


.. figure:: ../images/otel-loki.png
:width: 120%
1 change: 1 addition & 0 deletions docs/source/usage/quickstart.rst
@@ -68,6 +68,7 @@ To scale up the job size over multiple nodes, we just change :code:`task.yaml` t
We define a script for each node to run.

.. code-block:: yaml
+   :emphasize-lines: 12-12,22-23,25-25
resources:
image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
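The snippet is truncated above; as a rough sketch of the multi-node shape (everything past :code:`image_id` is an illustrative assumption, so see the full :code:`task.yaml` in the quickstart for the authoritative fields):

.. code-block:: yaml

   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
     accelerators: H100:8   # assumption: 8 GPUs requested per node

   num_nodes: 2             # run the script below on two nodes

   run: |
     # placeholder launch command; the real example wires rank and
     # master address from scheduler-provided environment variables
     torchrun --nnodes=2 --nproc_per_node=8 train.py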
