
Commit 413cfee

emphasize important code lines, remove sections for manually configuring loki and grafana
asaiacai committed Jul 16, 2024
1 parent c0a7302 commit 413cfee
Showing 4 changed files with 42 additions and 36 deletions.
15 changes: 8 additions & 7 deletions docs/source/admin/controller.rst
@@ -69,13 +69,14 @@ while the controller is running:
dmesg-2x225 1/1 Running 0 10h
$ kubectl exec -it -n dmesg-logging dmesg-2x225 -- bash
$ echo "[1235733.431527] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." > /dev/kmsg
$ echo "NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus." > /dev/kmsg
After which you should see the following in your controller logs:

.. code-block:: console
+   :emphasize-lines: 1-1,4-4
-[I 07-09 05:37:45 parse.py:128] node `gke-a3-cluster-gpu-pool-2d164072-zz64` has dmesg error: [538441.007373] [1235733.431527] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
+[I 07-09 05:37:45 parse.py:128] node `gke-a3-cluster-gpu-pool-2d164072-zz64` has dmesg error: [538441.007373] NVRM: Xid (PCI:0000:4e:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[W 07-09 05:37:45 kube_client.py:27] incluster config failed to load, attempting to use kubeconfig.
[I 07-09 05:37:45 kube_client.py:31] KUBECONFIG loaded
[I 07-09 05:37:45 node.py:98] Node gke-a3-cluster-gpu-pool-2d164072-zz64 tainted.
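To confirm the controller's action, you can inspect the node's taints (the node name follows the log above; the taint key and value shown are placeholders, since the exact taint depends on the controller's configuration):

.. code-block:: console

   $ kubectl describe node gke-a3-cluster-gpu-pool-2d164072-zz64 | grep -A1 Taints
   Taints:  <konduktor-taint-key>=<value>:NoSchedule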
@@ -99,8 +100,8 @@ You can remove all the taints in the cluster with :code:`konduktor reset`
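For example, after the underlying issue is resolved (a minimal sketch, assuming :code:`konduktor` is installed and your kubeconfig points at the cluster):

.. code-block:: console

   # clear all konduktor-applied taints so nodes can accept work again
   $ konduktor reset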
Features and Roadmap
====================
-- :code:`dmesg` error detection - **Available**
-- In-cluster deployment of controller - In progress
-- Pod log error detection - In progress
-- Health Checks (Taint Removal) - In progress
-- Node Resolution Hooks (Reboot, Power Cycle) - In progress
+- :code:`dmesg` error detection - **Available** ✅
+- In-cluster deployment of controller - **Available** ✅
+- Pod log error detection - In progress 🚧
+- Health Checks (Taint Removal) - In progress 🚧
+- Node Resolution Hooks (Reboot, Power Cycle) - In progress 🚧
39 changes: 24 additions & 15 deletions docs/source/admin/installation.rst
@@ -80,24 +80,26 @@ To setup the monitoring stack, we're maintaining our own `default values to get
https://prometheus-community.github.io/helm-charts
# install prometheus stack
-$ helm install prometheus-community/kube-prometheus-stack --create-namespace \
-    --create-namespace \
+$ helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
+    --create-namespace \
     --namespace prometheus \
     --values kube-prometheus-stack.values
# check prometheus stack is up
$ kubectl get pods -n prometheus
-NAME                                                              READY   STATUS    RESTARTS   AGE
-alertmanager-kube-prometheus-stack-1717-alertmanager-0            2/2     Running   0          3d15h
-kube-prometheus-stack-1717-operator-6d9487489d-2vx8l              1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-grafana-6d48845b9-qf5qr          3/3     Running   0          3d15h
-kube-prometheus-stack-1717404158-kube-state-metrics-7c97ffbfxzt   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-2vh6j   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-68ldt   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-frd65   1/1     Running   0          3d15h
-kube-prometheus-stack-1717404158-prometheus-node-exporter-mxhpb   1/1     Running   0          3d15h
-prometheus-kube-prometheus-stack-1717-prometheus-0                2/2     Running   0          3d15h
+NAME                                                      READY   STATUS    RESTARTS   AGE
+alertmanager-kube-prometheus-stack-alertmanager-0         2/2     Running   0          53s
+kube-prometheus-stack-grafana-79f9ccf77-wccpt             3/3     Running   0          56s
+kube-prometheus-stack-kube-state-metrics-b7b54458-klcb4   1/1     Running   0          56s
+kube-prometheus-stack-operator-74774b4dbd-bdzsr           1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-74245      1/1     Running   0          57s
+kube-prometheus-stack-prometheus-node-exporter-8t5ct      1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-bp8cb      1/1     Running   0          57s
+kube-prometheus-stack-prometheus-node-exporter-ttj5b      1/1     Running   0          56s
+kube-prometheus-stack-prometheus-node-exporter-z8rzn      1/1     Running   0          57s
+prometheus-kube-prometheus-stack-prometheus-0             2/2     Running   0          53s
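Optionally, spot-check the Prometheus UI itself with a port-forward (the service name below matches the release installed above, but verify it with :code:`kubectl get svc -n prometheus`):

.. code-block:: console

   # then browse http://localhost:9090/targets to confirm scrape targets are healthy
   $ kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090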
OpenTelemetry-Loki Logging Stack
--------------------------------

@@ -114,8 +116,14 @@ the stack via Helm. We also deploy a daemonset to stream dmesg logs from each no
$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
$ helm repo update
-$ helm install --values loki.values loki --namespace=loki grafana/loki --create-namespace
-$ helm install --values otel.values otel-collector --namespace=otel-collector open-telemetry/opentelemetry-collector --create-namespace
+$ helm install loki grafana/loki \
+    --create-namespace \
+    --namespace=loki \
+    --values loki.values
+$ helm install otel-collector open-telemetry/opentelemetry-collector \
+    --create-namespace \
+    --namespace=otel-collector \
+    --values otel.values
$ kubectl apply -f https://raw.githubusercontent.com/Trainy-ai/konduktor/main/konduktor/manifests/dmesg_daemonset.yaml
$ kubectl get pods -n loki
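Once the pods are running, a quick probe of Loki's readiness endpoint confirms the backend is answering (this assumes the chart's default service name :code:`loki` on port 3100):

.. code-block:: console

   $ kubectl -n loki port-forward svc/loki 3100:3100 &
   $ curl http://localhost:3100/ready
   ready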
@@ -171,6 +179,7 @@ Resource quotas are defined via ClusterQueues and LocalQueues which are assigned
Within :code:`single-clusterqueue-setup.yaml`, be sure to replace :code:`<num-GPUs-in-cluster>` with the total number of GPUs in your cluster.

.. code-block:: yaml
+   :emphasize-lines: 28-28
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
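The manifest is truncated above; for orientation, here is a hedged single-ClusterQueue sketch (queue and flavor names are illustrative, so defer to :code:`single-clusterqueue-setup.yaml` in the repo for the authoritative version):

.. code-block:: yaml

   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}  # admit workloads from any namespace
     resourceGroups:
     - coveredResources: ["nvidia.com/gpu"]
       flavors:
       - name: default-flavor
         resources:
         - name: "nvidia.com/gpu"
           nominalQuota: 8  # replace with <num-GPUs-in-cluster>
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     namespace: default
     name: user-queue
   spec:
     clusterQueue: cluster-queue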
23 changes: 9 additions & 14 deletions docs/source/admin/observability.rst
@@ -23,27 +23,22 @@ Access Grafana
A local Grafana instance is deployed as part of the observability stack.
The dashboard shows an overview of the available GPUs, pending/active workloads, and overall cluster utilization.

.. code-block:: console
# get the service name
$ kubectl get svc -n prometheus | grep grafana
kube-prometheus-stack-grafana ClusterIP 10.122.81.251 <none> 80/TCP 4d2h
We can use :code:`kubectl port-forward` to access the Grafana service from our laptop. For the example above,

.. code-block:: console
$ kubectl port-forward -n prometheus svc/kube-prometheus-stack-grafana 3000:80
-In the example above, we can enter :code:`https://localhost:3000` into a browser window where it will prompt for a password.
+In the example above, we can enter :code:`http://localhost:3000/` into a browser window where it will prompt for a password.
The default username is :code:`admin` with the password being set by :code:`kube-prometheus-stack.values` in :doc:`/admin/installation`.
**Administrators should secure this endpoint as well as changing the authentication login.**
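As a hedged sketch, the login can be pinned in :code:`kube-prometheus-stack.values` (key paths follow the upstream chart's Grafana subchart; for production, prefer pointing :code:`admin.existingSecret` at a Kubernetes secret):

.. code-block:: yaml

   grafana:
     adminUser: admin
     adminPassword: <choose-a-strong-password>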

Afterwards, navigate to **Dashboards -> Konduktor** to access our provided dashboard.

Metrics Dashboard
-----------------

-After logging in, you can `import <https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/>`_ our default dashboard by either using the `JSON definition from the repo <https://github.com/Trainy-ai/konduktor/tree/main/grafana>`_ under :code:`grafana/default_grafana_dashboard.json`
-or by downloading from `our Grafana published dashboard <https://grafana.com/grafana/dashboards/21231-konduktor/>`_.
+Our metrics dashboard is included in the :code:`kube-prometheus-stack` installation, using the `JSON definition from the repo <https://github.com/Trainy-ai/konduktor/tree/main/grafana>`_ under :code:`grafana/default_grafana_dashboard.json`.
An interactive sample dashboard can be found `here <https://snapshots.raintank.io/dashboard/snapshot/qJUzCCb4nLspDAJfGKd4EexUKJEmvEvu>`_.

To monitor cluster GPU utilization, useful metrics to track include:
@@ -70,12 +65,12 @@ Node level stats include:
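As a hedged illustration of querying these metrics (assuming the NVIDIA DCGM exporter is scraping GPU stats; metric and label names vary across exporter versions):

.. code-block:: promql

   # cluster-wide average GPU utilization
   avg(DCGM_FI_DEV_GPU_UTIL)

   # per-node breakdown; the grouping label depends on your scrape config
   avg by (node) (DCGM_FI_DEV_GPU_UTIL)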
Reading Logs
------------

Grafana provides views for querying and filtering logs from pods and nodes.
-First `add Loki as a data source <https://grafana.com/docs/loki/latest/visualize/grafana/>`_,
-setting the URL to be :code:`http://loki.loki.svc.cluster.local:3100` and create a new dashboard
-with your newly created Loki datasource and begin querying your logs by node, namespace, etc.
+Included in the installation is a Loki logging backend and datasource.

Our default dashboard includes a panel listing error logs from pods in the :code:`default` namespace,
as well as (S)Xid errors surfaced by following :code:`dmesg` on each node. You can also run arbitrary
`LogQL <https://grafana.com/docs/loki/latest/query/>`_ queries by visiting the **Explore** tab.
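For instance, a minimal LogQL sketch (the label name here is an assumption; label mappings depend on the OpenTelemetry collector configuration, so verify them in the **Explore** label browser):

.. code-block:: text

   # error logs from pods in the default namespace
   {k8s_namespace_name="default"} |= "error"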


.. figure:: ../images/otel-loki.png
:width: 120%
1 change: 1 addition & 0 deletions docs/source/usage/quickstart.rst
@@ -68,6 +68,7 @@ To scale up the job size over multiple nodes, we just change :code:`task.yaml` t
We define a script for each node to run.

.. code-block:: yaml
+   :emphasize-lines: 12-12,22-23,25-25
resources:
image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
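The snippet is truncated above; as a rough sketch of the multi-node shape (everything past :code:`image_id` is an illustrative assumption, so see the full :code:`task.yaml` in the quickstart for the authoritative fields):

.. code-block:: yaml

   resources:
     image_id: docker:nvcr.io/nvidia/pytorch:23.10-py3
     accelerators: H100:8   # assumption: 8 GPUs requested per node

   num_nodes: 2             # run the script below on two nodes

   run: |
     # placeholder launch command; the real example wires rank and
     # master address from scheduler-provided environment variables
     torchrun --nnodes=2 --nproc_per_node=8 train.py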
