Grafana Dashboards for JupyterHub

Grafana Dashboards for use with Zero to JupyterHub on Kubernetes

Grafana Dashboard Screencast

What?

Grafana dashboards displaying prometheus metrics are extremely useful in diagnosing issues on Kubernetes clusters running JupyterHub. However, everyone has to build their own dashboards - there isn't an easy way to standardize them across many clusters run by many entities.

This project provides some standard Grafana Dashboards as Code to help with this. It uses jsonnet and grafonnet to generate dashboards completely via code. This can then be deployed on any Grafana instance!

Pre-requisites

  1. Locally, you need to have jsonnet installed. The grafonnet library is already vendored in, using jsonnet-bundler.

  2. A recent version of prometheus installed on your cluster. Currently, it is assumed that your prometheus instance is installed using the prometheus helm chart, with kube-state-metrics, node-exporter and cadvisor enabled. In addition, you should scrape metrics from the hub instance as well.

  3. A recent version of Grafana, with a prometheus data source already added.

  4. An API key with 'admin' permissions. This is per-organization, and you can make a new one by going to the configuration pane for your Grafana (the gear icon on the left bar), and selecting 'API Keys'. The admin permission is needed to query the list of data sources so we can auto-populate template variable options (such as the list of hubs).
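
To check that an API key can actually see your data sources before deploying, something like the following sketch may help. It assumes Grafana's HTTP API endpoint /api/datasources and bearer-token authentication; the function names are hypothetical and not part of this repository.

```python
import json
import urllib.request


def build_datasources_request(grafana_url, token):
    """Build a GET request for Grafana's /api/datasources endpoint."""
    url = grafana_url.rstrip("/") + "/api/datasources"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


def prometheus_datasource_names(datasources):
    """Return the names of data sources whose type is 'prometheus'."""
    return [ds["name"] for ds in datasources if ds.get("type") == "prometheus"]


def list_prometheus_datasources(grafana_url, token):
    """Fetch all data sources and keep the prometheus ones (needs network)."""
    request = build_datasources_request(grafana_url, token)
    with urllib.request.urlopen(request) as response:
        return prometheus_datasource_names(json.load(response))
```

Calling list_prometheus_datasources("<your-grafana-url>", token) with the API key as the token should return the prometheus data source names if the key has sufficient permissions; an HTTP 401 or 403 error suggests the key is missing or lacks the needed role.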

Deployment

There's a helper deploy.py script that can deploy the dashboards to any grafana installation.

export GRAFANA_TOKEN="<API-TOKEN-FOR-YOUR-GRAFANA>"
./deploy.py <your-grafana-url>

This creates a folder called 'JupyterHub Default Dashboards' in your grafana, and adds a couple of dashboards to it.

If your Grafana deployment has more than one data source, then in addition to the default dashboards in the dashboards directory, you should also consider deploying the dashboards in the global-dashboards directory.

export GRAFANA_TOKEN="<API-TOKEN-FOR-YOUR-GRAFANA>"
./deploy.py <your-grafana-url> --dashboards-dir global-dashboards

The global dashboards use the list of data sources available in your Grafana and build dashboards across all of them.

NOTE: ANY CHANGES YOU MAKE VIA THE GRAFANA UI WILL BE OVERWRITTEN NEXT TIME YOU RUN deploy.py. TO MAKE CHANGES, EDIT THE JSONNET FILE AND DEPLOY AGAIN.

Prometheus chart version 14.* or newer

If you are using version 14.* or newer of the prometheus chart, then additional configuration for kube-state-metrics needs to be provided, because v2.0 of the kube-state-metrics chart that comes with the latest prometheus chart doesn't add any labels by default.

Since these dashboards assume the existence of such labels for pods and nodes, we need to explicitly configure prometheus to track them by populating the list at prometheus.kube-state-metrics.metricLabelsAllowlist.

prometheus:
   kube-state-metrics:
      metricLabelsAllowlist:
         # to select jupyterhub component pods and get the hub usernames
         - pods=[app,component,hub.jupyter.org/username]
         # allowing all labels is probably fine for nodes, since they don't churn much, unlike pods
         - nodes=[*]

Prometheus older than 14.*

If you're using a prometheus chart older than version 14.*, then you can deploy the dashboards available prior to the upgrade, in the 1.0 tag.

Upgrading grafonnet version

The grafonnet jsonnet library is bundled here with jsonnet-bundler. Running jb update in the git repo root dir after installing jsonnet-bundler should bring you up to date.

Metrics guidelines

Interpreting prometheus metrics and writing PromQL queries that serve a particular purpose can be difficult. Here are some guidelines to help.

Container memory usage metric

"When will the OOM killer start killing processes in this container?" is the most useful thing for us to know when measuring container memory usage. Of the many container memory metrics, container_memory_working_set_bytes tracks this (see this blog post and this issue). So prefer using that metric as the default for 'memory usage' unless specific reasons exist for using a different metric.
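
As an illustration of that default, the sketch below builds a per-pod memory usage query around container_memory_working_set_bytes. The helper name and its namespace_regex parameter are hypothetical; the name!="" filter follows the double-counting advice given further down in this document.

```python
def pod_memory_usage_query(namespace_regex=".*"):
    """Return a PromQL query for working-set memory per pod."""
    # Exclude the duplicate series that carry an empty name label.
    selector = f'name!="", namespace=~"{namespace_regex}"'
    return (
        "sum(\n"
        f"    container_memory_working_set_bytes{{{selector}}}\n"
        ") by (namespace, pod)"
    )
```

The returned string can be pasted into a Grafana panel or the prometheus expression browser.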

Available metrics

The most common prometheus on kubernetes setup in the JupyterHub community seems to be the prometheus helm chart.

  1. kube-state-metrics (metrics documentation) collects information about various kubernetes objects (pods, services, etc) by scraping the kubernetes API. Anything you can get via kubectl commands, you can probably get via a metric here. Very helpful as a way to query other metrics based on the kubernetes object they represent (like pod, node, etc).

  2. node-exporter (metrics documentation) collects information about each node - CPU usage, memory, disk space, etc. Since hostnames are usually random, you usually join these metrics with kube-state-metrics node metrics to get useful information out. If you are running a manual NFS server, it is recommended to run a node-exporter instance there as well to collect server metrics.

  3. cadvisor (metrics documentation) collects information about each container. Join these with pod metrics from kube-state-metrics for useful queries.

  4. jupyterhub (metrics documentation) collects information directly from the JupyterHubs.

  5. Other components you have installed on your cluster - like prometheus, nginx-ingress, etc - will also emit their own metrics.
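
The join between cadvisor and kube-state-metrics mentioned above can be sketched as a small query builder. It assumes that kube-state-metrics exposes allowed pod labels on kube_pod_labels with a label_ prefix (so the component label becomes label_component), that user pods carry component=singleuser-server, and that your prometheus is recent enough to support the group() aggregator. All of these are assumptions about your cluster, and the function name is hypothetical.

```python
def memory_by_component_query(component="singleuser-server"):
    """Return PromQL joining cadvisor memory with kube-state-metrics pod labels."""
    # Right-hand side: one series (value 1) per pod that carries the label.
    pods = (
        f'group(kube_pod_labels{{label_component="{component}"}})'
        " by (namespace, pod)"
    )
    # Left-hand side: cadvisor memory per pod, excluding the name="" duplicates.
    memory = 'sum(container_memory_working_set_bytes{name!=""}) by (namespace, pod)'
    # Multiplying by 1 keeps only pods present on both sides of the join.
    return f"{memory}\n  * on (namespace, pod)\n{pods}"
```

Because the right-hand side always has the value 1, the multiplication leaves the memory values unchanged and only acts as a filter on which pods are kept.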

Avoid double-counting container metrics

It seems that one container's resource metrics can be reported multiple times, with an empty name label and a name=k8s_... label. Because of this, if we do sum(container_resource_metric) by (pod), we will often get twice the actual resource consumption of a given pod. Since name="" is always redundant, make sure to exclude this in any query that includes a sum across container metrics. For example:

sum(
    irate(container_cpu_usage_seconds_total{name!=""}[5m])
) by (namespace, pod)
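
To try a query like this outside Grafana, one option is the prometheus HTTP API. The sketch below assumes the standard /api/v1/query instant-query endpoint; the function name is hypothetical, and the base URL you pass in depends on how your prometheus server is exposed.

```python
import urllib.parse

# The example query from above, with the name="" duplicates excluded.
CPU_QUERY = (
    "sum(\n"
    '    irate(container_cpu_usage_seconds_total{name!=""}[5m])\n'
    ") by (namespace, pod)"
)


def instant_query_url(prometheus_url, query):
    """Build the URL for a prometheus instant query."""
    params = urllib.parse.urlencode({"query": query})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params
```

Fetching the resulting URL (for example with curl) returns JSON; comparing the result with and without the name!="" filter is a quick way to see whether your cluster reports the duplicate series.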