GPU Metrics Exporter

A tool to collect GPU metrics from DCGM Exporter instances and forward them to cast.ai.

How it works

The exporter can run as a sidecar to the DCGM DaemonSet, or as a single-instance service in the cluster. When it runs as a sidecar, DCGM_HOST should be set. In this case it scrapes metrics only from that particular DCGM instance and sends them to cast.ai.

If it is deployed as a single instance in the cluster, it will automatically discover the DCGM instances and scrape the metrics from them. If the DCGM instances have some custom labels, make sure to properly set the DCGM_LABELS environment variable.

It is also possible to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine.

Scraped metrics

Make sure that these fields are exposed by DCGM exporter as metrics:

DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE
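
One way to confirm that the fields listed above are exposed is to inspect the DCGM exporter's metrics output. The sketch below is a hypothetical helper and not part of this repository: the function name `missing_dcgm_fields` is made up for illustration, and it assumes the exporter serves metrics in the Prometheus text format.

```shell
#!/bin/sh
# Fields this README lists as required.
REQUIRED_FIELDS="
DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE
"

# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints every required field that has no sample line.
# Returns 0 when nothing is missing, 1 otherwise.
missing_dcgm_fields() {
  _metrics="$(cat)"
  _missing=0
  for _field in $REQUIRED_FIELDS; do
    # A sample line starts with the metric name followed by "{" or a space.
    if ! printf '%s\n' "$_metrics" | grep -Eq "^${_field}(\{| )"; then
      echo "$_field"
      _missing=1
    fi
  done
  return "$_missing"
}
```

For example, assuming the DCGM exporter is reachable on port 9400 (its usual default; adjust to your setup), `curl -s http://<dcgm-host>:9400/metrics | missing_dcgm_fields` prints the fields that still need to be enabled in the DCGM exporter's metrics configuration.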

Installation

Helm

Cloning this repository

You can clone this repository and install the chart with the following commands:

$ git clone https://github.com/castai/gpu-metrics-exporter.git
$ cd gpu-metrics-exporter/charts/gpu-metrics-exporter
$ helm install <deployment-name> -f values.yaml -f values-<k8s-provider>.yaml .

Where:

<deployment-name> is a name of your choice
<k8s-provider> is the name of the k8s provider you are using (e.g. eks, gke, aks)
- the provider-specific values file sets the proper node affinity so the DaemonSet only runs on nodes with GPUs

Adding the cast.ai repository

You can add the cast.ai repository and install the chart with the following commands:

$ helm repo add castai https://castai.github.io/charts
$ helm repo update
$ helm pull castai/gpu-metrics-exporter --untar
$ cd gpu-metrics-exporter
$ helm install --generate-name castai/gpu-metrics-exporter -f values.yaml -f values-<k8s-provider>.yaml

Configuring the installation

By default, it will be deployed as a sidecar to the DCGM exporter. If you don't want to deploy it as a sidecar, in the values.yaml file you can:

1. Set dcgmExporter.enabled to false
2. Set the DCGM_HOST and DCGM_LABELS environment variables in gpuMetricsExporter.config of the values.yaml file
   - DCGM_HOST is the address of the DCGM exporter instance
   - DCGM_LABELS is a comma-separated list of labels that the DCGM instances have

If you want to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine, you can:

1. Set dcgmExporter.useExternalHostEngine to true in the values.yaml file
2. The DCGM exporter will then try to connect to port 5555 of the node.
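
The settings above could also be kept in small override files instead of editing values.yaml directly. The following is a sketch under assumptions, not the chart's documented layout: the file names are arbitrary, the values in angle brackets are placeholders you must replace, and it assumes gpuMetricsExporter.config is a map of environment variable names to values.

```shell
#!/bin/sh
# Override for running the exporter without the bundled DCGM exporter.
# Assumption: gpuMetricsExporter.config is a map of env var name to value.
cat > values-standalone.yaml <<'EOF'
dcgmExporter:
  enabled: false
gpuMetricsExporter:
  config:
    DCGM_HOST: "<dcgm-exporter-address>"    # placeholder
    DCGM_LABELS: "<label1>,<label2>"        # placeholder, comma-separated
EOF

# Override for deploying the DCGM exporter against an existing nv-hostengine.
cat > values-external-hostengine.yaml <<'EOF'
dcgmExporter:
  useExternalHostEngine: true
EOF

# Pass one of the files as an extra -f argument, for example:
#   helm install --generate-name castai/gpu-metrics-exporter \
#     -f values.yaml -f values-<k8s-provider>.yaml -f values-standalone.yaml
```

Helm merges values files from left to right, so the last file on the command line wins where keys overlap.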
