A tool to collect GPU metrics from DCGM Exporter instances and forward them to cast.ai.
The exporter can run as a sidecar to the DCGM DaemonSet, or as a single instance service in the cluster.
When it runs as a sidecar, the DCGM_HOST environment variable should be set. In this case it will only scrape metrics from that particular instance of DCGM and send them to cast.ai.
If it is deployed as a single instance in the cluster, it will automatically discover the DCGM instances and scrape the metrics from them. If the DCGM instances have some custom labels, make sure to properly set the DCGM_LABELS environment variable.
It is also possible to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine.
Make sure that these fields are exposed by DCGM exporter as metrics:
DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE
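One way to verify the list above is to scan the exporter's metrics output for each field name. The following is a minimal sketch, not part of this project: the function name missing_dcgm_fields is hypothetical, and the usage comment assumes the DCGM exporter serves Prometheus-format text on port 9400 at /metrics and has been made reachable on localhost (for example through a port-forward).

```shell
#!/usr/bin/env bash
# Required DCGM fields, copied from the list above.
REQUIRED_DCGM_FIELDS="
DCGM_FI_PROF_SM_ACTIVE
DCGM_FI_PROF_SM_OCCUPANCY
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_FB_TOTAL
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_PCIE_LINK_GEN
DCGM_FI_DEV_PCIE_LINK_WIDTH
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_POWER_USAGE
"

# missing_dcgm_fields (hypothetical helper) reads Prometheus-format metrics
# text on stdin and prints every required field that has no sample line.
missing_dcgm_fields() {
  metrics="$(cat)"
  for field in $REQUIRED_DCGM_FIELDS; do
    # A sample line starts with the metric name followed by "{" or a space.
    if ! printf '%s\n' "$metrics" | grep -Eq "^${field}(\{| )"; then
      printf '%s\n' "$field"
    fi
  done
}

# Usage (assumption: the exporter is reachable on localhost:9400):
#   curl -s http://localhost:9400/metrics | missing_dcgm_fields
```

Empty output means every required field was found; any printed name is a field that likely needs to be enabled in the DCGM exporter's metrics configuration.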
You can clone this repository and install the chart with the following commands:
$ cd charts/gpu-metrics-exporter
$ helm install <deployment-name> -f values.yaml -f values-<k8s-provider>.yaml .
Where:
- <deployment-name> is a name of your choice
- <k8s-provider> is the name of the k8s provider you are using (e.g. eks, gke, aks). This sets the proper node affinity so the DaemonSet only runs on nodes with GPUs
Alternatively, you can add the cast.ai repository and install the chart with the following commands:
$ helm repo add castai https://castai.github.io/charts
$ helm repo update
$ helm pull castai/gpu-metrics-exporter --untar
$ cd gpu-metrics-exporter
$ helm install --generate-name castai/gpu-metrics-exporter -f values.yaml -f values-<k8s-provider>.yaml
By default, it will be deployed as a sidecar to the DCGM exporter. If you don't want to deploy it as a sidecar, in the values.yaml file you can:
- Set dcgmExporter.enabled to false
- Set the DCGM_HOST and DCGM_LABELS environment variables in gpuMetricsExporter.config of the values.yaml file
  - DCGM_HOST is the address of the DCGM exporter instance
  - DCGM_LABELS is a comma-separated list of labels that the DCGM instances have
- If you want to deploy the DCGM exporter but have it configured to read the metrics from an existing nv-hostengine, you can:
- set dcgmExporter.useExternalHostEngine to true in the values.yaml file; it will then try to connect to port 5555 of the node.
- set the