
Pod and Namespace Labels Missing in dcgm-exporter Metrics #411

Open
qimike opened this issue Oct 30, 2024 · 3 comments

Comments

qimike commented Oct 30, 2024

Issue Description
I'm using the following Helm values to deploy the dcgm-exporter pod (with the Datadog check configured through pod annotations):

image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.1.8-3.1.5-ubuntu20.04

arguments: ["-m", "monitoring:datadog-dcgm-exporter-configmap"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name:

rollingUpdate:
  maxUnavailable: 1
  maxSurge: 0

podAnnotations:
  ad.datadoghq.com/exporter.checks: |-
    {
      "dcgm": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9400/metrics"
          }
        ]
      }
    }

podSecurityContext: {}

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  annotations: {}

serviceMonitor:
  enabled: false

nodeSelector:
  node-role.kubernetes.io/worker: "true"

tolerations: []

affinity: {}

extraHostVolumes: []

extraConfigMapVolumes: []

extraVolumeMounts: []

extraEnv:
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    value: '{"pod":"pod","namespace":"namespace"}'
  - name: NVIDIA_MIG_MONITORING
    value: "1"
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

kubeletPath: "/var/lib/kubelet/pod-resources"

Additionally, I'm using the following ConfigMap and RBAC configuration:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-datadog-cm
  namespace: monitoring
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["datadog-dcgm-exporter-configmap"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-datadog
  namespace: monitoring
subjects:
- kind: ServiceAccount
  name: dcgm-datadog-dcgm-exporter
  namespace: monitoring
roleRef:
  kind: Role
  name: dcgm-exporter-read-datadog-cm
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
  namespace: monitoring
data:
  metrics: |
    # Metrics configuration
    DCGM_FI_DEV_SM_CLOCK ,gauge ,SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK ,gauge ,Memory clock frequency (in MHz).
    ...

After deploying, I noticed that the pod and namespace labels appear to be empty in the exported metrics. Here is an example metric output:
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-0f039abb-366b-4158-f72f-04a0a30cc631",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",Hostname="lambda-hyperplane01",DCGM_FI_CUDA_DRIVER_VERSION="12010",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="2",DCGM_FI_DEV_NAME="NVIDIA A100-SXM4-80GB",DCGM_FI_DEV_SERIAL="1324521023176",DCGM_FI_DRIVER_VERSION="520.61.05",DCGM_FI_PROCESS_NAME="/usr/bin/dcgm-exporter",container="",namespace="",pod=""} 210

Could you please shed some light on where I might have missed a configuration setting to ensure that the pod and namespace labels are populated in the exporter?
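
For reference, my understanding is that pod/namespace attribution depends on DCGM_EXPORTER_KUBERNETES being enabled and on the kubelet pod-resources socket being mounted into the exporter container. This is a rough sketch of what I would expect the rendered DaemonSet to contain (the container and volume names here are only illustrative, not taken from the chart):

containers:
- name: exporter
  env:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  volumeMounts:
  - name: pod-gpu-resources
    mountPath: /var/lib/kubelet/pod-resources
    readOnly: true
volumes:
- name: pod-gpu-resources
  hostPath:
    path: /var/lib/kubelet/pod-resources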

@nvvfedorov
Collaborator

Another thing: to see non-empty pod and container labels in the metrics, you need workloads (pods) actually running on the corresponding GPUs.
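
For example, a minimal test pod along these lines (the image tag and names are only illustrative) requests one GPU; while it is running and holds that GPU, the pod, namespace, and container labels for the GPU should be populated:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-label-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "600"]
    resources:
      limits:
        nvidia.com/gpu: 1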

mtparet commented Nov 20, 2024

This feature is missing from the exporter; cf. #423.
