
Pod and Namespace Labels Missing in dcgm-exporter Metrics #411

Open
qimike opened this issue Oct 30, 2024 · 3 comments

Comments

qimike commented Oct 30, 2024

Issue Description
I'm using the following Helm values to deploy the dcgm-exporter pod (with the Datadog check configured through pod annotations):

image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.1.8-3.1.5-ubuntu20.04

arguments: ["-m", "monitoring:datadog-dcgm-exporter-configmap"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name:

rollingUpdate:
  maxUnavailable: 1
  maxSurge: 0

podAnnotations:
  ad.datadoghq.com/exporter.checks: |-
    {
      "dcgm": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9400/metrics"
          }
        ]
      }
    }

podSecurityContext: {}

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  annotations: {}

serviceMonitor:
  enabled: false

nodeSelector:
  node-role.kubernetes.io/worker: "true"

tolerations: []

affinity: {}

extraHostVolumes: []

extraConfigMapVolumes: []

extraVolumeMounts: []

extraEnv:
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    value: '{"pod":"pod","namespace":"namespace"}'
  - name: NVIDIA_MIG_MONITORING
    value: "1"
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

kubeletPath: "/var/lib/kubelet/pod-resources"

Additionally, I'm using the following ConfigMap and RBAC configuration:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-datadog-cm
  namespace: monitoring
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["datadog-dcgm-exporter-configmap"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-datadog
  namespace: monitoring
subjects:
- kind: ServiceAccount
  name: dcgm-datadog-dcgm-exporter
  namespace: monitoring
roleRef:
  kind: Role
  name: dcgm-exporter-read-datadog-cm
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
  namespace: monitoring
data:
  metrics: |
    # Metrics configuration
    DCGM_FI_DEV_SM_CLOCK ,gauge ,SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK ,gauge ,Memory clock frequency (in MHz).
    ...

After deploying, I noticed that the pod and namespace labels appear to be empty in the exported metrics. Here is an example metric output:
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-0f039abb-366b-4158-f72f-04a0a30cc631",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",Hostname="lambda-hyperplane01",DCGM_FI_CUDA_DRIVER_VERSION="12010",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="2",DCGM_FI_DEV_NAME="NVIDIA A100-SXM4-80GB",DCGM_FI_DEV_SERIAL="1324521023176",DCGM_FI_DRIVER_VERSION="520.61.05",DCGM_FI_PROCESS_NAME="/usr/bin/dcgm-exporter",container="",namespace="",pod=""} 210

Could you please shed some light on where I might have missed a configuration setting to ensure that the pod and namespace labels are populated in the exporter?
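
For reference, my understanding is that pod/namespace attribution depends on DCGM_EXPORTER_KUBERNETES being enabled and on the kubelet pod-resources socket being mounted into the exporter container. This is a rough sketch of what I would expect the rendered DaemonSet to contain (the container and volume names here are only illustrative, not taken from the chart):

containers:
- name: exporter
  env:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  volumeMounts:
  - name: pod-gpu-resources
    mountPath: /var/lib/kubelet/pod-resources
    readOnly: true
volumes:
- name: pod-gpu-resources
  hostPath:
    path: /var/lib/kubelet/pod-resources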

@nvvfedorov
Collaborator

Another thing: to see non-empty pod and container labels in the metrics, you need workloads (pods) actually running on the corresponding GPUs.
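
For example, a minimal test pod along these lines (the image tag and names are only illustrative) requests one GPU; while it is running and holds that GPU, the pod, namespace, and container labels for the GPU should be populated:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-label-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "600"]
    resources:
      limits:
        nvidia.com/gpu: 1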

mtparet commented Nov 20, 2024

This feature is missing from the exporter; cf. #423.
