
pod metric discontinuity #428

Open
ltm920716 opened this issue Dec 2, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@ltm920716

What is the version?

newest

What happened?

The pod metrics are discontinuous, as shown below:
Image

What did you expect to happen?

Continuous pod metrics

What is the GPU model?

No response

What is the environment?

pod

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

@ltm920716 ltm920716 added the bug Something isn't working label Dec 2, 2024
@ltm920716
Author

Image

@nvvfedorov
Collaborator

The dcgm-exporter works in the following way:

  1. When no pod is running on and using a specific GPU, dcgm-exporter returns metrics for that GPU without a pod label.
  2. When a pod is running on and using a specific GPU, dcgm-exporter returns metrics with a pod label, indicating that the GPU is being used by that pod.

In other words, when you see metrics without a pod label, it means no pod is running on that GPU at the moment.
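
One way to check which case you are in is to scrape the exporter's /metrics endpoint directly and look at the label sets over time. Below is a minimal Go sketch, not part of dcgm-exporter; the address (localhost:9400), the DCGM_FI_DEV_GPU_UTIL field, and the "pod"/"namespace"/"gpu" label names are assumptions based on a default deployment.

// check_pod_labels.go - a minimal sketch: scrape dcgm-exporter's /metrics
// endpoint and print, for DCGM_FI_DEV_GPU_UTIL, which samples carry a
// "pod" label. Address, metric name, and label names are assumptions.
package main

import (
	"fmt"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// sampleValue handles gauge, counter, or untyped exposition of a DCGM field.
func sampleValue(m *dto.Metric) float64 {
	switch {
	case m.GetGauge() != nil:
		return m.GetGauge().GetValue()
	case m.GetCounter() != nil:
		return m.GetCounter().GetValue()
	default:
		return m.GetUntyped().GetValue()
	}
}

func main() {
	resp, err := http.Get("http://localhost:9400/metrics") // assumed exporter address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	fam, ok := families["DCGM_FI_DEV_GPU_UTIL"]
	if !ok {
		fmt.Println("DCGM_FI_DEV_GPU_UTIL is not exposed")
		return
	}
	for _, m := range fam.GetMetric() {
		labels := map[string]string{}
		for _, lp := range m.GetLabel() {
			labels[lp.GetName()] = lp.GetValue()
		}
		fmt.Printf("gpu=%s pod=%q namespace=%q util=%v\n",
			labels["gpu"], labels["pod"], labels["namespace"], sampleValue(m))
	}
}

Running this in a loop while the workload is active shows whether the pod label itself disappears or whether the whole series stops being exposed.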

@ltm920716
Author

In other words, when you see metrics without a pod label, it means no pod is running on that GPU at the moment.

hi @nvvfedorov

thanks for your reply, but the pods in the image are all using GPUs. Some are LLM services that occupy a certain amount of GPU memory (their utilization may not always be > 0), and some are training jobs whose utilization is always > 0. You can see that the metrics disappear at intervals, and that is the point.

@nvvfedorov
Collaborator

@ltm920716, What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration and version? The k8s-device-plugin is the source of information about which pods are mapped to which GPUs.

You can troubleshoot by building this utility: https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main. Unfortunately, kubectl doesn't provide commands to list the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :( If you have access to the K8S node where you run the workload, try running the client on that node.
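
If building that tool is inconvenient, a minimal Go client along the following lines should produce the same information. This is only a sketch: it assumes the default kubelet socket path /var/lib/kubelet/pod-resources/kubelet.sock and the v1alpha1 API mentioned above.

// list_pod_resources.go - a minimal sketch of a podresources API client.
// The socket path below is the kubelet default and is an assumption;
// adjust it for your node if needed.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresources "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's pod-resources gRPC socket on the node.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// List all pods and the devices (e.g. nvidia.com/gpu) allocated to them.
	client := podresources.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresources.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}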

Either way, the output should look something like this:

{
  "pod_resources": [
    {
      "name": "cuda-vector-add",
      "namespace": "default",
      "containers": [
        {
          "name": "cuda-vector-add",
          "devices": [
            {
              "resource_name": "nvidia.com/gpu",
              "device_ids": [
                "GPU-b9f9e81b-bee7-34bc-af17-132ef6592740"
              ]
            }
          ]
        }
      ]
    }
  ]
}

I am interested in seeing the entries with "resource_name": "nvidia.com/gpu".

@ltm920716
Author

Hi @nvvfedorov

  • k8s-device-plugin version
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   containerd://efb6d73447e400fe26741f395df06945decf03d6e313cd66ff8e9589c665110e
    Image:          nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    Image ID:       nvcr.io/nvidia/k8s-device-plugin@sha256:ec049661909586576f2ac8fdc05820053fe1e90d3b809abf4fa17dac540ce38b
  • k8s-device-plugin configuration
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "cbr0",
          "interface": "eth0",
          "ips": [
              "10.244.1.40"
          ],
          "mac": "d2:5d:de:8a:78:d1",
          "default": true,
          "dns": {},
          "gateway": [
              "10.244.1.1"
          ]
      }]
  creationTimestamp: "2024-11-27T06:59:30Z"
  generateName: nvidia-device-plugin-daemonset-
  labels:
    controller-revision-hash: 94b784557
    name: nvidia-device-plugin-ds
    pod-template-generation: "1"
  name: nvidia-device-plugin-daemonset-n2kxh
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-device-plugin-daemonset
    uid: 709d49fd-b5fd-4528-8c64-0cef066cd620
  resourceVersion: "1141716"
  uid: f01f6e93-9a06-4f89-a4dc-cd1eec52c8fe
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - k8s-node2
  containers:
  - env:
    - name: FAIL_ON_INIT_ERROR
      value: "false"
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    imagePullPolicy: IfNotPresent
    name: nvidia-device-plugin-ctr
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/device-plugins
      name: device-plugin
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-mpjmz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k8s-node2
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/device-plugins
      type: ""
    name: device-plugin
  - name: kube-api-access-mpjmz
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T06:59:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T07:03:11Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T07:03:11Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T06:59:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://bd4c7a34a6357ef8368a013fe06e27b2bf875cbbcc756a6e659bc75a583402b5
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    imageID: nvcr.io/nvidia/k8s-device-plugin@sha256:ec049661909586576f2ac8fdc05820053fe1e90d3b809abf4fa17dac540ce38b
    lastState: {}
    name: nvidia-device-plugin-ctr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-11-27T07:03:11Z"
  hostIP: 192.168.10.231
  phase: Running
  podIP: 10.244.1.40
  podIPs:
  - ip: 10.244.1.40
  qosClass: BestEffort
  startTime: "2024-11-27T06:59:30Z"
  • dcgm version
dcgm-exporter-1732699115        default         1               2024-11-27 17:18:37.085621387 +0800 CST deployed        dcgm-exporter-3.6.1     3.6.1
Image:         nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
  • I start a training job that uses two nodes; each node uses 2 GPUs
    Image
  • on node1, exec nvidia-smi
    Image
    Image
  • on node1, exec ./client | jq
    Image

and then, let us look at the metric info:

  • we can see the pod info
    Image
    Image

and we can see that the metrics are discontinuous, even though the training job is always using the GPUs. When I look in Grafana, the pod metrics disappear at intervals.
