
pod metric discontinuity #428

Open
ltm920716 opened this issue Dec 2, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@ltm920716

What is the version?

newest

What happened?

The pod metrics are discontinuous, as shown below:
Image

What did you expect to happen?

Continuous pod metrics

What is the GPU model?

No response

What is the environment?

pod

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

@ltm920716 ltm920716 added the bug Something isn't working label Dec 2, 2024
@ltm920716
Author

Image

@nvvfedorov
Collaborator

The dcgm-exporter works in the following way:

  1. When no pod is running on and using a specific GPU, dcgm-exporter returns metrics for that GPU without a pod label.
  2. When a pod is running on and using a specific GPU, dcgm-exporter returns metrics with a pod label, indicating that the GPU is being used by that pod.

In other words, when you see metrics without a pod label, it means no pod is running on that GPU at the moment.
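
One way to check which case you are in is to scrape the exporter's /metrics endpoint directly and look at the label sets over time. Below is a minimal Go sketch, not part of dcgm-exporter; the address (localhost:9400), the DCGM_FI_DEV_GPU_UTIL field, and the "pod"/"namespace"/"gpu" label names are assumptions based on a default deployment.

// check_pod_labels.go - a minimal sketch: scrape dcgm-exporter's /metrics
// endpoint and print, for DCGM_FI_DEV_GPU_UTIL, which samples carry a
// "pod" label. Address, metric name, and label names are assumptions.
package main

import (
	"fmt"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// sampleValue handles gauge, counter, or untyped exposition of a DCGM field.
func sampleValue(m *dto.Metric) float64 {
	switch {
	case m.GetGauge() != nil:
		return m.GetGauge().GetValue()
	case m.GetCounter() != nil:
		return m.GetCounter().GetValue()
	default:
		return m.GetUntyped().GetValue()
	}
}

func main() {
	resp, err := http.Get("http://localhost:9400/metrics") // assumed exporter address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	fam, ok := families["DCGM_FI_DEV_GPU_UTIL"]
	if !ok {
		fmt.Println("DCGM_FI_DEV_GPU_UTIL is not exposed")
		return
	}
	for _, m := range fam.GetMetric() {
		labels := map[string]string{}
		for _, lp := range m.GetLabel() {
			labels[lp.GetName()] = lp.GetValue()
		}
		fmt.Printf("gpu=%s pod=%q namespace=%q util=%v\n",
			labels["gpu"], labels["pod"], labels["namespace"], sampleValue(m))
	}
}

Running this in a loop while the workload is active shows whether the pod label itself disappears or whether the whole series stops being exposed.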

@ltm920716
Author

In other words, when you see metrics without a pod label, it means no pod is running on that GPU at the moment.

hi @nvvfedorov

thanks for your reply, but the pods in the image are all using GPUs. Some are LLM services that occupy a certain amount of GPU memory (their utilization may not always be > 0), and some are training jobs whose utilization is always > 0. You can see that the metrics disappear at intervals, and that is the point.

@nvvfedorov
Collaborator

@ltm920716, What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration and version? The k8s-device-plugin is the source of information about which pods are mapped to which GPUs.

You can troubleshoot by building this utility: https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main. Unfortunately, kubectl doesn't provide commands to list the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :( If you have access to the K8S node where you run the workload, try running the client on that node.
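
If building that tool is inconvenient, a minimal Go client along the following lines should produce the same information. This is only a sketch: it assumes the default kubelet socket path /var/lib/kubelet/pod-resources/kubelet.sock and the v1alpha1 API mentioned above.

// list_pod_resources.go - a minimal sketch of a podresources API client.
// The socket path below is the kubelet default and is an assumption;
// adjust it for your node if needed.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresources "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's pod-resources gRPC socket on the node.
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// List all pods and the devices (e.g. nvidia.com/gpu) allocated to them.
	client := podresources.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresources.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}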

Either way, the output should look something like this:

{
  "pod_resources": [
    {
      "name": "cuda-vector-add",
      "namespace": "default",
      "containers": [
        {
          "name": "cuda-vector-add",
          "devices": [
            {
              "resource_name": "nvidia.com/gpu",
              "device_ids": [
                "GPU-b9f9e81b-bee7-34bc-af17-132ef6592740"
              ]
            }
          ]
        }
      ]
    }
  ]
}

I am interested in seeing the entries with "resource_name": "nvidia.com/gpu".

@ltm920716
Author

Hi @nvvfedorov

  • k8s-device-plugin version
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   containerd://efb6d73447e400fe26741f395df06945decf03d6e313cd66ff8e9589c665110e
    Image:          nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    Image ID:       nvcr.io/nvidia/k8s-device-plugin@sha256:ec049661909586576f2ac8fdc05820053fe1e90d3b809abf4fa17dac540ce38b
  • k8s-device-plugin configuration
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "cbr0",
          "interface": "eth0",
          "ips": [
              "10.244.1.40"
          ],
          "mac": "d2:5d:de:8a:78:d1",
          "default": true,
          "dns": {},
          "gateway": [
              "10.244.1.1"
          ]
      }]
  creationTimestamp: "2024-11-27T06:59:30Z"
  generateName: nvidia-device-plugin-daemonset-
  labels:
    controller-revision-hash: 94b784557
    name: nvidia-device-plugin-ds
    pod-template-generation: "1"
  name: nvidia-device-plugin-daemonset-n2kxh
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-device-plugin-daemonset
    uid: 709d49fd-b5fd-4528-8c64-0cef066cd620
  resourceVersion: "1141716"
  uid: f01f6e93-9a06-4f89-a4dc-cd1eec52c8fe
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - k8s-node2
  containers:
  - env:
    - name: FAIL_ON_INIT_ERROR
      value: "false"
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    imagePullPolicy: IfNotPresent
    name: nvidia-device-plugin-ctr
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet/device-plugins
      name: device-plugin
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-mpjmz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k8s-node2
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet/device-plugins
      type: ""
    name: device-plugin
  - name: kube-api-access-mpjmz
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T06:59:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T07:03:11Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T07:03:11Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-11-27T06:59:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://bd4c7a34a6357ef8368a013fe06e27b2bf875cbbcc756a6e659bc75a583402b5
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
    imageID: nvcr.io/nvidia/k8s-device-plugin@sha256:ec049661909586576f2ac8fdc05820053fe1e90d3b809abf4fa17dac540ce38b
    lastState: {}
    name: nvidia-device-plugin-ctr
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-11-27T07:03:11Z"
  hostIP: 192.168.10.231
  phase: Running
  podIP: 10.244.1.40
  podIPs:
  - ip: 10.244.1.40
  qosClass: BestEffort
  startTime: "2024-11-27T06:59:30Z"
  • dcgm version
dcgm-exporter-1732699115        default         1               2024-11-27 17:18:37.085621387 +0800 CST deployed        dcgm-exporter-3.6.1     3.6.1
Image:         nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
  • I start a training job that uses two nodes; each node uses 2 GPUs
    Image
  • on node1, exec nvidia-smi
    Image
    Image
  • on node1, exec ./client | jq
    Image

and then, let us look at the metric info:

  • we can see the pod info
    Image
    Image

and we can see that the metrics are discontinuous, even though the training job is always using the GPUs. When I look in Grafana, the pod metrics disappear at intervals.
