Cannot collect GPU utilization metric for some pods when MIG is enabled #397

Open
melikeiremguler opened this issue Oct 8, 2024 · 1 comment

melikeiremguler commented Oct 8, 2024

What is the version?

3.1.8-3.1.5-ubuntu20.04

What happened?

We have been using the GPU Operator in our Kubernetes cluster. GPU Operator Helm chart version: gpu-operator-v23.6.1. Kubernetes version: v1.26.6.

I enabled MIG on one node; the node labels are shown below. I also deployed a test app (its manifest is shown below as well).
When I port-forwarded the dcgm-exporter pod on k8s-node-worker-2, only 5 of the pods had a DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.

kubectl port-forward pod/nvidia-dcgm-exporter-qttj5 9400:9400
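
With the port-forward running, the metrics endpoint can be scraped locally; a minimal sketch (the exporter pod name is the one from my cluster):

curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE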

Some pods have no metric, but when I check them I can see GPU usage. Also, this problem does not occur with the A100-80GB card.

kubectl exec -it gpu-test-59cd4d464-jdk46 -- bash
root@gpu-test-59cd4d464-jdk46:/# nvidia-smi

The Node Labels

{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.GFNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.X87": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
  "feature.node.kubernetes.io/cpu-model.family": "6",
  "feature.node.kubernetes.io/cpu-model.id": "106",
  "feature.node.kubernetes.io/cpu-model.vendor_id": "Intel",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.15.0-94-generic",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "15",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/pci-10de.present": "true",
  "feature.node.kubernetes.io/pci-1af4.present": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "ubuntu",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "20.04",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "20",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "04",
  "k8slens-edit-resource-version": "v1",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "k8s-node-worker-2",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/gpu-operator": "",
  "nvidia.com/cuda.driver.major": "535",
  "nvidia.com/cuda.driver.minor": "104",
  "nvidia.com/cuda.driver.rev": "05",
  "nvidia.com/cuda.runtime.major": "12",
  "nvidia.com/cuda.runtime.minor": "2",
  "nvidia.com/gfd.timestamp": "1727789534",
  "nvidia.com/gpu-driver-upgrade-state": "upgrade-done",
  "nvidia.com/gpu.compute.major": "9",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "7",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.mig-manager": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "true",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.engines.copy": "1",
  "nvidia.com/gpu.engines.decoder": "1",
  "nvidia.com/gpu.engines.encoder": "0",
  "nvidia.com/gpu.engines.jpeg": "1",
  "nvidia.com/gpu.engines.ofa": "0",
  "nvidia.com/gpu.family": "hopper",
  "nvidia.com/gpu.machine": "HPC",
  "nvidia.com/gpu.memory": "11008",
  "nvidia.com/gpu.multiprocessors": "16",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-H100-NVL-MIG-1g.12gb",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/gpu.slices.ci": "1",
  "nvidia.com/gpu.slices.gi": "1",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.config": "all-1g.12gb",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "single"
}
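
For reference, a sketch of how the nvidia.com/* labels above can be listed (assumes jq is available on the machine running kubectl):

kubectl get node k8s-node-worker-2 -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'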

Test App

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 7
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      hostPID: true
      containers:
        - name: cuda-sample-vector-add
          image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: k8s-node-worker-2
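
The deployment can be applied and checked as follows (a sketch, assuming the manifest is saved as gpu-test.yaml):

kubectl apply -f gpu-test.yaml
kubectl get pods -l app=gpu-test -o wide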

Port-forward Metric Output

# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-8mg7j"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="10",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-zlbl2"} 0.003227
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="11",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-pc27w"} 0.003653
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="12",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-gqzxm"} 0.003896
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="13",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-lt4fj"} 0.003856
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
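
With the port-forward above still running, the number of exported series can be compared against the number of running test pods (a sketch; on this node it shows 5 series for 7 pods):

curl -s localhost:9400/metrics | grep -c '^DCGM_FI_PROF_GR_ENGINE_ACTIVE{'
kubectl get pods -l app=gpu-test --no-headers | wc -l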

Usage of Pod With No Metric

root@gpu-test-59cd4d464-jdk46:/# nvidia-smi
Tue Oct  8 10:33:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000000:00:06.0 Off |                   On |
| N/A   70C    P0             127W / 400W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    7   0   0  |              20MiB / 11008MiB  | 16      0 |  1   0    1    0    1 |
|                  |               2MiB /     7MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    7    0    2011028      C   /cuda-samples/vectorAdd                      10MiB |
+---------------------------------------------------------------------------------------+
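
To compare the GPU instances DCGM sees with the GPU_I_ID labels in the metrics above, the devices can also be listed from the exporter pod, which has all GPUs visible; a sketch (the gpu-operator namespace is an assumption for an operator-managed install):

kubectl exec -n gpu-operator nvidia-dcgm-exporter-qttj5 -- nvidia-smi -L
kubectl exec -n gpu-operator nvidia-dcgm-exporter-qttj5 -- nvidia-smi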

What did you expect to happen?

I should see the metric for all pods.

What is the GPU model?

h100-nvl

What is the environment?

DCGM Exporter running in a pod

How did you deploy the dcgm-exporter and what is the configuration?

I use the GPU Operator.
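
The dcgm-exporter settings applied by the operator can be inspected from the ClusterPolicy; a sketch assuming the default resource name cluster-policy:

kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.dcgmExporter}'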

How to reproduce the issue?

No response

Anything else we need to know?

No response


Natelu commented Nov 5, 2024

I hit the same issue on an A100 PCIe with MIG enabled. I can also find DCGM_FI_DEV_GPU_UTIL in /etc/dcgm-exporter/dcp-metrics-included.csv, as shown below:

# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
...

Yet dcgm-exporter still does not expose the GPU_UTIL metric.
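
A quick way to confirm the series is absent (a sketch, assuming the exporter pod is port-forwarded on 9400):

curl -s localhost:9400/metrics | grep -c '^DCGM_FI_DEV_GPU_UTIL{'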

dcgm-exporter version

dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
