What is the version?
3.1.8-3.1.5-ubuntu20.04

What happened?
We have been using the GPU Operator in our Kubernetes cluster.
GPU Operator Helm chart version: gpu-operator-v23.6.1
Kubernetes version: v1.26.6

I have enabled MIG on one node; you can see the node labels below. I also deployed a test app, and you can see its YAML below.

When I port-forward the dcgm-exporter pod on k8s-node-worker-2, only 5 pods have a DCGM_FI_PROF_GR_ENGINE_ACTIVE metric. Some pods have no metric even though they clearly show GPU usage when I inspect them. The problem does not occur with an A100-80GB card.
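Test App
The test app YAML itself is not reproduced here; as a rough sketch, a Deployment of the following shape would produce pods like the ones in the outputs below. The container name and the vectorAdd path come from the metric labels and the nvidia-smi output; the image tag, replica count, and MIG strategy are assumptions.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-test
    spec:
      replicas: 6                  # at least six gpu-test pods appear in the outputs below
      selector:
        matchLabels:
          app: gpu-test
      template:
        metadata:
          labels:
            app: gpu-test
        spec:
          containers:
          - name: cuda-sample-vector-add       # matches the container label in the metrics
            image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04   # assumed image tag
            command: ["/bin/sh", "-c"]
            args: ["while true; do /cuda-samples/vectorAdd; done"]   # keep the MIG slice busy
            resources:
              limits:
                nvidia.com/mig-1g.11gb: 1      # assumes the mixed MIG strategy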
Port-forward Metric Output
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="8",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-8mg7j"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="10",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-zlbl2"} 0.003227
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="11",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-pc27w"} 0.003653
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="12",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-gqzxm"} 0.003896
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-9e42cf09-8f38-25f9-f67c-630936298703",device="nvidia0",modelName="NVIDIA H100 NVL",GPU_I_PROFILE="1g.11gb",GPU_I_ID="13",Hostname="nvidia-dcgm-exporter-qttj5",DCGM_FI_DRIVER_VERSION="535.104.05",container="cuda-sample-vector-add",namespace="default",pod="gpu-test-59cd4d464-lt4fj"} 0.003856
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
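For reference, the output above can be pulled with something like the following. The pod name comes from the Hostname label above; the gpu-operator namespace is an assumption based on a default GPU Operator install, and 9400 is dcgm-exporter's default port.

    kubectl -n gpu-operator port-forward pod/nvidia-dcgm-exporter-qttj5 9400:9400
    curl -s http://localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE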
Usage of Pod With No Metric
root@gpu-test-59cd4d464-jdk46:/# nvidia-smi
Tue Oct 8 10:33:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 NVL On | 00000000:00:06.0 Off | On |
| N/A 70C P0 127W / 400W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 7 0 0 | 20MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 7MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 7 0 2011028 C /cuda-samples/vectorAdd 10MiB |
+---------------------------------------------------------------------------------------+
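Note that this pod is running on GPU instance 7 (GI ID 7 in the nvidia-smi output above), which does not appear in the exporter output (only GI IDs 8 and 10-13 are reported). To cross-check which MIG entities DCGM itself enumerates, something along these lines can be used, assuming the dcgmi CLI is present in the dcgm-exporter image and the gpu-operator namespace is in use:

    # Inside the dcgm-exporter pod: list the GPUs and MIG instances DCGM can see.
    kubectl -n gpu-operator exec -it nvidia-dcgm-exporter-qttj5 -- dcgmi discovery -l

    # On the node itself: list the GPU instances configured on GPU 0.
    nvidia-smi mig -lgi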
What did you expect to happen?
I should see the metric for all pods.
What is the GPU model?
H100 NVL
What is the environment?
DCGM-Exporter running in a pod
How did you deploy the dcgm-exporter and what is the configuration?
I use the GPU Operator.
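For completeness, the chart version quoted above corresponds to an install along these lines. The release name, namespace, and mig.strategy value are assumptions; the repository URL is NVIDIA's standard Helm repo.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace \
      --version v23.6.1 \
      --set mig.strategy=mixed    # assumption; "single" is the chart default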
How to reproduce the issue?
No response
Anything else we need to know?
No response