@nvvfedorov Thank you for your reply. I hit this issue while using NVIDIA's GPU Operator to deploy nvidia-dcgm-exporter. Could this be a deployment configuration problem?
What is the version?
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
What happened?
DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="550.90.07",Hostname="node-3",UUID="GPU-xxxxx-35bc-edfc-df1d-b3d3145daba0",container="pytorch",device="nvidia2",gpu="2",instance="10.10.11.11:9400",job="gpu-metrics",kubernetes_node="node-3",modelName="NVIDIA H100 80GB HBM3",namespace="zlm",pod="pytorchjob-worker-2"} 113522
There are abnormal data points such as the sample above: DCGM_FI_DEV_GPU_UTIL reports 113522, far outside the valid utilization range. I don't know what causes this. How should I troubleshoot it?
What did you expect to happen?
DCGM_FI_DEV_GPU_UTIL should stay within 0~100, since GPU utilization is a percentage.
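One way to narrow this down is to scrape the exporter's metrics endpoint directly and filter for out-of-range samples; that shows whether the bad value originates in the exporter itself or somewhere downstream (Prometheus, relabeling, dashboards). A minimal sketch, assuming the endpoint address `10.10.11.11:9400` from the sample above (adjust for your pod):

```shell
# Scrape dcgm-exporter and print any DCGM_FI_DEV_GPU_UTIL samples
# whose value falls outside the expected 0-100 range.
# The value is always the last whitespace-separated field of a sample line.
curl -s http://10.10.11.11:9400/metrics \
  | awk '/^DCGM_FI_DEV_GPU_UTIL/ { v = $NF; if (v < 0 || v > 100) print }'
```

If the raw endpoint already reports 113522, the problem is on the DCGM/exporter side of that node, and cross-checking utilization with `nvidia-smi` or `dcgmi dmon` on the same node would be a reasonable next step.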
What is the GPU model?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 65C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 55C P0 654W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 56C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 65C P0 652W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 66C P0 634W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 54C P0 630W / 700W | 62732MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 65C P0 663W / 700W | 62824MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 53C P0 626W / 700W | 62748MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
What is the environment?
pod
How did you deploy the dcgm-exporter and what is the configuration?
Deployed as a DaemonSet in Kubernetes.
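For reference, a DaemonSet deployment of this exporter typically looks like the minimal sketch below. The namespace, labels, and resource names are illustrative assumptions; only the image tag and metrics port 9400 are taken from the details reported above:

```yaml
# Hypothetical minimal DaemonSet excerpt for dcgm-exporter; not the
# exact manifest used here (the GPU Operator generates its own).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter        # illustrative name
  namespace: monitoring      # illustrative namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
        ports:
        - name: metrics
          containerPort: 9400   # port scraped by Prometheus
```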
How to reproduce the issue?
No response
Anything else we need to know?
No response