
DCGM_FI_DEV_GPU_UTIL abnormal point #418

Open
dafu-wu opened this issue Nov 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments


dafu-wu commented Nov 17, 2024

What is the version?

nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04

What happened?

DCGM_FI_DEV_GPU_UTIL{DCGM_FI_DRIVER_VERSION="550.90.07",Hostname="node-3",UUID="GPU-xxxxx-35bc-edfc-df1d-b3d3145daba0",container="pytorch",device="nvidia2",gpu="2",instance="10.10.11.11:9400",job="gpu-metrics",kubernetes_node="node-3",modelName="NVIDIA H100 80GB HBM3",namespace="zlm",pod="pytorchjob-worker-2"} 113522

[Image: graph of DCGM_FI_DEV_GPU_UTIL showing an abnormal spike]

There are abnormal points as shown above. I don't know what causes this phenomenon. How can I troubleshoot it?
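To check whether the out-of-range value comes from the exporter itself rather than from Grafana/Prometheus rendering, the raw /metrics endpoint can be read directly. A minimal sketch, assuming the exporter listens on 10.10.11.11:9400 (the instance label in the sample above) and that the samples carry no trailing timestamp:

```python
# Sketch: read the dcgm-exporter /metrics endpoint and flag any
# DCGM_FI_DEV_GPU_UTIL sample outside the expected 0-100 range.
# The URL is an assumption based on the "instance" label above.
import urllib.request

EXPORTER_URL = "http://10.10.11.11:9400/metrics"

with urllib.request.urlopen(EXPORTER_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    # Skip "# HELP" / "# TYPE" comment lines; keep only the GPU_UTIL samples.
    if not line.startswith("DCGM_FI_DEV_GPU_UTIL"):
        continue
    # Exposition format: metric{labels} value  (no trailing timestamp assumed).
    _, _, value = line.rpartition(" ")
    if not 0.0 <= float(value) <= 100.0:
        print("out-of-range sample:", line)
```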

What did you expect to happen?

0~100

What is the GPU model?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 65C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 55C P0 654W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 56C P0 650W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 65C P0 652W / 700W | 62768MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 66C P0 634W / 700W | 62844MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 54C P0 630W / 700W | 62732MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 65C P0 663W / 700W | 62824MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 53C P0 626W / 700W | 62748MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|

What is the environment?

pod

How did you deploy the dcgm-exporter and what is the configuration?

We use a DaemonSet to deploy it in k8s.

How to reproduce the issue?

No response

Anything else we need to know?

No response

dafu-wu added the bug (Something isn't working) label on Nov 17, 2024
@nvvfedorov
Collaborator

Thank you for reporting the issue. Please troubleshoot it in your environment using the command line: dcgmi dmon -e 203 -i 2, where:

-e 203 is the DCGM_FI_DEV_GPU_UTIL field;
-i 2 selects the GPU with GPU ID = 2.

We need to understand whether the issue is on the DCGM-exporter side or the DCGM side. Thank you in advance.
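If it helps to watch for spikes over a longer window, that command can be wrapped in a small script that flags any sample outside 0~100. A sketch only; it assumes dcgmi dmon prints header lines starting with '#' and ends each data row with the sampled value, so the parsing may need adjusting for your output:

```python
# Sketch: stream `dcgmi dmon -e 203 -i 2` and report any GPU utilization
# sample outside 0-100. Parsing assumes header lines start with '#' and
# the utilization is the last column of each data row.
import subprocess

proc = subprocess.Popen(
    ["dcgmi", "dmon", "-e", "203", "-i", "2"],
    stdout=subprocess.PIPE,
    text=True,
)
try:
    for raw in proc.stdout:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # header or blank line
        try:
            util = float(line.split()[-1])
        except ValueError:
            continue  # non-numeric row (e.g. column labels)
        if not 0.0 <= util <= 100.0:
            print("out-of-range sample:", line)
finally:
    proc.terminate()
```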


dafu-wu commented Nov 24, 2024

[Image: screenshot attached]
@nvvfedorov Thank you for your reply. I found this issue, as shown above. I use NVIDIA's GPU Operator to deploy nvidia-dcgm-exporter. Could this be a deployment configuration issue?
