pod metric discontinuity #428
The dcgm-exporter works in the following way:
In other words, when you see metrics without a pod label, it means the GPU is not running any pod at that moment.
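This behavior can be checked directly against a raw scrape of the exporter's `/metrics` endpoint. Below is a minimal sketch that flags samples with no `pod` label; the sample scrape text, pod names, and UUIDs are made up for illustration, while the metric name `DCGM_FI_DEV_GPU_UTIL` is a real dcgm-exporter metric:

```python
import re

# Made-up sample of dcgm-exporter output: GPU 0 is attributed to a pod,
# GPU 1 carries no "pod" label, which per the explanation above means
# no pod was using GPU 1 at scrape time.
SCRAPE = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaa",pod="train-job-0",namespace="default"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbb"} 0
"""

def samples_missing_pod_label(text):
    """Return (gpu, value) pairs for samples that carry no 'pod' label."""
    missing = []
    for line in text.splitlines():
        m = re.match(r'(\w+)\{([^}]*)\}\s+(\S+)', line)
        if not m:
            continue
        # Naive label parsing; fine here since the sample values
        # contain no commas or escaped quotes.
        labels = dict(kv.split("=", 1) for kv in m.group(2).split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
        if "pod" not in labels:
            missing.append((labels.get("gpu"), float(m.group(3))))
    return missing

print(samples_missing_pod_label(SCRAPE))  # -> [('1', 0.0)]
```

If a GPU that should be running a workload keeps showing up in this list, the pod-to-GPU mapping (not the GPU itself) is what to investigate.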
Hi @nvvfedorov, thanks for your reply. But the pods in the image all use GPUs: some are LLM services that occupy a fixed amount of GPU memory (their utilization may not always be > 0), and some are training jobs whose utilization is always > 0. You can see that the metrics disappear at intervals, and that is the point.
@ltm920716, What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration and version? The k8s-device-plugin is the source of information about which pods are mapped to which GPUs. You can troubleshoot by building this utility: https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main. Unfortunately, kubectl doesn't provide commands to list the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :( If you have access to the K8s node where you run the workload, try running the client on that node. As output of the command you should see a response something like this:
I am interested in seeing entries with "resource_name": "nvidia.com/gpu"....
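For reference, a PodResources `List` response rendered as JSON might look roughly like the sketch below. The field names follow the kubelet PodResources API; the pod name, namespace, and device ID are placeholders, and the exact formatting depends on the client tool used:

```json
{
  "pod_resources": [
    {
      "name": "llm-serving-0",
      "namespace": "default",
      "containers": [
        {
          "name": "worker",
          "devices": [
            {
              "resource_name": "nvidia.com/gpu",
              "device_ids": ["GPU-0f3c1a2b-..."]
            }
          ]
        }
      ]
    }
  ]
}
```

If a workload pod is missing from this list, or its entry has no "nvidia.com/gpu" device, dcgm-exporter cannot attach a pod label to that GPU's metrics.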
Hi @nvvfedorov,
And then, looking at the metrics, we can see that the metric is discontinuous, but the training job always uses the GPU. When I use Grafana, the pod metric disappears at intervals.
What is the version?
newest
What happened?
The pod metric is discontinuous, as below:
What did you expect to happen?
Continuous pod metric
What is the GPU model?
No response
What is the environment?
pod
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
No response