Not able to view GPU utilization metrics in OpenShift dashboard #1002
Comments
Hello Team, is there any update on the above issue?
Hello NVIDIA Team, can someone please help with the above?
Hi @shivamerla
@arpitsharma-vw Was this working before, and are you seeing it as a regression? Is the gpu-operator/dcgm-exporter configured with custom metrics or the default ones? I don't think there is any RBAC issue with the operator/dcgm-exporter itself, as the exporter uses the pod resources API, which provides metrics for all Pods using GPUs across all namespaces. Can you double-check that the RBAC setup in developer mode allows scraping any metrics at all? @cdesiniotis @tariq1890 can help debug further.
@shivamerla I think it is a custom one, as per below.
$ oc get daemonset nvidia-dcgm-exporter -o yaml | grep -i etc
And below is the file:
sh-5.1# cat /etc/dcgm-exporter/dcgm-metrics.csv
It looks like the DCGM_FI_DEV_GPU_UTIL metric is not included in the above file, although it is present in default-counters.csv.
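A quick way to confirm which counters are absent from the custom file is to check it against a list of required field names. The following is only a minimal sketch: `missing_metrics` is a hypothetical helper, and it assumes the CSV has been saved locally, that the field name is the first comma-separated column, and that lines starting with `#` are comments.

```shell
#!/bin/sh
# Hypothetical helper (not part of dcgm-exporter): print every required
# field name that does not appear in the first column of the given CSV.
# Assumption: lines beginning with '#' are comments and are ignored.
missing_metrics() {
    csv_file="$1"
    shift
    for metric in "$@"; do
        if ! awk -F',' -v m="$metric" '
            /^[[:space:]]*#/ { next }            # skip comment lines
            { gsub(/[[:space:]]/, "", $1) }      # trim the first column
            $1 == m { found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$csv_file"; then
            echo "$metric"
        fi
    done
}

# Example call against a local copy of the file:
#   missing_metrics dcgm-metrics.csv DCGM_FI_DEV_GPU_UTIL
```

Any name it prints is missing from the file. An empty result means every listed field is present, so the cause would have to be somewhere else.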
@shivamerla We are able to see the metrics after adding the metric below to the ConfigMap (console-plugin-nvidia-gpu):
DCGM_FI_DEV_GPU_UTIL
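For anyone repeating this, the edit can be done on a local copy of the metrics CSV first. This is a rough sketch under assumptions: `add_metric` is a hypothetical helper, and the three-column line format (field, Prometheus type, help text) is the one used in default-counters.csv, which may not match what a given ConfigMap expects. It only changes the local file. Updating the ConfigMap from that file is a separate step that is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: append a counter line to a local copy of the metrics
# CSV unless that field is already listed in the first column.
add_metric() {
    csv_file="$1"
    line="$2"
    field="${line%%,*}"    # text before the first comma, e.g. DCGM_FI_DEV_GPU_UTIL
    if grep -Eq "^[[:space:]]*${field}[[:space:]]*," "$csv_file"; then
        return 0           # already present, nothing to do
    fi
    printf '%s\n' "$line" >> "$csv_file"
}

# Example call (the file name is a placeholder for a local copy):
#   add_metric dcgm-metrics.csv 'DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).'
```

Running it twice with the same line leaves the file unchanged the second time, so it is safe to re-run.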
Environment:
OpenShift version: 4.16.10
nvidia-operator version: 24.6.1
Hello Team,
We are facing the issues below:
Issue 1:
In the administrator space, we are not able to view a few important metrics in the NVIDIA DCGM Exporter Dashboard, such as:
1: GPU utilization
2: GPU Framebuffer Mem Used
3: Tensor Core Utilization
We are able to view a few metrics, such as GPU temperature, but the metrics above are much more important for us.
Issue 2: In the developer space
We are not able to see any metrics in the NVIDIA DCGM Exporter Dashboard. We can see a few metrics in the administrator space, but none in the developer space. Is there any way to monitor GPU utilization per namespace as well, so that application teams can monitor GPU utilization in their own namespaces?
Issue 3: In the Compute > GPU section, we are not able to see any real-time utilization data. The GPU utilization metrics always show 0%.
I am attaching screenshots for all the issues.