DCGM-exporter pods stuck in Running State, Not getting Ready without GPU allocation. #385
Comments
@rohitreddy1698, please enable debug mode in dcgm-exporter by setting the environment variable DCGM_EXPORTER_DEBUG=true, then share the logs with us.
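With the Helm chart, this can be passed through the chart's extraEnv list; a minimal sketch, assuming otherwise default values:
```yaml
# Minimal sketch: enable dcgm-exporter debug logging via the chart's extraEnv value.
extraEnv:
  - name: DCGM_EXPORTER_DEBUG
    value: "true"
```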
@nvvfedorov, hi, sure. Here are the logs after setting the DEBUG variable to true.
The logs from the dcgm-exporter pods:
The error "Cannot perform the requested operation because NVML doesn't exist on this system." tell us that is something wrong with your K8S node configuration. Did you installed (NVIDIA container toolkit)[https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html]? |
Yes, I have installed the NVIDIA Container Toolkit. I already have pods using the GPU.
Also, one more confirmation: if I assign GPU resources to the dcgm-exporter pods, they work fine.
@nvvfedorov, did you have a chance to take a look at this? Thanks,
You may need to specify
I'm having the same issue. GKE cluster, with a V100 GPU.
```yaml
extraEnv:
  - name: DCGM_EXPORTER_DEBUG
    value: "true"
  - name: DCGM_EXPORTER_INTERVAL
    value: "10000"
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```
I ran
I'm using Google's autoprovisioned driver daemonsets, which did work for this node. A few select log statements:
Can you help me figure out where to go next? There is no
@rohitreddy1698 @petewall Not sure if this will help with your issue, but the extra Helm values in this guide proved to be the solution.
My GPU node pool is running COS, and my drivers were installed manually by provisioning this DaemonSet here.
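For context, on GKE nodes running COS the driver installer typically places the NVIDIA libraries under /home/kubernetes/bin/nvidia on the host, so the extra values generally boil down to mounting that directory into the exporter. A hedged sketch of what that might look like (the extraHostVolumes / extraVolumeMounts keys and the mount path are assumptions to illustrate the idea, not the guide's literal values):
```yaml
# Illustrative sketch only: mount the host driver directory so the exporter can find NVML.
# The value keys (extraHostVolumes / extraVolumeMounts) and paths are assumptions; check your
# dcgm-exporter chart version and the linked guide for the exact settings.
extraHostVolumes:
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia
extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
```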
@andehpants Thank you for your search skills! That article is very good. I also had to add a new priority class because system-node-critical was full. Here's that for completeness, since that's how I ended up in this thread myself.
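A minimal PriorityClass sketch along those lines (the class name and value below are hypothetical illustrations):
```yaml
# Hypothetical example: a dedicated priority class for the exporter DaemonSet.
# The name and value are illustrative; pick ones that fit your cluster's policy.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-monitoring-critical
value: 1000000
globalDefault: false
description: "Priority class for dcgm-exporter pods."
```
The DaemonSet can then reference it through priorityClassName, for example via the chart's values if your chart version exposes that setting.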
@andehpants, @archae0pteryx Thank you for your findings! What would you suggest adding to the README file to help other users?
@nvvfedorov Not sure, really. Adjacent to the initial impetus of this issue, I would certainly add a little note about the priority class and an example of how / why you might need to create one. TBH, I had never worked with priority classes nor knew of their existence... and I have my CKA, even. 🙃
Hi @archae0pteryx, thank you for the information!
Ask your question
Hi Team,
I am using dcgm-exporter, installed as a Helm chart with the default values.
I have other Milvus component pods (query node and index node) successfully scheduled, in the READY state, and running on GPU nodes.
Logging onto a pod and running the nvidia-smi command works. But the dcgm-exporter DaemonSet pods are stuck and never become Ready:
```
➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
dcgm-exporter-btsln 0/1 Running 0 46s
dcgm-exporter-c8gpg 0/1 Running 0 46s
dcgm-exporter-f9jd7 0/1 Running 0 46s
dcgm-exporter-xhs2v 0/1 Running 0 46s
dcgm-exporter-z4pz4 0/1 Running 0 46s
dcgm-exporter-zh854 0/1 Running 0 46s
vectordbbench-deployment-cf7974db6-r5scd 1/1 Running 0 41h
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-btsln
2024/09/03 10:31:11 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:31:11Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:31:11Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-03T10:31:12Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-09-03T10:31:12Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:31:12Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:31:12Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:31:12Z" level=info msg="Pipeline starting"
time="2024-09-03T10:31:12Z" level=info msg="Starting webserver"
time="2024-09-03T10:31:12Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:31:12Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
```
But when I assign the GPU as a resource to it, the pods deploy successfully:
```
➜ VectorDBBench git:(main) ✗ cat custom_values.yaml
resources:
limits:
nvidia.com/gpu: "1"
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
dcgm-exporter-8ds87 1/1 Running 0 4m16s
dcgm-exporter-8qd48 1/1 Running 0 4m16s
dcgm-exporter-d9hq7 1/1 Running 0 4m16s
dcgm-exporter-hsbbq 1/1 Running 0 4m16s
dcgm-exporter-t49tt 1/1 Running 0 4m16s
dcgm-exporter-xq57b 1/1 Running 0 4m16s
vectordbbench-deployment-cf7974db6-r5scd 1/1 Running 0 41h
➜ VectorDBBench git:(main) ✗
➜ VectorDBBench git:(main) ✗ kubectl logs dcgm-exporter-8ds87
2024/09/03 10:03:35 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
time="2024-09-03T10:03:35Z" level=info msg="Starting dcgm-exporter"
time="2024-09-03T10:03:35Z" level=info msg="DCGM successfully initialized!"
time="2024-09-03T10:03:35Z" level=info msg="Collecting DCP Metrics"
time="2024-09-03T10:03:35Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-09-03T10:03:35Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-03T10:03:35Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-03T10:03:35Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-09-03T10:03:35Z" level=info msg="Pipeline starting"
time="2024-09-03T10:03:35Z" level=info msg="Starting webserver"
time="2024-09-03T10:03:35Z" level=info msg="Listening on" address="[::]:9400"
time="2024-09-03T10:03:35Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
➜ VectorDBBench git:(main) ✗
```
But this blocks other services from using the GPU, dedicating it to the exporter instead.
Thanks,
Rohit Mothe