
All CUDA-capable devices are busy or unavailable #27

Open
DaBeIDS opened this issue Sep 18, 2023 · 2 comments
DaBeIDS (Collaborator) commented Sep 18, 2023

Hi,
it seems that the GPU is not fully available.

I tried to run infer_on_pdf.py inside the main terminal in the aicoe-osc-demo namespace. The model pod should actually have a GPU attached to it and would not be created at all if no GPU were available, but I got the following error message:

[Screenshot: "all CUDA-capable devices are busy or unavailable"]

Let me know if you need more details. I cannot extract any data at the moment.

DaBeIDS (Collaborator, Author) commented Sep 21, 2023

We figured out that the pod did not have a resource limit, so the scheduler never checked whether a GPU was actually free for it to use. As a result, the pod was placed on a GPU node whose GPU was already in use.
A temporary solution is:

spec:
  ..
  template:
    ..
    containers:
      ...
      resources:
        limits:
          nvidia.com/gpu: '1'

But that pins the GPU to the pod permanently. We still have to work out how to bring the GPU pod up and down only when it is actually triggered.
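For context, here is a minimal sketch of where that limit sits in a full Deployment manifest. The deployment name, labels, and image are hypothetical placeholders (not taken from this repo); the point is that declaring `nvidia.com/gpu` under `resources.limits` makes Kubernetes schedule the pod only onto a node with an unallocated GPU, which is the check that was missing here:

```yaml
# Sketch of the temporary fix, assuming the model runs as a Deployment.
# All names and the image below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-model          # hypothetical name
  namespace: aicoe-osc-demo
spec:
  replicas: 1                    # scaling this to 0 when idle is one way to "turn off" the GPU pod
  selector:
    matchLabels:
      app: inference-model
  template:
    metadata:
      labels:
        app: inference-model
    spec:
      containers:
        - name: inference
          image: example.org/inference:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: '1'   # scheduler only places the pod on a node with a free GPU
```

With the limit in place, the device plugin allocates a dedicated GPU to the pod instead of letting it land on an already-busy device. For the open question of starting the pod only when it is triggered, scaling `replicas` to 0 while no extraction job is running is one possible approach.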

eharrison24 commented
Mikhail is working on this task this week 11/27 - 12/2
