
All CUDA-capable devices are busy or unavailable #27

Open
DaBeIDS opened this issue Sep 18, 2023 · 2 comments
DaBeIDS (Collaborator) commented Sep 18, 2023

Hi,
it seems that the GPU is not fully available.

I tried to run infer_on_pdf.py inside the main terminal in the aicoe-osc-demo namespace. The model pod should actually have a GPU attached to it and would not be created at all if no GPU were available, but I got the following error message:

[Screenshot: "all CUDA-capable devices are busy or unavailable"]

Let me know if you need more details. I cannot extract any data at the moment.

DaBeIDS (Collaborator, Author) commented Sep 21, 2023

We figured out that the pod did not have a resource limit, so the scheduler never checked whether a GPU was actually free for it to use. As a result, the pod was placed on a GPU node whose GPU was already in use.
A temporary solution is:

spec:
  ..
  template:
    ..
    containers:
      ...
      resources:
        limits:
          nvidia.com/gpu: '1'

But that pins the GPU to the pod permanently. We still have to work out how to bring the GPU pod up and down only when it is actually triggered.
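For context, here is a minimal sketch of where that limit sits in a full Deployment manifest. The deployment name, labels, and image are hypothetical placeholders (not taken from this repo); the point is that declaring `nvidia.com/gpu` under `resources.limits` makes Kubernetes schedule the pod only onto a node with an unallocated GPU, which is the check that was missing here:

```yaml
# Sketch of the temporary fix, assuming the model runs as a Deployment.
# All names and the image below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-model          # hypothetical name
  namespace: aicoe-osc-demo
spec:
  replicas: 1                    # scaling this to 0 when idle is one way to "turn off" the GPU pod
  selector:
    matchLabels:
      app: inference-model
  template:
    metadata:
      labels:
        app: inference-model
    spec:
      containers:
        - name: inference
          image: example.org/inference:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: '1'   # scheduler only places the pod on a node with a free GPU
```

With the limit in place, the device plugin allocates a dedicated GPU to the pod instead of letting it land on an already-busy device. For the open question of starting the pod only when it is triggered, scaling `replicas` to 0 while no extraction job is running is one possible approach.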

eharrison24 commented
Mikhail is working on this task this week 11/27 - 12/2
