Random CUDA errors #36

Open
mingekko opened this issue Sep 2, 2024 · 1 comment
Labels
info needed Further information is requested

Comments

mingekko commented Sep 2, 2024

Hello!

About once every two weeks the following error appears for a few hours and then fixes itself:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I can't figure out what is causing this, especially since the error disappears after a few hours and then reappears after 1-2 weeks... Can anyone help me track down the cause?

This is my configuration:
[screenshot of endpoint configuration]
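
Not part of the original report, but since the message says the kernel error is reported asynchronously, the logged stack trace will point at a later API call rather than the failing one. A minimal debugging sketch, assuming the worker runs a PyTorch-based pipeline, that forces synchronous launches so the trace shows the real failure site:

```python
# Debugging sketch (not from the original report): CUDA reports kernel errors
# asynchronously, so the logged stack trace points at a later API call.
# CUDA_LAUNCH_BLOCKING=1 makes launches synchronous so the trace shows the
# call that actually failed. It slows inference, so enable it only while
# hunting the bug, and set it before CUDA is initialised.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable so it takes effect

x = torch.randn(8, device="cuda")  # any kernel launch now fails at the real site
print(x.sum().item())
```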

@davefojtik (Owner) commented

That error message is often caused by an incompatible GPU driver on the machine and is usually solved by disabling Cuda Malloc. But that is disabled in Fooocus by default as far as I know.
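
For reference, a minimal sketch of how the allocator backend can be checked or pinned at the PyTorch level; this is the generic PyTorch switch, not necessarily the exact Fooocus setting referred to above:

```python
# Sketch, assuming a recent PyTorch build on the worker. "backend:native"
# selects PyTorch's own caching allocator instead of cudaMallocAsync, which
# is the allocator usually implicated in driver-compatibility errors.
# The variable must be set before torch initialises CUDA.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:native")

import torch

# Confirm which backend is actually active ("native" or "cudaMallocAsync").
print(torch.cuda.get_allocator_backend())
```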

The fact that it appears randomly and "fixes itself after a few hours" suggests a problem with specific workers, for example ones that spawn when your normally used GPUs are low in availability.

I would suggest writing down which GPUs your workers normally use (you can see the GPU model by hovering over the rectangles representing individual workers in your endpoint details), and then checking which GPUs are in use when you encounter the error. Alternatively, you could go through your list of secondary GPU selections and try to find the problematic one right away, but that could be time-consuming if you have many models selected, since you need to change the endpoint settings, purge all active workers to spawn new ones, and test them.
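
A hedged sketch of what that bookkeeping could look like inside the worker itself (a hypothetical helper, not part of this repo), so failing jobs can be matched to hardware straight from the logs:

```python
# Hypothetical helper (not part of this repo): log the GPU model once per
# worker so failed jobs in the endpoint logs can be matched to hardware.
import logging
import torch

def log_worker_gpu() -> None:
    if not torch.cuda.is_available():
        logging.warning("No CUDA device visible to this worker")
        return
    props = torch.cuda.get_device_properties(0)
    logging.info(
        "Worker GPU: %s, %.0f GB VRAM, compute capability %d.%d",
        props.name, props.total_memory / 1e9, props.major, props.minor,
    )

# Call once at worker start-up, before the first job is handled.
log_worker_gpu()
```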

I basically use only 4090s at this point (which are also the most cost-effective for this task) and have never seen this error. So if you find a way to reproduce it frequently, or identify the GPU model that is causing it, definitely let us know.

@davefojtik davefojtik added the info needed Further information is requested label Sep 14, 2024