Random CUDA errors #36

Open
mingekko opened this issue Sep 2, 2024 · 1 comment
Labels
info needed Further information is requested

Comments

mingekko commented Sep 2, 2024

Hello!

About once every two weeks the following error appears for a few hours and then fixes itself:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

I can't figure out what is causing this, especially since the error disappears after a few hours and then reappears after 1-2 weeks... Can anyone help me track down the cause?

This is my configuration:
[screenshot of endpoint configuration]
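
Not part of the original report, but since the message says the kernel error is reported asynchronously, the logged stack trace will point at a later API call rather than the failing one. A minimal debugging sketch, assuming the worker runs a PyTorch-based pipeline, that forces synchronous launches so the trace shows the real failure site:

```python
# Debugging sketch (not from the original report): CUDA reports kernel errors
# asynchronously, so the logged stack trace points at a later API call.
# CUDA_LAUNCH_BLOCKING=1 makes launches synchronous so the trace shows the
# call that actually failed. It slows inference, so enable it only while
# hunting the bug, and set it before CUDA is initialised.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable so it takes effect

x = torch.randn(8, device="cuda")  # any kernel launch now fails at the real site
print(x.sum().item())
```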

@davefojtik (Owner) commented

That error message is often caused by an incompatible GPU driver on the machine and is usually solved by disabling Cuda Malloc. But that is disabled in Fooocus by default as far as I know.
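
For reference, a minimal sketch of how the allocator backend can be checked or pinned at the PyTorch level; this is the generic PyTorch switch, not necessarily the exact Fooocus setting referred to above:

```python
# Sketch, assuming a recent PyTorch build on the worker. "backend:native"
# selects PyTorch's own caching allocator instead of cudaMallocAsync, which
# is the allocator usually implicated in driver-compatibility errors.
# The variable must be set before torch initialises CUDA.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:native")

import torch

# Confirm which backend is actually active ("native" or "cudaMallocAsync").
print(torch.cuda.get_allocator_backend())
```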

The fact that it appears randomly and "fixes itself after a few hours" suggests a problem with specific workers, for example ones that spawn when your normally used GPUs are low in availability.

I would suggest writing down which GPUs your workers normally use (you can see the GPU model by hovering over the rectangles representing individual workers in your endpoint details), and then checking which GPUs are in use when you encounter the error. Alternatively, you could go through your list of secondary GPU selections and try to find the problematic one right away, but that could be time-consuming if you have many models selected, since you need to change the endpoint settings, purge all active workers to spawn new ones, and test them.
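
A hedged sketch of what that bookkeeping could look like inside the worker itself (a hypothetical helper, not part of this repo), so failing jobs can be matched to hardware straight from the logs:

```python
# Hypothetical helper (not part of this repo): log the GPU model once per
# worker so failed jobs in the endpoint logs can be matched to hardware.
import logging
import torch

def log_worker_gpu() -> None:
    if not torch.cuda.is_available():
        logging.warning("No CUDA device visible to this worker")
        return
    props = torch.cuda.get_device_properties(0)
    logging.info(
        "Worker GPU: %s, %.0f GB VRAM, compute capability %d.%d",
        props.name, props.total_memory / 1e9, props.major, props.minor,
    )

# Call once at worker start-up, before the first job is handled.
log_worker_gpu()
```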

I basically use only 4090s at this point (which are also the most cost-effective for this task) and have never seen this error. So if you find a way to reproduce it frequently, or identify the GPU model that is causing it, definitely let us know.

@davefojtik davefojtik added the info needed Further information is requested label Sep 14, 2024