So SLOW with NVidia GPU and Codestral model #6326
Comments
Models in GGUF format are processed by the llama.cpp subsystem, and llama.cpp doesn't (fully) support Codestral, as you can read at ggml-org/llama.cpp#8519. A quote from compilade's comment from August 17:
When Codestral is supported in llama.cpp, it will take some more time until that version of llama.cpp is integrated into TGW.
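For reference, TGW's llama.cpp loader builds on llama-cpp-python, so you can test outside the web UI whether a given GGUF file is supported at all. A minimal sketch, assuming llama-cpp-python is installed; the model path is a placeholder, substitute your own file:

```python
# Try loading a GGUF directly through llama-cpp-python (the same backend
# TGW's llama.cpp loader uses). An unsupported architecture fails here
# with the same error TGW would hit internally.
from llama_cpp import Llama

try:
    llm = Llama(
        model_path="models/codestral-22b.Q4_K_M.gguf",  # hypothetical path
        n_gpu_layers=-1,  # offload all layers to the GPU if possible
        n_ctx=4096,
        verbose=True,     # prints architecture and layer-offload info
    )
    out = llm("Hello", max_tokens=8)
    print(out["choices"][0]["text"])
except Exception as exc:
    # llama-cpp-python raises on architectures llama.cpp cannot load
    print(f"llama.cpp could not load this model: {exc}")
```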
@dlippold Note that the model referred to here is not the one discussed in that llama.cpp issue. In this case, maybe WSL2 adds too much overhead, but I don't know what I'm talking about here.
When I ask the chat any question, the performance is not very good: it takes a few seconds per word generated. I also see that GPU usage hits the roof, so my initial thought was that a 16 GB GeForce RTX 4060 Ti is not sufficient for the job. However, if I minimize the browser window, GPU usage goes down, and if I open it again a few seconds later I see the full answer to the question. It seems like most of the resources are consumed rendering the web page, not running the model.
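One way to confirm this is to time generation from a terminal, bypassing the browser entirely. TGW exposes an OpenAI-compatible API when started with the --api flag; a rough sketch assuming the default port 5000 (the exact fields in the response, such as usage, are an assumption about the payload):

```python
# Rough tokens/sec check against TGW's OpenAI-compatible API.
# If this is fast while the web UI feels slow, the bottleneck is
# the Gradio page rendering, not the model itself.
import time
import requests

URL = "http://127.0.0.1:5000/v1/completions"  # assumes default API port

payload = {"prompt": "Explain what a GGUF file is.", "max_tokens": 128}
start = time.time()
resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.time() - start

n_tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.2f} tok/s)")
```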
Describe the bug
For whatever reason, this setup runs SO SLOW in WSL2. I followed all the instructions to run a Codestral model, but it just runs at the speed of smell. It seems to be using both my CPU and GPU, so I'm not sure why it's so much slower than when I set everything up manually (which takes hours and still doesn't fully work right). Surely the trade-off for ease of installation can't be this much speed.
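One thing worth ruling out before blaming TGW itself is whether WSL2 can see the GPU at all, since a silent CPU fallback would produce exactly this kind of seconds-per-token speed. A quick sanity check from inside the TGW Python environment; note the llama.cpp loader uses its own CUDA build rather than PyTorch, but if WSL2 can't see the device, both fall back to CPU:

```python
# Sanity check: is the NVIDIA GPU visible from inside WSL2?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"VRAM free/total: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")
```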
Is there an existing issue for this?
Reproduction
Install using the start_wsl.bat
Screenshot
No response
Logs
System Info