Add support for ggllm.cpp #3357
Just my two cents: I personally don't recommend quantizing anything down to 2 bits, or to anything lower than 8 bits for that matter. The model degrades significantly.
Not technically a duplicate, but close enough to #3351. CTransformers is what would be used if ggllm.cpp were to be integrated.
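For context, loading a GGML Falcon model through ctransformers outside the webui looks roughly like the sketch below. The file path, quantization variant, and `gpu_layers` value are placeholders, not details taken from this thread:

```python
from ctransformers import AutoModelForCausalLM

# Placeholder path/filename -- point this at whichever GGML Falcon file you actually have.
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-40b-instruct.q3_K_S.bin",  # hypothetical file name
    model_type="falcon",  # selects ctransformers' Falcon (ggllm.cpp-derived) backend
    gpu_layers=60,        # layers to offload to the GPU; set 0 for CPU-only
)

print(llm("The Falcon models are"))
```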
Not sure I understood you correctly. I am using the 3-bit quantized version (
3-bit is worth using only on the larger model sizes (30B+), where it makes less of a difference. It does lower output quality significantly, but it is still worth it if it lets you run a larger model. 2-bit is pretty much useless, though.
Try a 2-bit model on a moderately difficult coding task and you will see what he is talking about. Anything that requires precision is out of the question. It is good that, for your use case, a lower quantization still produces output that makes sense.
ctransformers is on my radar, I'll merge one of the open PRs adding support soon. It's always a challenge to add new backends because they usually don't come with precompiled wheels.
I'm currently in the process of building pre-compiled wheels for CUDA 11.7. Fortunately, ctransformers already handles CUDA and non-CUDA builds internally, so a separate package won't be needed like with llama-cpp-python.
That's very nice to hear @jllllll.
This wheel includes CUDA binaries for both Windows and Linux. macOS is also supported through non-CUDA binaries.
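(As a side note that may postdate this exchange: recent ctransformers releases can also pull the CUDA runtime libraries from PyPI via `pip install ctransformers[cuda]`, while a plain `pip install ctransformers` gives the CPU-only build.)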
+1 on this request, mostly for running Falcon 40B quantized in GGML (in my case, on Apple Silicon).
I note #3351 and #3313 are now done, so does that mean that this is now working, or just that it's unblocked?
It should work.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Still doesn't work for me, I'm on a MacBook Pro M2 Max (Apple Silicon):
Installing collected packages: ctransformers
Then when trying to load the model after restarting the server:
Might be because it's tied to AVX2: |
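If the prebuilt wheel really is the issue, one possible workaround (untested here, so treat it as an assumption) is to have pip compile ctransformers from source on the M2 so the build targets the host CPU instead of assuming AVX2, e.g. `pip install ctransformers --no-binary ctransformers`; the ctransformers README also documents a `CT_METAL=1` variant of the same command for Metal acceleration.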
Falcon is one of the very few good multilingual models. Support for the Falcon family of models (7B / 40B) in text-generation-webui is currently very limited: 4-bit only, with poor performance, through AutoGPTQ. It also needs at least 35 GB of VRAM.
ggllm.cpp is optimized for running quantized versions of those models and runs much faster. It also supports quantization down to 2 bits, which allows running the 40B model on a single 24 GB GPU.
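As a rough back-of-the-envelope check on that claim: at an effective ~2.5-3 bits per weight, 40B parameters come to roughly 40e9 × 3 / 8 ≈ 15 GB of weights, which leaves headroom on a 24 GB card for the KV cache and scratch buffers (the exact footprint depends on the quantization mix and context length).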