
Add an option to enable --runtime-repack in llama.cpp #1860

Open
ekcrisp opened this issue Dec 9, 2024 · 2 comments

ekcrisp commented Dec 9, 2024

Is your feature request related to a problem? Please describe.
After updating to 0.3.4, ARM-optimized Q4_0_4_4 models are no longer supported by llama.cpp. Instead, loading one throws the error "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". This comes from this change in llama.cpp. I believe a flag must be added to llama-cpp-python and passed down to llama.cpp's internals to enable this feature.
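For reference, the failure reproduces with a plain load of one of the old ARM-optimized files (the model path here is a placeholder):

```python
from llama_cpp import Llama

# After 0.3.4, loading an old ARM-optimized GGUF fails at load time with:
#   "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking"
llm = Llama(model_path="./model-Q4_0_4_4.gguf")  # placeholder path
```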

Describe the solution you'd like
Add a flag for Llama instantiation that enables runtime repacking in llama.cpp.
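For illustration, the kind of API this is asking for might look like the sketch below; the `runtime_repack` keyword is hypothetical and does not exist in llama-cpp-python today:

```python
from llama_cpp import Llama

# Hypothetical sketch of the requested flag -- `runtime_repack` is not
# a real llama-cpp-python parameter; it stands in for whatever option
# would be threaded down to llama.cpp's repacking support.
llm = Llama(
    model_path="./model-Q4_0.gguf",  # plain Q4_0, repacked at load time
    runtime_repack=True,             # hypothetical kwarg proposed here
)
```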


ekcrisp commented Dec 10, 2024

After looking at the llama.cpp PR, it seems you just need to set the CMake flag `GGML_CPU_AARCH64` to ON. But when I tried `CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir`, performance with a Q4_0 model was far worse than with a Q4_0_4_4 GGUF before the update, and nothing in the verbose model output suggests the repacking was applied.
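One way to check whether the rebuilt wheel actually exercises the repack path is to load the model with `verbose=True` (an existing llama-cpp-python parameter) and inspect llama.cpp's load-time log; what the repack path prints, if anything, depends on how llama.cpp was built:

```python
from llama_cpp import Llama

# Load a plain Q4_0 model with verbose logging enabled so llama.cpp
# dumps its load-time tensor/buffer information to stderr; if the
# AArch64 repacking were active, one would expect some trace of it here.
llm = Llama(
    model_path="./model-Q4_0.gguf",  # placeholder path
    verbose=True,
)
```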


ekcrisp commented Dec 12, 2024

This may be a bug in llama.cpp. I will close this ticket if repacking works after this issue is closed upstream.
