Is your feature request related to a problem? Please describe.
After updating to 0.3.4, ARM-optimized Q4_0_4_4 models are no longer supported by llama.cpp. Instead, loading such a model throws the error "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". This comes from this change in llama.cpp. I believe a flag must be added to llama-cpp-python and passed down to llama.cpp internals to enable this feature.
Describe the solution you'd like
Add a flag to Llama instantiation that enables runtime repacking in llama.cpp.
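For illustration, a minimal sketch of what the requested flag could look like; the `runtime_repack` keyword is hypothetical and does not exist in the current llama-cpp-python API:

```python
from llama_cpp import Llama

# Hypothetical kwarg (not in the current API): it would need to be
# added to llama-cpp-python and forwarded to llama.cpp so that plain
# Q4_0 weights are repacked for AArch64 at load time.
llm = Llama(
    model_path="model-Q4_0.gguf",
    runtime_repack=True,
)
```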
After looking at the llama.cpp PR, it seems you just need to set the CMake flag `GGML_CPU_AARCH64` to ON. However, when I tried `CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir`, Q4_0 performance was far worse than the Q4_0_4_4 GGUF before the update, and nothing in the verbose model output suggests the setting was applied.
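As a rough sanity check (not a confirmed diagnostic for the repack path), the feature string of the installed build can be printed from Python to verify which wheel is actually being imported after the reinstall; `llama_print_system_info` is exposed by the low-level llama_cpp bindings:

```python
import llama_cpp

# Prints the compiled-in CPU feature string (e.g. NEON on ARM),
# which helps confirm the rebuilt wheel is the one being loaded
# after --force-reinstall.
print(llama_cpp.llama_print_system_info().decode())
```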