
Add an option to enable --runtime-repack in llama.cpp #1860

Open
ekcrisp opened this issue Dec 9, 2024 · 2 comments

ekcrisp commented Dec 9, 2024

Is your feature request related to a problem? Please describe.
After updating to 0.3.4, ARM-optimized Q4_0_4_4 models are no longer supported by llama.cpp. Instead, loading one throws the error "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". This comes from this change in llama.cpp. I believe a flag must be added to llama-cpp-python and passed down to llama.cpp's internals to enable this feature.
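For reference, the failure reproduces with a plain load of one of the old ARM-optimized files (the model path here is a placeholder):

```python
from llama_cpp import Llama

# After 0.3.4, loading an old ARM-optimized GGUF fails at load time with:
#   "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking"
llm = Llama(model_path="./model-Q4_0_4_4.gguf")  # placeholder path
```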

Describe the solution you'd like
Add a flag for Llama instantiation that enables runtime repacking in llama.cpp.
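For illustration, the kind of API this is asking for might look like the sketch below; the `runtime_repack` keyword is hypothetical and does not exist in llama-cpp-python today:

```python
from llama_cpp import Llama

# Hypothetical sketch of the requested flag -- `runtime_repack` is not
# a real llama-cpp-python parameter; it stands in for whatever option
# would be threaded down to llama.cpp's repacking support.
llm = Llama(
    model_path="./model-Q4_0.gguf",  # plain Q4_0, repacked at load time
    runtime_repack=True,             # hypothetical kwarg proposed here
)
```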


ekcrisp commented Dec 10, 2024

After looking at the llama.cpp PR, it seems you just need to set the CMake flag `GGML_CPU_AARCH64` to ON. But when I tried `CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir`, performance with a Q4_0 model was far worse than with a Q4_0_4_4 GGUF before the update, and nothing in the verbose model output suggests the repacking was applied.
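One way to check whether the rebuilt wheel actually exercises the repack path is to load the model with `verbose=True` (an existing llama-cpp-python parameter) and inspect llama.cpp's load-time log; what the repack path prints, if anything, depends on how llama.cpp was built:

```python
from llama_cpp import Llama

# Load a plain Q4_0 model with verbose logging enabled so llama.cpp
# dumps its load-time tensor/buffer information to stderr; if the
# AArch64 repacking were active, one would expect some trace of it here.
llm = Llama(
    model_path="./model-Q4_0.gguf",  # placeholder path
    verbose=True,
)
```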


ekcrisp commented Dec 12, 2024

This may be a bug in llama.cpp. I will close this ticket if repacking works after this issue is closed upstream.
