Misc. bug: Q4_0 with runtime repacking not working as expected (TYPE_Q4_0_4_4 REMOVED) #10757
Comments
I would have thought maybe you compiled it wrong, but I see that is not the case. Can you try IQ4_NL? That should also have repacking, I think.
I am facing exactly the same issue. I am using a Raspberry Pi 5 (ARM Cortex-A76) for inference. Q4_0_4_4 was much faster with an older commit, and the latest version no longer supports it (TYPE Q4_0_4_4 REMOVED). On using Q4_0 with the latest version, runtime repacking is not working at all -
I went ahead and tested this on an x86_64 CPU and, to my surprise, the repacking worked as expected -
I can attach complete logs if needed; however, @smpurkis has already uploaded the same in this thread.
This PR has added a check on "ggml_arm_arch_features.has_dotprod": llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Lines 4127 to 4153 in ae4b922
llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c Lines 2394 to 2405 in ae4b922
Can you check whether this is wrong here:
ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP);
// should be:
ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
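For readers following along, a minimal standalone sketch (my own illustration, not llama.cpp code; aarch64 Linux only, assuming <asm/hwcap.h> provides the HWCAP_* bits) of why the logical AND defeats the check: hwcap && HWCAP_ASIMDDP only tests that both values are non-zero, while the bitwise AND actually isolates the dotprod bit.
#include <stdio.h>
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMDDP (aarch64 Linux)

int main(void) {
    unsigned long hwcap = getauxval(AT_HWCAP);
    // Logical AND: evaluates to 1 whenever hwcap is non-zero at all,
    // regardless of whether the dotprod bit is actually set.
    int wrong = !!(hwcap && HWCAP_ASIMDDP);
    // Bitwise AND: 1 only if the kernel reports the HWCAP_ASIMDDP bit.
    int right = !!(hwcap & HWCAP_ASIMDDP);
    printf("logical &&: %d, bitwise &: %d\n", wrong, right);
    return 0;
}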
I just tried that change from commit 92f77a6 and it is still not applying the runtime repacking on an arm64 CPU.
This indicates that llama.cpp was built without support for dotprod or i8mm, instruction sets that may be necessary to take advantage of the repacked types.
I'm running on a single-board computer which uses the Rockchip RK3588 chip. Q4_0 models are much slower than Q4_0_4_4 were, and I don't see anything in the verbose model output about repacking. I have confirmed the relevant compiler flag is present. Just curious, why completely remove support for Q4_0_X_X models? What's the harm in allowing both online repack and pretuned files for ARM optimizations? At runtime I know you can better ensure correct optimizations are applied for the platform being used, but for my use case fast model instantiation is a higher priority (how fast is the repack?). It seems like there should at least have been a deprecation warning, so you could alert users and give them time to update instead of breaking their apps.
Yeah, personally I also don't love it when thousands of files out in the wild are all simultaneously rendered useless by an update. I get this was bleeding-edge stuff, but as mentioned, online repacking is slower due to the load-time increase, and additionally anyone who has a Q4_0_X_Y file now has to go find the equivalent Q4_0 and download it just to maintain functionality. I can accept using online repacking for all support going forward, but removing the ability to run pre-repacked models feels off. Especially because I would almost push for the ability to locally repack, i.e. download Q4_0 and run a local repack on it.
I guess, what is meant by "support" here? Is there added complexity or are there issues with allowing both? The only justification for removal I can think of is wanting to ensure optimal repacks are always used. It seems the only difference in logic, if you let people run pre-repacked models, is that you don't have to perform the repack on instantiation. This wouldn't add complexity or make things harder to maintain in llama.cpp, right? If repacking is very fast (say, a fraction of a second on cheap hardware), then I suppose there isn't much of a tradeoff, but I don't know whether this is the case. @bartowski1182 your suggestion of doing a build-time repack seems like a good approach for applications which are not being deployed to different hardware platforms.
Added a note to the changelog. The goal is to reduce the number of quants that need to be made. We have several repack types and I expect that more will be added over time, and the current approach does not scale. Unfortunately, a consequence of the refactor that will allow us to add more repacking types more easily is that it requires removing these tensor types, so support had to be dropped. It was going to happen sooner or later either way, so it is better to do it now. If you have an issue with the performance of online repacking, open an issue and it will be improved over time.
What is wrong with having more quants for people to distribute if they choose? What about this doesn't scale? I think developers should be able to choose how they want to repack their models, no? If there is complexity in validating the tensor type up front, couldn't we have an explicit "skip repack" option and provide the metadata needed to run the gguf as if it were repacked at runtime? If devs do this incorrectly and have issues, it would be on them. Something like "Warning: runtime repack was skipped, tensor types cannot be validated". The build-time repack suggestion from @bartowski1182 definitely has value and would be better for applications that need fast instantiation and know what hardware they'll be running on.
I do agree that adding potentially 3-6 repacked varieties for potentially every quant type available, and all future ones, is likely not scalable. I think what would be nice is similar to what I mentioned: allowing a quant type to be saved to repacked format after the fact, and then some way to recognize on load that it's a repacked model and how to load it accordingly. This would allow Q4_0_4_4 to continue existing without needing to be manually created each time, while benefiting from improved load times for people who care about it, especially as the RAM capacities of ARM machines scale and we start loading 70B+ models.
It's not scalable if there are highly specific repacks for many different platforms and the only way to leverage them is to distribute them and have users cross-reference what they need with tensor type documentation. If you can essentially "cache" what is done at the repack stage, then you don't have to distribute many tensor types. If people choose to, so be it, but in llama.cpp it wouldn't matter: it would be treated as a user-supplied "cache" of a repack and special load-time logic wouldn't be needed.
Can you try to print these 4 values:
ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP);
ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
ggml_arm_arch_features.has_sve = !!(hwcap & HWCAP_SVE);
It looks like dotprod support is not detected, or it is not built with __ARM_FEATURE_DOTPROD: llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c Lines 13873 to 13879 in adffa6f
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
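To make the compile-time side concrete, here is a hedged illustrative sketch (not the actual llama.cpp kernel code, and dot_i8 is a hypothetical helper name): kernels that use the SDOT instruction only exist in the binary when the compiler predefines __ARM_FEATURE_DOTPROD, which typically requires an -march such as armv8.2-a+dotprod, so runtime CPU detection alone cannot enable them.
#include <arm_neon.h>
#include <stdint.h>

// Hypothetical helper: dot product of two 16-element int8 vectors.
// Which path is compiled is decided at build time by the predefined
// macro, not at run time.
int32_t dot_i8(const int8_t *a, const int8_t *b) {
    int8x16_t va = vld1q_s8(a);
    int8x16_t vb = vld1q_s8(b);
#if defined(__ARM_FEATURE_DOTPROD)
    // SDOT-based path: only compiled if the compiler was invoked with
    // e.g. -march=armv8.2-a+dotprod (or a newer baseline).
    int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
#else
    // Plain NEON fallback: widen to 16-bit products, then pairwise-add.
    int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
    int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    int32x4_t acc = vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi));
#endif
    return vaddvq_s32(acc);  // horizontal sum of the 4 lanes
}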
llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt Lines 129 to 130 in adffa6f
is only added for APPLE, if I'm right: llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt Lines 171 to 175 in adffa6f
Nothing is done for the other cases; neither fp16 nor dot8 looks to be checked?
Running on commit 64ae065, adding prints as follows.
The output when I run the reproduce build and run commands in the bug report.
FWIW, I see exactly the same problem with my Q4_0_4_4 models on Graviton. Online repacking doesn't seem to work with the default build:
CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Are there any extra flags we should build with?
I can't find the power supply for my ROCK5... so it's hard to test...
It looks good with the patch... For the RK3588 and Pi 5, can you test a build with
CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release -march=armv8.2a+dotprod
But I don't know if the defines __ARM_FEATURE_DOTPROD / __ARM_NEON are set, so maybe print what these return (int ggml_cpu_has_neon(void), etc.) alongside the detected features:
ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
ggml_arm_arch_features.has_sve = !!(hwcap & HWCAP_SVE);
printf("NEON: %d/%d\n", ggml_arm_arch_features.has_neon, ggml_cpu_has_neon());
printf("DotProd: %d/%d\n", ggml_arm_arch_features.has_dotprod, ggml_cpu_has_dotprod());
printf("I8MM: %d\n", ggml_arm_arch_features.has_i8mm);
printf("SVE: %d\n", ggml_arm_arch_features.has_sve);
If you have never activated __ARM_FEATURE_DOTPROD, even LLAMAFILE is not used with Q4_0. If someone can make the repacking work, I would be curious to see the different performances (ggml / llamafile / repacking), and maybe "ggml_cpu_has_dotprod/..." also needs to be dealt with in the other kernels (llamafile / ...).
So maybe on Linux you can try it.
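Building on that suggestion, here is a standalone diagnostic sketch (my own, not part of llama.cpp; aarch64 Linux only, and it assumes the system headers are recent enough to define HWCAP2_I8MM) that prints both what the kernel reports at run time and which feature macros the compiler defined, which is exactly the combination being debugged here:
#include <stdio.h>
#include <sys/auxv.h>   // getauxval, AT_HWCAP, AT_HWCAP2
#include <asm/hwcap.h>  // HWCAP_ASIMD, HWCAP_ASIMDDP, HWCAP_SVE, HWCAP2_I8MM

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

    // Runtime: what the kernel says the CPU supports.
    printf("runtime NEON:    %d\n", !!(hwcap  & HWCAP_ASIMD));
    printf("runtime DotProd: %d\n", !!(hwcap  & HWCAP_ASIMDDP));
    printf("runtime I8MM:    %d\n", !!(hwcap2 & HWCAP2_I8MM));
    printf("runtime SVE:     %d\n", !!(hwcap  & HWCAP_SVE));

    // Compile time: what the compiler was allowed to emit (depends on -march).
#if defined(__ARM_FEATURE_DOTPROD)
    printf("compiled with __ARM_FEATURE_DOTPROD\n");
#else
    printf("compiled WITHOUT __ARM_FEATURE_DOTPROD\n");
#endif
#if defined(__ARM_FEATURE_MATMUL_INT8)
    printf("compiled with __ARM_FEATURE_MATMUL_INT8 (i8mm)\n");
#else
    printf("compiled WITHOUT __ARM_FEATURE_MATMUL_INT8\n");
#endif
    return 0;
}
Compiling it once with default flags and once with an explicit -march should show whether the "compiled WITHOUT" case is the reason repacking is skipped.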
Neither IQ4_NL_4_4 nor Q4_0 formats are repacking for me on Rockchip boards running Debian or Ubuntu.
What do you use for the build? (IQ4_NL_4_4 => IQ4_NL)
Name and Version
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Problem description & steps to reproduce
Running on my arm64 server, I updated llama.cpp yesterday and tried to run a q4_0_4_4 model, and got an error message (because of this PR, #10446; this doesn't seem to be the bug that causes the slowdown), so I tried running it with the q4_0 model instead, i.e. ./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.q4_0.gguf, but it is much slower. I don't believe it is using the runtime repacking, otherwise it would be as fast as previous builds running q4_0_4_4.
To show what I mean, I ran 4 different setups: llama.cpp before and after PR #10446, running both q4_0_4_4 and q4_0 of a model, with models from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/tree/main
llama.cpp after PR
llama.cpp before PR
Running the commands
and
Key: (prompt t/s, generation t/s)
First Bad Commit
The first bad commit seems to be
All commits checked
Commands to check a good and bad commit.
Compile by running
rm -rdf build && cmake -B build && cmake --build build --config Release -j 4
Run using
A good run looks like
A bad run looks like
Relevant log output