Replies: 6 comments
-
Hey Nexes the Old, did you try
-
-
Hey IK, I was about to answer you, but of course, you made some magic happen already. Fantastic work, as always. A new SOTA 4.25 bpw GGML_TYPE quant is a huge boost. Can it be integrated into official LlamaCPP by moving the relevant sections of your ik files into their traditional equivalents in official LCPP?

As for quant mixes, on official LCPP I pass attn_v in Q6_K and attn_k in Q5_K for my >IQ3_M and IQ4_XS mixes when the vocab is above 128000. The ppl usually drops by more than 0.01, and I suspect it might help other indicators even more; for 180 MB on Llama 3 70b and later, that's a good trade. I also generally beef up the first and last layers' attn_k, attn_q, and ffns to the higher quant in all cases, because they are either the closest to the embeddings (as you were already doing in several quant mixes) or the last ones before the final output. I use an IQ3_XXL mix equivalent to your IQ3_KL. On top of a bumped ffn_down, I'll bump ffn_up more than ffn_gate to see if it brings a bonus compared to equalizing them. I also used several variants of your more_bits function to quantize layers to the higher quant in steps of 12.5%, according to my needs.

What I was wondering about is an LCPP-official-mergeable IQ4_XXS / IQ4_K_"XXS" GGML type (tensor-level quant) at 4-4.0625 bpw, if such a thing is possible and viable compared to an IQ3/IQ4 mix, to get rid of the IQ3_S I'm using, because on some models it is worse than Q3_K (Miqu attn_q and attn_output, for example; I observed some discrepancy on Qwen2 72b as well).

I speak about official LCPP because I was... unable to compile IK_Llama on MSVS, and I need official as the base for my fork of KoboldCPP, the inference software I modified and use with everything; rebasing it on your IK_Llama while I can't even compile it seems unviable to me. Moreover, I do not know your personal objectives or relations with the official LCPP project, but broad compatibility for your quants would allow people to... use them, and not waste compute, energy, and time on non-SOTA quants for their models.
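(For readers unfamiliar with these mixes: below is a rough, standalone sketch of the kind of per-tensor override described above — attn_v to Q6_K and attn_k to Q5_K for large-vocab models, plus bumping the first/last layers — loosely modelled on the tensor-type selection logic in llama.cpp's quantization code. The names `pick_type`, `use_more_bits`, etc. are illustrative stand-ins, not the actual mainline API.)

```cpp
#include <cstdint>
#include <string>

// Illustrative stand-ins for the real ggml/llama quant types.
enum qtype { Q3_K, Q4_K, Q5_K, Q6_K, IQ4_XS };

// Sketch: choose a quant type for one tensor of a ">IQ3_M / IQ4_XS"-style mix.
// base    - default type of the mix (e.g. IQ4_XS)
// name    - tensor name, e.g. "blk.12.attn_v.weight"
// i_layer - layer index of this tensor, n_layer - total layers
// n_vocab - vocabulary size of the model
static qtype pick_type(qtype base, const std::string & name,
                       int i_layer, int n_layer, int64_t n_vocab) {
    const bool big_vocab  = n_vocab > 128000;             // e.g. Llama 3 (128256)
    const bool edge_layer = i_layer == 0 || i_layer == n_layer - 1;

    if (name.find("attn_v.weight") != std::string::npos && big_vocab) {
        return Q6_K;                                       // attn_v bumped to Q6_K
    }
    if (name.find("attn_k.weight") != std::string::npos && big_vocab) {
        return edge_layer ? Q6_K : Q5_K;                   // attn_k to Q5_K, edges higher
    }
    // First and last layers sit closest to the embeddings / final output,
    // so ffn and attn_q also get one step more bits there.
    if (edge_layer && (name.find("ffn_")   != std::string::npos ||
                       name.find("attn_q") != std::string::npos)) {
        return Q5_K;
    }
    return base;
}

// "more_bits"-style helper: put roughly `fraction` of the layers (spread
// evenly) on the higher type, e.g. fraction = 0.125 for 12.5% steps.
static bool use_more_bits(int i_layer, int n_layer, float fraction) {
    const int step = fraction > 0.0f ? (int)(1.0f / fraction + 0.5f) : n_layer + 1;
    return (i_layer % step) == step - 1;
}
```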
-
The license is MIT, so obviously it can be integrated into mainline.
You could have opened an issue, no? With the output of the build process. I don't have access to a Windows box and Windows is certainly not my priority, but sometimes one can fix it just from the compiler error messages.
My personal objective is to have fun 😃 Quants are kind of orphaned in mainline and have become a "commodity", with tons of low quality quantized models being distributed on HuggingFace as GGUFs. Hence, people interested in (high quality) quantization work are better off here than mainline. Or people running on the CPU. Or people using models that run much faster here than in mainline also on the GPU (e.g., Gemma), etc. I do sync with mainline from time to time, but I did not see anything worth merging since I last synced in August. Am I missing something from mainline that you find essential?
Sure, one can spend a lot of time experimenting. I see your PR 8917 in mainline has not been merged. As I believe that a more flexible and convenient way to specify quantization mixes is definitely worth having, your PR is likely to be more successful here than there.
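(For context, a "more flexible way to specify quantization mixes" could look like the minimal sketch below: per-tensor overrides given as regex→type pairs. The syntax and names here are hypothetical, chosen only to illustrate the idea; this is not what PR 8917 actually implements.)

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical spec: "pattern=TYPE" pairs separated by commas,
// e.g. "attn_v\\.weight=Q6_K,blk\\.0\\..*=Q5_K".
struct override_rule { std::regex pattern; std::string type; };

static std::vector<override_rule> parse_overrides(const std::string & spec) {
    std::vector<override_rule> rules;
    size_t start = 0;
    while (start < spec.size()) {
        size_t end = spec.find(',', start);
        if (end == std::string::npos) end = spec.size();
        const std::string item = spec.substr(start, end - start);
        const size_t eq = item.find('=');
        if (eq != std::string::npos) {
            rules.push_back({ std::regex(item.substr(0, eq)), item.substr(eq + 1) });
        }
        start = end + 1;
    }
    return rules;
}

// First matching rule wins; otherwise keep the mix's default type.
static std::string type_for_tensor(const std::vector<override_rule> & rules,
                                   const std::string & tensor_name,
                                   const std::string & default_type) {
    for (const auto & r : rules) {
        if (std::regex_search(tensor_name, r.pattern)) return r.type;
    }
    return default_type;
}
```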
-
I submitted my PR 8917 here, as invited to. As for mainline, there's nothing essential for me since August, aside from maintaining some sort of compatibility with KCPP so I can attempt a rebase on your fork without breaking my head too hard, even if that might still be too hard. :D A PR maybe worth testing is this one, with a several-percent boost in PP & TG on my side on CUDA: ggerganov/llama.cpp#8366

For the compile problem, I could have opened an issue, but I was a bit discouraged by the idea that I could not even use your quants for my use case (KoboldCPP + ST; I look at Lollms with curiosity also). My bad, but a white knight came to fix that a day before a lovely IQ4_KSS appeared, so here I am; llama-server + ST it is for now.

As for the beef with mainline, well, I really regret that the quality and speed of inference seem to have slipped down the priority list. That already seemed to be the case when Johannes Gaessler developed the first 8-bit KV cache quant in late 2023. Anyway, I'm glad you keep having fun by blowing up the charts. Your work is really phenomenal, and I wish your quants would become the new baseline of the GGUF side of Hugging Face. But where would be the fun in that? :X
-
Hey IK,
It's been a while since you forked, and I wondered if you'd be willing to PR something close to a 4 bpw (3.8125-4.0625? I don't know) ggml type to LlamaCPP, if you have a viable one in store. The gap between IQ3_S and IQ4_XS is huge, and there are some reported problems with IQ3_S and IQ3_XXS, which can hurt hybrid IQ4_XS-based quants where attn_q and attn_output (or some layers of ffn_gate and ffn_up) are passed in IQ3_S to fit some VRAM configs.
Maybe with Johannes Gaessler's goodwill, it would make full offload of the 123b models viable on 64GB of VRAM, and of the 70b models viable on 36GB of VRAM.
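(Back-of-the-envelope weight-only sizes behind that claim, ignoring KV cache and runtime overhead, so a rough lower bound only; a ~4 bpw type keeps the weights just under those VRAM budgets.)

```cpp
#include <cstdio>

// Rough weight-only footprint of a model at a given bits-per-weight.
static double weight_gib(double n_params_b, double bpw) {
    return n_params_b * 1e9 * bpw / 8.0 / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    // ~4 bpw types in the 3.8125-4.0625 bpw range mentioned above.
    std::printf("123b @ 4.0625 bpw: %.1f GiB\n", weight_gib(123, 4.0625)); // ~58.2 GiB
    std::printf(" 70b @ 4.0625 bpw: %.1f GiB\n", weight_gib(70,  4.0625)); // ~33.1 GiB
    return 0;
}
```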
More broadly, your work is sorely missing on LCPP.
Cheers!