Introduce Q8_0 and Q4_0 with Bf16 delta values #7497
base: master
Conversation
Force-pushed from b9a5d91 to a9eaa9e
Additional note: ggml_vec_dot_q4_0_q8_0 does not contain any change. An additional function, ggml_vec_dot_q4_0_b16_q8_0_b16, was added for the new Q4_0_B16 type just after ggml_vec_dot_q4_0_q8_0. GitHub, however, shows a difference in the "Files modified" section for the ggml_vec_dot_q4_0_q8_0 function in ggml-quants.c.
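For readers skimming the diff, a scalar sketch of what such a *_b16 dot product computes is given below. This is not the PR's SIMD code; the block layouts, field names, nibble ordering, and the function name are assumptions that mirror the existing Q4_0/Q8_0 blocks, with the f16 delta replaced by a bf16 one.

```c
#include <stdint.h>
#include <string.h>

#define QK 32

typedef struct { uint16_t d; uint8_t qs[QK/2]; } blk_q4_0_b16; // 32 x 4-bit quants, bf16 delta (assumed layout)
typedef struct { uint16_t d; int8_t  qs[QK];   } blk_q8_0_b16; // 32 x 8-bit quants, bf16 delta (assumed layout)

// bf16 is the top 16 bits of an IEEE fp32, so decoding is a shift
static inline float bf16_decode(uint16_t h) {
    uint32_t u = (uint32_t) h << 16;
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}

// scalar reference of a Q4_0_B16 x Q8_0_B16 dot product over nb blocks
float vec_dot_q4_0_b16_q8_0_b16_ref(int nb, const blk_q4_0_b16 * x, const blk_q8_0_b16 * y) {
    float sum = 0.0f;
    for (int i = 0; i < nb; ++i) {
        int s = 0;
        for (int j = 0; j < QK/2; ++j) {
            s += ((x[i].qs[j] & 0x0F) - 8) * y[i].qs[j];        // low nibbles -> elements 0..15
            s += ((x[i].qs[j] >>   4) - 8) * y[i].qs[j + QK/2]; // high nibbles -> elements 16..31
        }
        sum += bf16_decode(x[i].d) * bf16_decode(y[i].d) * (float) s;
    }
    return sum;
}
```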
```diff
@@ -17,6 +17,7 @@ struct quant_option {

 static const std::vector<struct quant_option> QUANT_OPTIONS = {
     { "Q4_0",     LLAMA_FTYPE_MOSTLY_Q4_0,     " 3.56G, +0.2166 ppl @ LLaMA-v1-7B", },
+    { "Q4_0_B16", LLAMA_FTYPE_MOSTLY_Q4_0_B16, " 3.56G, 5.9624 +/- 0.03348 ppl @ LLaMA-v2-7B", },
```
The perplexity score mentioned here was derived by running perplexity.exe with the Q4_0_B16 quantized model of the Meta Llama 2 7B model. We were unsure of the methodology used for the other ppl scores listed here. Kindly share feedback if the score needs to be modified. Thanks.
That's great.
You may want to rebase this on top of master so the CI can completely pass (a CI fault has been bypassed for now).
Force-pushed from 135cec2 to 46c0cd7
@sorasoras, the optimization changes were primarily done with CPU SIMD instructions and were tested on a CPU backend. Thanks.
@mofosyne, the branch was rebased on top of the current master branch. Thanks.
Force-pushed from 46c0cd7 to 138cd22
Force-pushed from 2fd0a10 to eb1116a
Nice idea. I don't have much time to look into the details right now, but overall the implementation looks good. I'll need more time to test it. Also, I'm quite interested in the performance test on AVX256 because AFAIK … CC @jart for the tinyblas part.
gguf-py/gguf/constants.py (outdated)

```python
    Q4_0_B16 = 31
    Q8_0_B16 = 32
```
Suggested change:

```diff
-    Q4_0_B16 = 31
-    Q8_0_B16 = 32
+    Q4_0_B16 = 34
+    Q8_0_B16 = 35
```
These should be the same as in ggml.h.
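For context, a minimal sketch of where such values would slot into ggml.h's type enum. This is illustrative only: the enumerator names GGML_TYPE_Q4_0_B16 / GGML_TYPE_Q8_0_B16 are assumptions, and the values simply follow the suggestion above.

```c
// Sketch only: gguf-py's GGMLQuantizationType must mirror ggml.h's enum ggml_type,
// so the new entries take the next free values there as well.
enum ggml_type_sketch {
    // ... existing ggml_type entries (the last used value is assumed to be 33) ...
    GGML_TYPE_Q4_0_B16 = 34,
    GGML_TYPE_Q8_0_B16 = 35,
};
```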
This change was written by Srihari-mcw. These new quants are the same as Q8_0 and Q4_0 except that BF16 is used (instead of F16) as the scaling scalar. See ggerganov/llama.cpp#7497
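As a hedged illustration of why a bf16 scaling scalar is attractive: bf16 shares fp32's sign and exponent bits, so encoding the delta needs no fp16-style range handling. The helpers below are a sketch, not the repository's code; names are assumed and NaN handling is omitted. Real implementations typically round to nearest even rather than truncate.

```c
#include <stdint.h>
#include <string.h>

// truncate an fp32 to bf16 by keeping its top 16 bits (sign, exponent, 7 mantissa bits)
static inline uint16_t f32_to_bf16_trunc(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return (uint16_t) (u >> 16);
}

// round-to-nearest-even variant, closer to what production code typically does
// (NaN handling omitted for brevity)
static inline uint16_t f32_to_bf16_rne(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    u += 0x7FFF + ((u >> 16) & 1);   // rounding bias
    return (uint16_t) (u >> 16);
}
```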
Force-pushed from eb1116a to e9305da
It may be worth checking if there is some optimization issue with the implementation of these quants, because it is hard to imagine that a single fp16 to fp32 conversion per block could be so expensive. Currently, F16C is not used to convert between fp16 and fp32 because tests showed it to be slower than a lookup table. I suspect this is because the instruction has significant latency, but it should be possible to hide most of this latency by reordering the instructions and unrolling the loops.

My general view is that we already have more quant types than we should. Each quant type is a maintenance burden for us and for the backend developers, it makes the choice harder for the users, and it adds to the work of the people quantizing the models. We should avoid adding new types unless strictly necessary, and we should look into removing some of the outdated formats that have been effectively replaced by more efficient alternatives (which would likely include Q4_0).
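To make the latency-hiding idea concrete, here is a rough AVX2/F16C/FMA sketch, not llama.cpp's implementation: the f16 scale conversions of two blocks are issued up front so that their latency overlaps with the integer dot-product work. The block layout and names are simplified stand-ins for block_q8_0, and nb is assumed to be even.

```c
#include <immintrin.h>
#include <stdint.h>

typedef struct { uint16_t d; int8_t qs[32]; } blk_t;   // f16 delta + 32 int8 quants (stand-in)

// single-value f16 -> f32 via F16C; the instruction has multi-cycle latency
static inline float f16_to_f32(uint16_t h) {
    return _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(h)));
}

// requires AVX2 + F16C + FMA
float vec_dot_q8_like_unrolled(int nb, const blk_t * x, const blk_t * y) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < nb; i += 2) {
        // issue both scale conversions before the heavy integer work so the
        // out-of-order core can hide the F16C latency
        const float d0 = f16_to_f32(x[i + 0].d) * f16_to_f32(y[i + 0].d);
        const float d1 = f16_to_f32(x[i + 1].d) * f16_to_f32(y[i + 1].d);

        for (int j = 0; j < 2; ++j) {
            const __m256i qx = _mm256_loadu_si256((const __m256i *) x[i + j].qs);
            const __m256i qy = _mm256_loadu_si256((const __m256i *) y[i + j].qs);
            // widen int8 -> int16 and multiply-accumulate pairs into int32 lanes
            __m256i p = _mm256_madd_epi16(
                _mm256_cvtepi8_epi16(_mm256_castsi256_si128(qx)),
                _mm256_cvtepi8_epi16(_mm256_castsi256_si128(qy)));
            p = _mm256_add_epi32(p, _mm256_madd_epi16(
                _mm256_cvtepi8_epi16(_mm256_extracti128_si256(qx, 1)),
                _mm256_cvtepi8_epi16(_mm256_extracti128_si256(qy, 1))));
            acc = _mm256_fmadd_ps(_mm256_set1_ps(j == 0 ? d0 : d1),
                                  _mm256_cvtepi32_ps(p), acc);
        }
    }
    // horizontal sum of the 8 partial sums
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```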
I've created a branch on the llamafile repository where I've imported this pull request. I renamed your quantization formats. Here are your benchmarks on a CPU that supports AVX512F BF16.
That said, I don't think we really have much to gain in terms of performance here. Each q4/q8 block has a single f16 scalar, and SIMD doesn't help when you're dealing with scalars. This change goes too far out of its way to call …
Hi @slaren, @jart - Recently, similar changes of parallel delta value multiplication combined with loop unrolling for 4xN and Mx4 dimensions were tried with the existing quantization types with FP16 delta values, and we were able to observe performance gains on our platforms. The corresponding PR #8908 is attached here and the performance details are also attached for your reference. Please have a look. Thanks.

GCC Linux:

Meta Llama 2 7B model:
Q4_0 Model:
Q8_0 Model:

Mistral-7B-Instruct-v0.3 model:
Q4_0 Model:
Q8_0 Model:

GCC version = 12.3. The PR was tested on an AMD Raphael 7600X, which supports the following flags by default: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Original unquantized models:
Llama 2 7B: https://huggingface.co/meta-llama/Llama-2-7b
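As a rough illustration of the "parallel delta value multiplication" described in the comment above (not the code from PR #8908): instead of decoding one f16 delta at a time, the deltas of four adjacent blocks can be gathered and converted and multiplied with a single F16C + SSE sequence, which pairs naturally with 4xN / Mx4 loop unrolling. The helper name and calling convention are assumptions.

```c
#include <immintrin.h>
#include <stdint.h>

// returns dx[i] * dy[i] for four block pairs as one 4-wide float vector
// (requires F16C; dx/dy hold four f16 deltas gathered from four blocks)
static inline __m128 mul_deltas_x4(const uint16_t dx[4], const uint16_t dy[4]) {
    const __m128 fx = _mm_cvtph_ps(_mm_loadl_epi64((const __m128i *) dx)); // 4 x f16 -> f32
    const __m128 fy = _mm_cvtph_ps(_mm_loadl_epi64((const __m128i *) dy));
    return _mm_mul_ps(fx, fy);   // one multiply yields all four combined scales
}
```

The four combined scales can then be applied to the corresponding int32 partial sums, so the per-block scalar work largely drops out of the inner loop.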
PR #8908 was also tested on an AMD Ryzen Threadripper PRO 5995WX machine. Test results are attached below along with the supported flags and other details.

Performance results on AMD Ryzen Threadripper PRO 5995WX, GCC Linux:

Mistral-7B-Instruct-v0.3 model:
Q4_0 Model:
Q8_0 Model:

GCC version = 12.3. The machine supports the following flags by default: | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Original unquantized models:
Llama 2 7B: https://huggingface.co/meta-llama/Llama-2-7b
…pe - Based on 5f2e011e2eed2f685521c707b3e74280fcb81dd3 from llamafile
Force-pushed from e9305da to 43c7be5
That's an impressive achievement. Congratulations. Thanks for showing us the numbers. Now we just have to decide whether adding and maintaining a new quantization format forever is worth it to make znver4 go 13% faster :-)
We want to reiterate that, in PR #8908, we have retained the original quantization types with FP16 deltas (PR #8908 does not introduce a new quantization format). On Zen 4 (Raphael 7600X), the gains observed for prompt processing stand at approximately 35% and 20% for Q4_0 and Q8_0 respectively. On the Threadripper 5995WX, the gains observed for prompt processing stand at approximately 11% and 12.5% for Q4_0 and Q8_0. For more info, refer to PR #8908. Thanks.

Prompt processing test results with PR #8908 for the Mistral-7B-Instruct-v0.3 model (Q4_0 and Q8_0) on GCC Linux 12.3:

AMD Raphael 7600X (Zen 4):

AMD Ryzen Threadripper PRO 5995WX:

Notable differences in flags between the Threadripper 5995WX and the AMD Raphael 7600X: the Raphael 7600X supports AVX512, AVX512_VNNI, AVX512_VBMI, and AVX512_BF16, whereas the Threadripper 5995WX does not.
GCC Linux:
Q8_0 Model:
Q4_0 Model:

MSVC Windows:
Q8_0 Model:
Q4_0 Model:

The PR was tested on an AMD Raphael 7600X, which supports AVX512_BF16. AVX512_BF16 was enabled on Windows with cmake .. -DLLAMA_AVX512_BF16=ON

The models were quantized and tested from the Meta Llama 2 7B model - https://huggingface.co/meta-llama/Llama-2-7b