Implementations for Q4_0_8_8 quantization based functions in AVX2 SIMD architecture #8713
Conversation
ggml/src/ggml-aarch64.c (Outdated)
__m256i requiredOrder = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);

// Take group of four block_q8_0x4 structures at each pass of the loop and perform dot product operation
for (; y < nr / 4; y += 4) {
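For readers skimming the diff: when a constant like `requiredOrder` is used as the index operand of a cross-lane shuffle such as `_mm256_permutevar8x32_epi32`, the pattern (3, 2, 1, 0, 7, 6, 5, 4) swaps the lower and upper 128-bit halves of a 256-bit register. The snippet below is a minimal standalone illustration of that effect (an assumed usage, not code from the PR):

```c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // Lanes 0..3 hold 0,1,2,3 (low 128 bits); lanes 4..7 hold 4,5,6,7 (high 128 bits).
    __m256i v             = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i requiredOrder = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);

    // dst lane i = v[requiredOrder lane i] -> the two 128-bit halves trade places.
    __m256i swapped = _mm256_permutevar8x32_epi32(v, requiredOrder);

    int out[8];
    _mm256_storeu_si256((__m256i *) out, swapped);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);   // prints: 4 5 6 7 0 1 2 3
    printf("\n");
    return 0;
}
```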
ne11 is processed in batches of 16 in the GEMM function, and the leftover ne11 is processed in batches of four. We saw a higher performance boost when processing ne11 in batches of 16 with the leftover in batches of 4, versus processing all of ne11 in batches of four.
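A rough sketch of the loop structure being described, with variable names borrowed from the diff (the AVX2 micro-kernel bodies are elided and the exact bounds are an assumption, not the verbatim kernel):

```c
// nr = number of rows of the quantized activations (ne11); each block_q8_0x4
// packs 4 rows, so stepping y by 4 groups covers 16 rows per iteration.
int anr = nr - nr % 16;   // rows covered by the 16-row main loop
int y = 0;

// Main loop: four block_q8_0x4 groups (16 rows) per pass.
for (; y < anr / 4; y += 4) {
    // ... 16-row AVX2 GEMM micro-kernel ...
}

// Leftover: one block_q8_0x4 group (4 rows) per pass.
for (; y < nr / 4; y++) {
    // ... 4-row AVX2 GEMM micro-kernel ...
}
```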
this isn't a new conversion type, right? it's just a new way of calculating Q4_0?
Hi @bartowski1182, Q4_0_8_8 is a format of quantization where the values are stored in the same 4-bit quantized format, along with the same delta values, as Q4_0. The 4-bit quantized values across eight different blocks are interleaved with each other. This was introduced in PR #5780. Models that need to use this particular code path need to be quantized in the Q4_0_8_8 format. Thanks
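For readers unfamiliar with the layout, here is an illustrative sketch of the idea; the authoritative definitions live in ggml (see PR #5780), and the field names below are assumptions rather than the exact structs:

```c
#define QK4_0 32
typedef uint16_t ggml_half;   // fp16 storage type used by ggml

// Plain Q4_0: one fp16 delta plus 32 4-bit quants packed two per byte.
typedef struct {
    ggml_half d;
    uint8_t   qs[QK4_0 / 2];
} block_q4_0;

// Q4_0_8_8 idea: the data of 8 consecutive Q4_0 blocks, with the 8 deltas
// stored together and the 4-bit quants of the 8 blocks interleaved so a
// SIMD kernel can load contiguous lanes that span all 8 blocks at once.
typedef struct {
    ggml_half d[8];
    uint8_t   qs[QK4_0 / 2 * 8];
} block_q4_0x8;
```

In practice this means a model has to be produced with the Q4_0_8_8 quantization type for these kernels to be selected; a plain Q4_0 file will not take this code path.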
The main benefit from these changes should be in the prompt processing speed, not the text generation. Better to use
@Srihari-mcw Could you make a perplexity comparison before merging? For example against a Q4_0 model at PPL 32 chunks.
Branch force-pushed from 81d9078 to c950fc3.
Hi @ggerganov, the perplexity was measured for models quantized from the meta llama2 7B model with the following command:

It calculated perplexity over 655 chunks.

The perplexity results are tabulated as follows:

The perplexity readings were found to be almost the same across the tests. Further, with the latest changes in the master branch and in the PR, the performance readings are as follows.

GCC Linux:

Q4_0 Model:

GCC Version = 12.3. The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Thanks
It looks like the ROCm compiler is crashing when compiling this code, which is breaking the generation of Docker images.
* Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions
* Update code to fix issues occurring due to non-alignment of elements to be processed as a multiple of 16 in MSVC
* Update comments and indentation
* Make updates to reduce the number of load instructions
GCC Linux:

Q4_0 Model:
The models were quantized and tested from the meta-llama2 7B model: https://huggingface.co/meta-llama/Llama-2-7b
The PR was tested on an AMD Raphael 7600X, which supports the following flags by default:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |