IQ4_NL sgemm + Q4_0 AVX optimization #9422
Merged
This PR contains two changes. The first is essentially a copy of my shelved #8049 (IQ4_NL sgemm), which I had planned to resubmit after #9330 got merged. IQ4_NL is basically Q4_0 with a lookup table, so sgemm can be easily ported over to that quant.
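For anyone unfamiliar with the two formats, here is a rough plain-C sketch of the difference (illustrative only; the table values and names below are placeholders, not copied from ggml). Q4_0 reconstructs a weight linearly from its 4-bit index, while IQ4_NL pushes the index through a small non-linear codebook, so the sgemm tiling logic stays the same and only the dequantization step changes:

```c
// Conceptual sketch only: why IQ4_NL maps onto the existing Q4_0 sgemm path.
// Both formats store 4-bit indices plus a per-block scale; they differ only
// in how an index turns into a value.
#include <stdint.h>

// hypothetical 16-entry non-linear codebook (placeholder values)
static const int8_t iq4nl_values[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

static inline float dequant_q4_0(uint8_t nibble, float d) {
    return ((int8_t)nibble - 8) * d;   // linear: index -> value
}

static inline float dequant_iq4_nl(uint8_t nibble, float d) {
    return iq4nl_values[nibble] * d;   // non-linear: index -> table -> value
}
```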
I was able to test on an AVX2 machine this time, so I've enabled this change for both AVX and AVX2. AVX2 is much faster due to #8908.
AVX2 (35% prompt processing improvement):
AVX (10% prompt processing improvement):
As our tests don't cover sgemm, I ran a 10-chunk Wikitext perplexity run with an IQ4_NL model and the numbers were within 0.2%. I also ran through some sample prompts and the model responded properly. If needed I can run with more chunks, but it's going to take forever on my slow computer.
The second change makes the Q4_0 ggml_vec_dot function compute two blocks at a time for regular AVX, just as is done for IQ4_NL. This makes inference 7% faster.
From my testing, this technique only helps Q4_0 and does nothing for Q8_0, which currently calculates one block at a time. I think the eight loads (and hence eight registers) required to hold two Q8_0 blocks add too much overhead.
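As a rough illustration of the dual-block idea, here is a plain-C sketch (not the actual AVX intrinsics from this PR; the struct names and layout are simplified). Processing two block pairs per iteration keeps two independent accumulators live and amortizes the per-iteration loop and scaling overhead:

```c
// Plain-C sketch of processing two Q4_0 x Q8_0 block pairs per iteration.
// Simplified layout: 32 weights per block, scale stored as float here
// (ggml stores it as fp16). Illustrative only, not the PR's intrinsics code.
#include <stdint.h>

#define QK 32

typedef struct { float d; uint8_t qs[QK/2]; } blk_q4;  // packed nibbles
typedef struct { float d; int8_t  qs[QK];   } blk_q8;  // int8 quants

float vec_dot_two_blocks(int nb, const blk_q4 *x, const blk_q8 *y) {
    float sumf = 0.0f;
    // assume nb is even for this sketch; real code would handle the remainder
    for (int i = 0; i < nb; i += 2) {
        int32_t s0 = 0, s1 = 0;
        for (int j = 0; j < QK/2; ++j) {
            // block i: low nibble pairs with y[j], high nibble with y[j+16]
            s0 += ((x[i+0].qs[j] & 0x0F) - 8) * y[i+0].qs[j]
                + ((x[i+0].qs[j] >>   4) - 8) * y[i+0].qs[j + QK/2];
            // block i+1: interleaved with block i so a SIMD version can keep
            // both accumulators in registers and halve the loop overhead
            s1 += ((x[i+1].qs[j] & 0x0F) - 8) * y[i+1].qs[j]
                + ((x[i+1].qs[j] >>   4) - 8) * y[i+1].qs[j + QK/2];
        }
        sumf += s0 * x[i+0].d * y[i+0].d + s1 * x[i+1].d * y[i+1].d;
    }
    return sumf;
}
```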
test-quantize-fns and test-backend-ops pass for this PR.

P.S. I saw that F16C was used in #8908 and wanted to see if it helped inference as well, so I modified the IQ4_NL ggml_vec_dot function to convert and multiply the scales for four blocks at a time. Sadly, that had no visible performance impact, so I removed it from this PR; my code can be found in a201c6b if anyone wants to experiment with it.
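For reference, a minimal sketch of the F16C idea (illustrative only; the function name is made up and the real experiment lives in a201c6b). _mm_cvtph_ps converts four fp16 scales to fp32 in a single instruction, which is what the removed code tried to exploit:

```c
// Sketch of converting four fp16 block scales to fp32 at once with F16C.
// Compile with -mf16c (gcc/clang). Illustrative only.
#include <immintrin.h>
#include <stdint.h>

// d0..d3 are the raw fp16 bit patterns of four consecutive block scales
static inline __m128 scales_to_fp32(uint16_t d0, uint16_t d1,
                                    uint16_t d2, uint16_t d3) {
    // pack the four fp16 values into the low 64 bits of an __m128i
    const __m128i packed =
        _mm_setr_epi16((short)d0, (short)d1, (short)d2, (short)d3, 0, 0, 0, 0);
    // F16C: convert the four half-precision values to single precision
    return _mm_cvtph_ps(packed);
}
```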