IQ4_NL sgemm + Q4_0 AVX optimization #9422
Merged
This PR contains two changes. The first is essentially a copy of my shelved #8049 (IQ4_NL sgemm), which I had planned to resubmit after #9330 got merged. IQ4_NL is basically Q4_0 with a lookup table, so sgemm can be easily ported over to that quant.
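For anyone unfamiliar with the two formats, here is a rough plain-C sketch of the difference (illustrative only; the table values and names below are placeholders, not copied from ggml). Q4_0 reconstructs a weight linearly from its 4-bit index, while IQ4_NL pushes the index through a small non-linear codebook, so the sgemm tiling logic stays the same and only the dequantization step changes:

```c
// Conceptual sketch only: why IQ4_NL maps onto the existing Q4_0 sgemm path.
// Both formats store 4-bit indices plus a per-block scale; they differ only
// in how an index turns into a value.
#include <stdint.h>

// hypothetical 16-entry non-linear codebook (placeholder values)
static const int8_t iq4nl_values[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

static inline float dequant_q4_0(uint8_t nibble, float d) {
    return ((int8_t)nibble - 8) * d;   // linear: index -> value
}

static inline float dequant_iq4_nl(uint8_t nibble, float d) {
    return iq4nl_values[nibble] * d;   // non-linear: index -> table -> value
}
```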
I was able to test on an AVX2 machine this time, so I've enabled this change for both AVX and AVX2. AVX2 is much faster due to #8908.
AVX2 (35% prompt processing improvement):
AVX (10% prompt processing improvement):
As our tests don't cover sgemm, I ran a 10-chunk Wikitext perplexity run with an IQ4_NL model and the numbers were within 0.2%. I also ran through some sample prompts and the model responded properly. If needed I can run with more chunks, but it's going to take forever on my slow computer.
The second change makes the Q4_0 ggml_vec_dot function compute two blocks at a time for regular AVX, just as is done for IQ4_NL. This makes inference 7% faster.
From my testing, this technique only helps Q4_0 and does nothing for Q8_0, which currently calculates one block at a time. I think the eight loads (and hence eight registers) required to hold two Q8_0 blocks add too much overhead.
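As a rough illustration of the dual-block idea, here is a plain-C sketch (not the actual AVX intrinsics from this PR; the struct names and layout are simplified). Processing two block pairs per iteration keeps two independent accumulators live and amortizes the per-iteration loop and scaling overhead:

```c
// Plain-C sketch of processing two Q4_0 x Q8_0 block pairs per iteration.
// Simplified layout: 32 weights per block, scale stored as float here
// (ggml stores it as fp16). Illustrative only, not the PR's intrinsics code.
#include <stdint.h>

#define QK 32

typedef struct { float d; uint8_t qs[QK/2]; } blk_q4;  // packed nibbles
typedef struct { float d; int8_t  qs[QK];   } blk_q8;  // int8 quants

float vec_dot_two_blocks(int nb, const blk_q4 *x, const blk_q8 *y) {
    float sumf = 0.0f;
    // assume nb is even for this sketch; real code would handle the remainder
    for (int i = 0; i < nb; i += 2) {
        int32_t s0 = 0, s1 = 0;
        for (int j = 0; j < QK/2; ++j) {
            // block i: low nibble pairs with y[j], high nibble with y[j+16]
            s0 += ((x[i+0].qs[j] & 0x0F) - 8) * y[i+0].qs[j]
                + ((x[i+0].qs[j] >>   4) - 8) * y[i+0].qs[j + QK/2];
            // block i+1: interleaved with block i so a SIMD version can keep
            // both accumulators in registers and halve the loop overhead
            s1 += ((x[i+1].qs[j] & 0x0F) - 8) * y[i+1].qs[j]
                + ((x[i+1].qs[j] >>   4) - 8) * y[i+1].qs[j + QK/2];
        }
        sumf += s0 * x[i+0].d * y[i+0].d + s1 * x[i+1].d * y[i+1].d;
    }
    return sumf;
}
```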
test-quantize-fns and test-backend-ops pass for this PR.

P.S. I saw that F16C was used in #8908 and wanted to see if it helped inference as well, so I modified the IQ4_NL ggml_vec_dot function to convert and multiply the scales for four blocks at a time. Sadly, that had no visible performance impact, so I removed it from this PR; my code can be found in a201c6b if anyone wants to experiment with it.
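For reference, a minimal sketch of the F16C idea (illustrative only; the function name is made up and the real experiment lives in a201c6b). _mm_cvtph_ps converts four fp16 scales to fp32 in a single instruction, which is what the removed code tried to exploit:

```c
// Sketch of converting four fp16 block scales to fp32 at once with F16C.
// Compile with -mf16c (gcc/clang). Illustrative only.
#include <immintrin.h>
#include <stdint.h>

// d0..d3 are the raw fp16 bit patterns of four consecutive block scales
static inline __m128 scales_to_fp32(uint16_t d0, uint16_t d1,
                                    uint16_t d2, uint16_t d3) {
    // pack the four fp16 values into the low 64 bits of an __m128i
    const __m128i packed =
        _mm_setr_epi16((short)d0, (short)d1, (short)d2, (short)d3, 0, 0, 0, 0);
    // F16C: convert the four half-precision values to single precision
    return _mm_cvtph_ps(packed);
}
```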