Only enable sgemm for prompt processing, not for inference #9330
While sgemm/tinyblas was designed to speed up prompt processing using tiled matrix multiplications, llama.cpp also calls it for inference as a 1x1 computation. Personally I think it makes more sense for us to use our dedicated `ggml_vec_dot` functions for the inference dot products and leave sgemm for prompt processing only. That way we can optimize each path for its respective purpose. See my PR #8049 for an example where sgemm has faster prompt processing while `ggml_vec_dot` has faster inference.
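
To illustrate the dispatch idea (this is not the actual diff, just a minimal self-contained sketch; `vec_dot_f32`, `tiled_sgemm`, and `mul_mat` are stand-in names, not ggml's real API): route to a plain dot-product loop when the second operand has a single column (single-token decode), and to a tiled GEMM when it has many columns (a prompt batch).

```cpp
#include <algorithm>
#include <cstddef>

// Stand-in for ggml_vec_dot_f32: plain dot product of two vectors.
static float vec_dot_f32(size_t n, const float * x, const float * y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// Stand-in for the sgemm/tinyblas path: naive blocked C = A * B
// (A is m x k row-major, B is k x n column-major, C is m x n column-major).
static void tiled_sgemm(size_t m, size_t n, size_t k,
                        const float * A, const float * B, float * C) {
    constexpr size_t BLK = 32;
    for (size_t j0 = 0; j0 < n; j0 += BLK)
    for (size_t i0 = 0; i0 < m; i0 += BLK)
    for (size_t j = j0; j < std::min(j0 + BLK, n); ++j)
    for (size_t i = i0; i < std::min(i0 + BLK, m); ++i)
        C[j*m + i] = vec_dot_f32(k, &A[i*k], &B[j*k]);
}

// Dispatch: n == 1 is a single-token matrix-vector product (inference),
// so use the dot-product path; n > 1 is a prompt batch, so use tiled GEMM.
static void mul_mat(size_t m, size_t n, size_t k,
                    const float * A, const float * B, float * C) {
    if (n == 1) {
        for (size_t i = 0; i < m; ++i)
            C[i] = vec_dot_f32(k, &A[i*k], B);
    } else {
        tiled_sgemm(m, n, k, A, B, C);
    }
}
```

Keeping the two paths separate means the vec_dot kernels can stay tuned for low-latency single-row products while the tiled GEMM is tuned for throughput on large batches.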