Only enable sgemm for prompt processing, not for inference #9330
While sgemm/tinyblas was designed to speed up prompt processing using tiled matrix multiplications, llama.cpp also calls it for inference as a 1x1 computation. Personally I think it makes more sense for us to use our dedicated `ggml_vec_dot` functions for the inference dot products and leave sgemm for prompt processing only. That way we can optimize each path for its respective purpose. See my PR #8049 for an example where sgemm has faster prompt processing while `ggml_vec_dot` has faster inference.
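
To illustrate the dispatch idea (this is not the actual diff, just a minimal self-contained sketch; `vec_dot_f32`, `tiled_sgemm`, and `mul_mat` are stand-in names, not ggml's real API): route to a plain dot-product loop when the second operand has a single column (single-token decode), and to a tiled GEMM when it has many columns (a prompt batch).

```cpp
#include <algorithm>
#include <cstddef>

// Stand-in for ggml_vec_dot_f32: plain dot product of two vectors.
static float vec_dot_f32(size_t n, const float * x, const float * y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// Stand-in for the sgemm/tinyblas path: naive blocked C = A * B
// (A is m x k row-major, B is k x n column-major, C is m x n column-major).
static void tiled_sgemm(size_t m, size_t n, size_t k,
                        const float * A, const float * B, float * C) {
    constexpr size_t BLK = 32;
    for (size_t j0 = 0; j0 < n; j0 += BLK)
    for (size_t i0 = 0; i0 < m; i0 += BLK)
    for (size_t j = j0; j < std::min(j0 + BLK, n); ++j)
    for (size_t i = i0; i < std::min(i0 + BLK, m); ++i)
        C[j*m + i] = vec_dot_f32(k, &A[i*k], &B[j*k]);
}

// Dispatch: n == 1 is a single-token matrix-vector product (inference),
// so use the dot-product path; n > 1 is a prompt batch, so use tiled GEMM.
static void mul_mat(size_t m, size_t n, size_t k,
                    const float * A, const float * B, float * C) {
    if (n == 1) {
        for (size_t i = 0; i < m; ++i)
            C[i] = vec_dot_f32(k, &A[i*k], B);
    } else {
        tiled_sgemm(m, n, k, A, B, C);
    }
}
```

Keeping the two paths separate means the vec_dot kernels can stay tuned for low-latency single-row products while the tiled GEMM is tuned for throughput on large batches.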