
Only enable sgemm for prompt processing, not for inference #9330

Merged 1 commit into ggerganov:master from netrunnereve:sgemm_pp on Sep 7, 2024

Conversation

netrunnereve (Collaborator)

While sgemm/tinyblas was designed to speed up prompt processing with tiled matrix multiplications, llama.cpp also calls it for inference as a 1x1 computation. I think it makes more sense for us to use our dedicated ggml_vec_dot functions for the inference dot products and leave sgemm for prompt processing only, so that each path can be optimized for its own purpose.

See my PR #8049 for an example where sgemm has faster prompt processing while ggml_vec_dot has faster inference.
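For reference, here is a minimal, self-contained sketch of the dispatch idea being discussed. It is not the actual llama.cpp change; `mat_mul_tiled`, `mat_vec_dot`, and `mul_mat_dispatch` are illustrative stand-ins for the sgemm/tinyblas and ggml_vec_dot paths. The point is simply that the tiled kernel is used only when the activation matrix has more than one column (prompt processing), while a single-column multiply (token-by-token inference) falls back to plain per-row dot products.

```cpp
// Sketch only: illustrates gating a tiled sgemm-style kernel on the number of
// columns in src1, so the single-column inference case uses dot products instead.
#include <algorithm>
#include <cstdio>
#include <vector>

// dst[r, c] = sum_k src0[r, k] * src1[k, c]; src0 is rows x k, src1 is k x cols.
static void mat_mul_tiled(const std::vector<float> & src0, const std::vector<float> & src1,
                          std::vector<float> & dst, int rows, int k, int cols) {
    const int T = 4; // tile size, illustrative only
    for (int r0 = 0; r0 < rows; r0 += T)
        for (int c0 = 0; c0 < cols; c0 += T)
            for (int r = r0; r < std::min(r0 + T, rows); ++r)
                for (int c = c0; c < std::min(c0 + T, cols); ++c) {
                    float sum = 0.0f;
                    for (int i = 0; i < k; ++i) sum += src0[r*k + i] * src1[i*cols + c];
                    dst[r*cols + c] = sum;
                }
}

// One dot product per output row: the shape used during inference (cols == 1).
static void mat_vec_dot(const std::vector<float> & src0, const std::vector<float> & src1,
                        std::vector<float> & dst, int rows, int k) {
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int i = 0; i < k; ++i) sum += src0[r*k + i] * src1[i];
        dst[r] = sum;
    }
}

static void mul_mat_dispatch(const std::vector<float> & src0, const std::vector<float> & src1,
                             std::vector<float> & dst, int rows, int k, int cols) {
    if (cols > 1) mat_mul_tiled(src0, src1, dst, rows, k, cols); // prompt processing
    else          mat_vec_dot(src0, src1, dst, rows, k);         // inference
}

int main() {
    const int rows = 3, k = 2, cols = 1; // cols == 1 mimics single-token inference
    std::vector<float> src0 = {1, 2, 3, 4, 5, 6}, src1 = {1, 1}, dst(rows * cols);
    mul_mat_dispatch(src0, src1, dst, rows, k, cols);
    for (float v : dst) printf("%g\n", v); // prints 3, 7, 11
}
```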

@ggerganov ggerganov merged commit e536426 into ggerganov:master Sep 7, 2024
52 checks passed
@netrunnereve netrunnereve deleted the sgemm_pp branch September 8, 2024 01:03
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
3 participants