
cuda : ggml_mul_mat assert for padded src1 #673

Draft · wants to merge 1 commit into master
Conversation

ggerganov
Owner

Currently, the padded matrix multiplications in whisper.cpp are silently failing with CUDA:

// TODO: check if other platforms can benefit from this optimization
// TODO: CUDA is currently broken - seems ggml_mul_mat does not handle views correctly
#if defined(GGML_USE_METAL)
#define ggml_mul_mat ggml_mul_mat_pad
#endif

The reason is that the to_fp16_cuda and to_fp32_cuda calls assume no padding of the data. We can either assert that the data is not padded, or over-allocate a buffer accounting for the padding. The latter produces correct results, but is sub-optimal.
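
For illustration, a minimal sketch of what the two options could look like around the src1 conversion (names follow ggml conventions, but this is not the actual ggml-cuda.cu code):

```c
#include "ggml.h"

// Sketch only: sizing/validating the fp16 staging buffer for src1.
static size_t src1_staging_size(const struct ggml_tensor * src1) {
    // Option 1: refuse padded (non-contiguous) inputs outright.
    //GGML_ASSERT(ggml_is_contiguous(src1));

    // Option 2: size the buffer from the padded row stride nb[1] (in bytes),
    // so the to_fp16/to_fp32 kernels cannot run past the allocation.
    const size_t padded_row_elems = src1->nb[1] / ggml_type_size(src1->type);
    return padded_row_elems * src1->ne[1] * src1->ne[2] * src1->ne[3] * sizeof(ggml_fp16_t);
}
```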

Drafting this PR to brainstorm some potential solutions.

@ggerganov ggerganov requested a review from slaren December 29, 2023 08:43
@slaren
Collaborator

slaren commented Dec 29, 2023

Is there any benefit to using ggml_mul_mat_pad with CUDA (and especially with cuBLAS)? This seems to me like it should be an implementation detail of the matrix multiplication algorithm.

@ggerganov
Owner Author

Can't test ggml_mul_mat_pad with CUDA because it would need some extra changes to to_fp16_cuda. Likely there will be no benefit in this case because cuBLAS will probably make the necessary optimizations internally. For Metal it helps because that mat-mat kernel requires dim 0 to be a multiple of 32.
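
(For reference, the constraint boils down to rounding dim 0 up to the next multiple of 32 before the copy; a trivial sketch:)

```c
#include <stdint.h>

// Illustrative only: the rounding that a padded mat-mul needs for Metal.
static int64_t round_up_32(int64_t ne0) {
    return ((ne0 + 31) / 32) * 32;
}
```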

But regardless, the issue remains if users try to perform matrix multiplications with padded views
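
For concreteness, a hypothetical sketch of how such a padded view can arise (the context setup and sizes here are made up):

```c
#include "ggml.h"

// Hypothetical repro: multiply with a padded (non-contiguous) view of src1.
// The view keeps the row stride of the larger buffer, so nb[1] is larger than
// ne[0]*type_size, which is the layout the CUDA conversions currently mishandle.
static struct ggml_tensor * padded_mul_mat_example(struct ggml_context * ctx) {
    struct ggml_tensor * a    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * bbuf = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 96, 64); // over-sized buffer
    // first 64 elements of each row of bbuf; the row stride stays at 96 floats
    struct ggml_tensor * b    = ggml_view_2d(ctx, bbuf, 64, 64, bbuf->nb[1], 0);
    return ggml_mul_mat(ctx, a, b); // silently produces wrong results on CUDA today
}
```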

@slaren
Collaborator

slaren commented Dec 29, 2023

I think that in general there is very poor support for any kind of non-contiguous tensors in ggml-cuda. We should definitely add more checks and improve support, but I am afraid that this is just the tip of the iceberg. We could extend test-backend-ops to automatically run tests with non-contiguous tensors for all the operations, so that we could have a better overall picture of what needs to be fixed.
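
A rough idea of what such a test input could look like (hypothetical helper, not the actual test-backend-ops structure):

```c
#include "ggml.h"

// Hypothetical helper: allocate a wider buffer and return a view of it, so the
// result is non-contiguous (rows are padded). Each op under test would get such
// inputs and its output would be compared against the CPU backend.
static struct ggml_tensor * make_noncontig_2d(struct ggml_context * ctx,
                                              enum ggml_type type,
                                              int64_t ne0, int64_t ne1) {
    struct ggml_tensor * big = ggml_new_tensor_2d(ctx, type, ne0 + 8, ne1);
    return ggml_view_2d(ctx, big, ne0, ne1, big->nb[1], 0);
}
```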

@slaren
Collaborator

slaren commented Dec 29, 2023

At the moment I am working on moving llama.cpp fully to ggml-backend with ggml_backend_sched for partial offloading, and bringing everything up to feature parity with the old CUDA API. After that, we can remove a lot of code, globals, and generally do a lot of refactoring of the CUDA backend.

@ggerganov
Owner Author

Sounds great - no rush on this padding problem

@FSSRepo
Collaborator

FSSRepo commented Dec 29, 2023

I think the Metal backend could benefit from expanded matrix multiplication support, which would significantly improve performance in some cases, such as the one described in this comment.
