Bug: CUDA illegal memory access related to KV/n_ctx padding and F16 DMMV #8798
Comments
This could be it: when using llama.cpp directly you would have to compile with GGML_CUDA_FORCE_DMMV.
I am definitely not building with GGML_CUDA_FORCE_DMMV. Note that …
I did notice in #8332 that dmmv for f16 has a minimum number of cols of …
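For context, the sketch below shows why an f16 DMMV-style kernel ends up with a minimum/alignment requirement on the column count at all. This is not the actual ggml-cuda.cu kernel; the one-warp-per-row layout, the half2 loads, and every name here are illustrative assumptions.

```cuda
// A minimal sketch (NOT the actual ggml-cuda.cu kernel) of an f16
// dequantize-mul-mat-vec (DMMV) style matrix-vector product. It assumes one
// 32-thread warp per row and half2 loads, i.e. 2*32 = 64 columns consumed per
// loop iteration, so ncols must be padded to a multiple of 64 for the loads
// to stay in bounds. All names and widths are illustrative assumptions.
#include <cuda_fp16.h>

__global__ void dmmv_f16_sketch(const half * x, const float * y, float * dst,
                                const int ncols /* assumed multiple of 2*warpSize */) {
    const int row = blockIdx.x;   // one row of x per block
    const int tid = threadIdx.x;  // blockDim.x == warpSize (32) in this sketch

    float sum = 0.0f;
    for (int col0 = 0; col0 < ncols; col0 += 2*warpSize) {
        const int col = col0 + 2*tid;
        // Vectorized 4-byte load of two f16 values. If ncols (here: the KV
        // sequence length) is not padded to the per-iteration width, the last
        // iteration reads past the end of x, which is the kind of
        // out-of-bounds read Compute Sanitizer reports as an illegal access.
        const half2 xi = *(const half2 *) &x[(size_t) row*ncols + col];
        sum += __low2float(xi)  * y[col + 0];
        sum += __high2float(xi) * y[col + 1];
    }

    // Reduce the per-thread partial sums across the warp.
    for (int offset = warpSize/2; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xFFFFFFFF, sum, offset);
    }
    if (tid == 0) {
        dst[row] = sum;
    }
}
```

A launch like `dmmv_f16_sketch<<<nrows, 32>>>(x, y, dst, ncols)` covers one row per warp; the point is only that the column loop's granularity is what turns an unpadded n_ctx into out-of-bounds reads.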
I'm not sure why I can't reproduce this with llama-cli, but I can reproduce it with GPT4All after the merge of PR #7257, up to and including commit 398ede5 from today (the latest I've tried).
edit: I can also reproduce this on commit 952d03d from before the padding was increased, so the extra padding for FA seems to have been masking an older bug.
Diagnostic information is given for a fork based on commit 398ede5, but line numbers won't match exactly in ggml-cuda.cu due to some extra code added for device enumeration, which is required by GPT4All.
cc @slaren @JohannesGaessler
Steps to reproduce
Model: llama-2-7b.Q4_0.gguf, fully offloaded to a single Tesla P40, with n_ctx=2016 (a multiple of 32 but not 256), n_batch=2048, and n_ubatch=512. Flash attention is disabled.

This is the first error reported by Compute Sanitizer:
The full report from Compute Sanitizer is here.
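For reference on why 2016 matters, here is a rough host-side sketch of how a requested n_ctx maps onto the padded KV-cache size. It is not the exact llama.cpp code; the 256-with-FA / 32-without values are assumptions inferred from the description above (2016 is a multiple of 32 but not of 256).

```cuda
// A rough host-side sketch (not the exact llama.cpp code) of how a requested
// n_ctx maps to the padded KV-cache size, assuming the padding rule implied
// by the report: pad to 256 when flash attention is enabled, to 32 otherwise.
#include <cstdint>
#include <cstdio>

// Round x up to a multiple of n (ggml has a GGML_PAD macro for this purpose).
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

int main() {
    const uint32_t n_ctx_requested = 2016;   // value used in this repro
    const bool     flash_attn      = false;  // FA is disabled in the repro

    const uint32_t padding = flash_attn ? 256u : 32u;
    const uint32_t n_ctx   = pad_to(n_ctx_requested, padding);

    // With FA off, 2016 is already a multiple of 32, so the KV cache stays at
    // 2016 cells; with FA on it would be rounded up to 2048, which would hide
    // any kernel that silently assumes a larger alignment.
    printf("n_ctx: %u -> %u (padding %u)\n", n_ctx_requested, n_ctx, padding);
    return 0;
}
```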