Bug: CUDA illegal memory access related to KV/n_ctx padding and F16 DMMV #8798

Closed · cebtenzzre opened this issue on Jul 31, 2024 · 3 comments · Fixed by #8800
Labels: bug, high severity

cebtenzzre (Collaborator) commented on Jul 31, 2024:

I'm not sure why I can't reproduce this with llama-cli, but I can reproduce it with GPT4All after the merge of PR #7257, up to and including commit 398ede5 from today (the latest I've tried).

edit: I can also reproduce this on commit 952d03d from before the padding was increased, so the extra padding for FA seems to have been masking an older bug.

Diagnostic information is given for a fork based on commit 398ede5, but line numbers won't match exactly in ggml-cuda.cu due to some extra code added for device enumeration, which is required by GPT4All.

cc @slaren @JohannesGaessler

Steps to reproduce

  1. Load llama-2-7b.Q4_0.gguf fully offloaded to a single Tesla P40, with n_ctx=2016 (a multiple of 32 but not of 256), n_batch=2048, and n_ubatch=512. Flash attention is disabled.
  2. In chunks of 128 (the max batch size GPT4All uses in practice), decode 1990 tokens of input.
  3. Sample a token.
  4. Decode the sampled token with n_past=1990. At this point, CUDA hits an illegal memory access, which is reported in the next synchronize call (log below; a rough API-level sketch of steps 1-4 follows the Compute Sanitizer output):
pos=1989 13 '
'
sampling token n_past=1990
decode(n_past=1990, n_eval=1):
pos=1990 13 '
'
sampling token n_past=1991
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:2408
  cudaStreamSynchronize(cuda_ctx->stream())
/home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:102: CUDA error

This is the first error reported by Compute Sanitizer:

========= Invalid __global__ read of size 2 bytes
=========     at __half::operator float() const+0x358 in /opt/cuda/targets/x86_64-linux/include/cuda_fp16.hpp:136
=========     by thread (16,0,0) in block (127,0,0)
=========     Address 0x782d9d000000 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x782d5e000000 of size 1,056,964,608 bytes
=========     Device Frame:convert_f16(const void *, long, int, float2 &)+0x2c8 in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:421
=========     Device Frame:void dequantize_mul_mat_vec<(ggml_type)1>(const void *, const float *, float *, int, int)+0x2c8 in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:474
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2c914f]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame: [0x15803]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x75230]
=========                in /opt/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) in /opt/cuda/targets/x86_64-linux/include/cuda_runtime.h:216 [0x439658]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:__device_stub__Z22dequantize_mul_mat_vecIL9ggml_type1EEvPKvPKfPfii(void const*, float const*, float*, int, int) in /tmp/tmpxft_000138b8_00000000-6_dmmv.compute_61.cudafe1.stub.c:29 [0x438ef3]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:void __wrapper__device_stub_dequantize_mul_mat_vec<(ggml_type)1>(void const* restrict&, float const* restrict&, float* restrict&, int const&, int const&) in /tmp/tmpxft_000138b8_00000000-6_dmmv.compute_61.cudafe1.stub.c:30 [0x438f55]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:void dequantize_mul_mat_vec<(ggml_type)1>(void const*, float const*, float*, int, int) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:436 [0x439607]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:convert_mul_mat_vec_f16_cuda(void const*, float const*, float*, int, int, CUstream_st*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:595 [0x437683]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_cuda_op_dequantize_mul_mat_vec(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:662 [0x43793c]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:1618 [0x3f39da]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:1949 [0x3f58cc]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:2247 [0x3f6d5e]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda.cu:2608 [0x3f7b77]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_backend_graph_compute_async in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-backend.c:282 [0x3d539f]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_backend_sched_compute_splits in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-backend.c:1790 [0x3d8285]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:ggml_backend_sched_graph_compute_async in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-backend.c:1977 [0x3d847b]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:llama_graph_compute(llama_context&, ggml_cgraph*, int) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/src/llama.cpp:14504 [0x2c05c4]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:llama_decode_internal(llama_context&, llama_batch) in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/src/llama.cpp:14717 [0x2e0f0f]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:llama_decode in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/src/llama.cpp:18429 [0x2e1668]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:LLamaModel::evalTokens(LLModel::PromptContext&, std::vector<int, std::allocator<int> > const&) const in /home/jared/src/forks/gpt4all/gpt4all-backend/llamamodel.cpp:608 [0x2916cf]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllamamodel-mainline-cuda.so
=========     Host Frame:LLModel::generateResponse(std::function<bool (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::function<bool (bool)>, LLModel::PromptContext&) in /home/jared/src/forks/gpt4all/gpt4all-backend/llmodel_shared.cpp:263 [0xc3af8]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllmodel.so.0
=========     Host Frame:LLModel::prompt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<bool (int)>, std::function<bool (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::function<bool (bool)>, LLModel::PromptContext&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) in /home/jared/src/forks/gpt4all/gpt4all-backend/llmodel_shared.cpp:169 [0xc4c6a]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/libllmodel.so.0
=========     Host Frame:ChatLLM::promptInternal(QList<QString> const&, QString const&, QString const&, int, int, float, float, float, int, float, int) in /home/jared/src/forks/gpt4all/gpt4all-chat/chatllm.cpp:803 [0x88573]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/chat
=========     Host Frame:ChatLLM::prompt(QList<QString> const&, QString const&) in /home/jared/src/forks/gpt4all/gpt4all-chat/chatllm.cpp:745 [0x88988]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/chat
=========     Host Frame:QtPrivate::FunctorCall<QtPrivate::IndexesList<0, 1>, QtPrivate::List<QList<QString> const&, QString const&>, void, bool (ChatLLM::*)(QList<QString> const&, QString const&)>::call(bool (ChatLLM::*)(QList<QString> const&, QString const&), ChatLLM*, void**) [0x7e8c3]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/chat
=========     Host Frame:void QtPrivate::FunctionPointer<bool (ChatLLM::*)(QList<QString> const&, QString const&)>::call<QtPrivate::List<QList<QString> const&, QString const&>, void>(bool (ChatLLM::*)(QList<QString> const&, QString const&), ChatLLM*, void**) [0x7e8fa]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/chat
=========     Host Frame:QtPrivate::QCallableObject<bool (ChatLLM::*)(QList<QString> const&, QString const&), QtPrivate::List<QList<QString> const&, QString const&>, void>::impl(int, QtPrivate::QSlotObjectBase*, QObject*, void**, bool*) [0x7e951]
=========                in /home/jared/src/forks/gpt4all/gpt4all-chat/build/bin/chat
=========     Host Frame:QObject::event(QEvent*) in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qobject.cpp:1452 [0x18c00e]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:QCoreApplication::notifyInternal2(QObject*, QEvent*) in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qcoreapplication.cpp:1142 [0x144d27]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*) in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qcoreapplication.cpp:1940 [0x1450ea]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:postEventSourceDispatch in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qeventdispatcher_glib.cpp:244 [0x3a49eb]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:g_main_dispatch in ../glib/glib/gmain.c:3344 [0x5ca88]
=========                in /usr/lib/libglib-2.0.so.0
=========     Host Frame:g_main_context_iterate_unlocked.isra.0 in ../glib/glib/gmain.c:4217 [0xbe9b6]
=========                in /usr/lib/libglib-2.0.so.0
=========     Host Frame:g_main_context_iteration in ../glib/glib/gmain.c:4282 [0x5bf94]
=========                in /usr/lib/libglib-2.0.so.0
=========     Host Frame:QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qeventdispatcher_glib.cpp:394 [0x3a2cbc]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) in /usr/src/debug/qt6-base/qtbase/src/corelib/kernel/qeventloop.cpp:182 [0x14f01d]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:QThread::run() in /usr/src/debug/qt6-base/qtbase/src/corelib/thread/qthread.cpp:707 [0x23a55f]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:QThreadPrivate::start(void*) in /usr/src/debug/qt6-base/qtbase/src/corelib/thread/qthread_unix.cpp:285 [0x2c9746]
=========                in /usr/lib/libQt6Core.so.6
=========     Host Frame:start_thread in /usr/src/debug/glibc/glibc/nptl/pthread_create.c:447 [0x92dec]
=========                in /usr/lib/libc.so.6
=========     Host Frame:clone3 in ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 [0x1160db]
=========                in /usr/lib/libc.so.6

The full report from Compute Sanitizer is here.
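
For reference, the sequence above corresponds roughly to the following use of the llama.cpp C API. This is only a sketch of what GPT4All does (the token ids and the "sampled" token are placeholders, error handling is omitted, and the real code is structured differently), not the actual GPT4All source:

```cpp
// Rough sketch of the reproduction in terms of the llama.cpp C API at this commit.
// Model path and token ids are placeholders; GPT4All's actual code differs.
#include "llama.h"

#include <algorithm>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // fully offload to the single Tesla P40

    llama_model * model = llama_load_model_from_file("llama-2-7b.Q4_0.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 2016;  // multiple of 32, but not of 256
    cparams.n_batch    = 2048;
    cparams.n_ubatch   = 512;
    cparams.flash_attn = false;

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // 1990 tokens of input (placeholder token id), decoded in chunks of 128
    std::vector<llama_token> prompt(1990, 13);
    for (int i = 0; i < (int) prompt.size(); i += 128) {
        const int n_eval = std::min(128, (int) prompt.size() - i);
        llama_decode(ctx, llama_batch_get_one(prompt.data() + i, n_eval, i, 0));
    }

    // "sample" a token (placeholder), then decode it at n_past=1990;
    // the illegal memory access is reported in the next synchronize after this call
    llama_token tok = 13;
    llama_decode(ctx, llama_batch_get_one(&tok, 1, 1990, 0));

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```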

cebtenzzre added the "bug" and "high severity" labels on Jul 31, 2024
JohannesGaessler (Collaborator) commented:

> Device Frame:void dequantize_mul_mat_vec<(ggml_type)1>(const void *, const float *, float *, int, int)+0x2c8 in /home/jared/src/forks/gpt4all/gpt4all-backend/llama.cpp-mainline/ggml/src/ggml-cuda/dmmv.cu:474

This could be it; when using llama.cpp directly you would have to compile with GGML_CUDA_FORCE_DMMV to enter that code path.
Though honestly I want to remove DMMV anyway and replace it with a specialized FP16 (and maybe FP32) kernel.
For quantized matrices MMVQ should be universally better.

cebtenzzre (Collaborator, Author) commented on Jul 31, 2024:

> This could be it; when using llama.cpp directly you would have to compile with GGML_CUDA_FORCE_DMMV to enter that code path.

I am definitely not building with GGML_CUDA_FORCE_DMMV. Note that the type here is 1 (GGML_TYPE_F16), so MMVQ is not applicable.
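
To spell out the reasoning, here is a paraphrased sketch (simplified, not the exact ggml-cuda.cu dispatch logic): MMVQ only covers quantized src0 types, so an F16 K/V tensor multiplied by a single column ends up in dequantize_mul_mat_vec even without GGML_CUDA_FORCE_DMMV defined.

```cpp
// Paraphrased sketch of the kernel choice (not the actual ggml-cuda.cu dispatch logic).
#include "ggml.h"

#include <cstdio>

static const char * pick_mat_vec_kernel(enum ggml_type src0_type) {
    if (ggml_is_quantized(src0_type)) {
        return "mul_mat_vec_q (MMVQ)";
    }
    // non-quantized (F16) src0 with a single src1 column falls through to DMMV
    return "dequantize_mul_mat_vec (DMMV)";
}

int main(void) {
    std::printf("Q4_0 weights -> %s\n", pick_mat_vec_kernel(GGML_TYPE_Q4_0));
    std::printf("F16 KV cache -> %s\n", pick_mat_vec_kernel(GGML_TYPE_F16)); // the failing case here
    return 0;
}
```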

cebtenzzre changed the title from "Bug: CUDA illegal memory access after PR #7257" to "Bug: CUDA illegal memory access related to KV/n_ctx padding and F16 DMMV" on Jul 31, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this issue on Jul 31, 2024
slaren (Collaborator) commented on Jul 31, 2024:

I did notice in #8332 that DMMV for F16 has a minimum number of cols of GGML_CUDA_DMMV_X*2, since it processes two elements per iteration. However, I missed that it also requires the number of cols to be a multiple of GGML_CUDA_DMMV_X*2. I verified with test-backend-ops that it fails when ncols is not a multiple of 64, e.g.:

  MUL_MAT(type_a=f16,type_b=f32,m=92,n=1,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 91 (CUDA0=nan CPU=-2.490579) FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=12,n=1,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 11 (CUDA0=nan CPU=0.624335) FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=1,n=1,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.032905701 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=4,n=1,k=384,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=f16,type_b=f32,m=121,n=1,k=128,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=f16,type_b=f32,m=24,n=1,k=192,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=f16,type_b=f32,m=114,n=1,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.998501467 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=20,n=1,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.000022777 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=50,n=1,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 49 (CUDA0=nan CPU=-3.370481) FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=64,n=1,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.000294778 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=76,n=1,k=448,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=f16,type_b=f32,m=76,n=1,k=512,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=f16,type_b=f32,m=72,n=1,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.000190292 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=126,n=1,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 1.000010252 > 0.000500000 FAIL
  MUL_MAT(type_a=f16,type_b=f32,m=19,n=1,k=320,bs=[1,1],nr=[1,1]): OK
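
To illustrate the failure mode, below is a simplified sketch of the F16 access pattern (not the actual dmmv.cu kernel, which also does a warp reduction and handles the quantized types): each loop iteration covers 2*GGML_CUDA_DMMV_X columns and each thread dequantizes two adjacent half values, so when ncols is a multiple of GGML_CUDA_DMMV_X but not of 2*GGML_CUDA_DMMV_X, the last iteration reads past the end of the row.

```cpp
// Simplified illustration of the out-of-bounds access (a sketch, not the actual
// dmmv.cu kernel). Assumes the ggml default GGML_CUDA_DMMV_X = 32 and one warp
// of 32 threads per row.
#include <cuda_fp16.h>

#define DMMV_X 32

__global__ void f16_mat_vec_sketch(const half * x, const float * y, float * dst,
                                   const int ncols, const int nrows) {
    const int row = blockIdx.x*blockDim.y + threadIdx.y;
    const int tid = threadIdx.x; // 0..31

    if (row >= nrows) {
        return;
    }

    float sum = 0.0f;

    // Each iteration covers 2*DMMV_X = 64 columns; each of the 32 threads handles
    // 2 adjacent values. The loop condition only checks the iteration start, so with
    // ncols == 96 (a multiple of 32 but not of 64) the second iteration spans columns
    // 64..127 and threads with col >= 96 read past the end of the row.
    for (int i = 0; i < ncols; i += 2*DMMV_X) {
        const int col = i + 2*tid;
        sum += __half2float(x[row*ncols + col    ]) * y[col    ];
        sum += __half2float(x[row*ncols + col + 1]) * y[col + 1];
    }

    // warp reduction omitted for brevity; in this sketch only thread 0's partial sum
    // is written, so it only illustrates the indexing, not the full computation
    if (tid == 0) {
        dst[row] = sum;
    }
}
```

(Something like ./test-backend-ops test -o MUL_MAT -b CUDA0 should exercise these cases, though the exact invocation may differ between versions.)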
