[backend](cuda): faster uncontiguous concat #10760
Conversation
Check NVIDIA NSight Systems and look at the runtime for different kernel configurations. It may be that the variant you looked at with NSight Compute is faster but another one has become slower (though I wouldn't know why).
ggml/src/ggml-cuda/concat.cu
Outdated
default:
    break;
Suggested change:
default:
    GGML_ABORT("fatal error");
    break;
ggml/src/ggml-cuda/concat.cu
Outdated
if (i0 < ne00 && i1 < ne01 && i2 < ne02 && i3 < ne03) {
    x = (const float *)(src0 + (i3       )*nb03 + (i2       )*nb02 + (i1       )*nb01 + (i0       )*nb00);
} else {
    x = (const float *)(src1 + (i3 - o[3])*nb13 + (i2 - o[2])*nb12 + (i1 - o[1])*nb11 + (i0 - o[0])*nb10);
}
if /*constexpr*/ (dim == 0) {
Please add a static_assert for dim.
Also note that if constexpr should be available now that we are using C++17, no need to keep it commented.
Thank you for the advice, I removed the comments. I also found that the number and content of the instructions generated by the compiler did not change with or without constexpr; perhaps nvcc already optimizes for this on its own.
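For reference, a minimal sketch of the pattern being discussed, using hypothetical helper names rather than the actual code in ggml/src/ggml-cuda/concat.cu: the concat dimension is a template parameter, validated with a static_assert and branched on with if constexpr, so each instantiation compiles only its own path.

```cuda
#include <cstdint>

// Hypothetical sketch, not the kernel from this PR: decide whether the element at
// (i0, i1, i2, i3) belongs to src1, with the concat dimension known at compile time.
// The static_assert rejects invalid dims; if constexpr keeps only the branch for the
// instantiated dim.
template <int dim>
__device__ __forceinline__ bool is_src1_element(
        int64_t i0, int64_t i1, int64_t i2, int64_t i3,
        int64_t ne00, int64_t ne01, int64_t ne02, int64_t ne03) {
    static_assert(dim >= 0 && dim <= 3, "dim must be in [0, 3]");
    if constexpr (dim == 0) {
        return i0 >= ne00;  // past the end of src0 along dim 0
    } else if constexpr (dim == 1) {
        return i1 >= ne01;
    } else if constexpr (dim == 2) {
        return i2 >= ne02;
    } else {
        return i3 >= ne03;
    }
}
```

With this setup an invalid dim fails at compile time instead of falling through to a runtime default case.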
Co-authored-by: Diego Devesa <[email protected]>
I'll give it a try.
Since this computer cannot run a model as large as Deepseek, I used the existing MAMBA model, but the problem should be quite similar.
Kernel execution
my implementation: (screenshot)
PR #10558 (it can be considered as master, because its changes are not related to concat): (screenshot)
Runtime function
PR #10558: (screenshot)
> Since this computer cannot run a model as large as Deepseek, I used the existing MAMBA model, but the problem should be quite similar.
You can profile the code from the command line with something like nsys profile ./llama-bench, then copy the resulting file to your desktop where you can analyze it. But these numbers already look very good, so I would be fine with merging this as-is.
The profile information was obtained through Nsight Systems.
* faster uncontiguous concat
* Use a lambda to avoid code duplication
Co-authored-by: Diego Devesa <[email protected]>
* Update ggml/src/ggml-cuda/concat.cu
* add constexpr and static assert
--------
Co-authored-by: Diego Devesa <[email protected]>
Faster implementation of uncontiguous concat on CUDA.
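As context for the numbers below, here is a hypothetical sketch of what a stride-based non-contiguous concat kernel can look like (illustrative names, concat dimension fixed to 1, dst assumed contiguous; this is not the exact kernel added in this PR): each thread unravels its linear index into 4D coordinates, picks src0 or src1 by comparing the coordinate along the concat dimension, and addresses the source through its byte strides.

```cuda
#include <cstdint>

// Hypothetical sketch (not the PR's actual kernel): non-contiguous concat along
// dim 1, with dst assumed contiguous for simplicity. src0/src1 are addressed
// through their byte strides (nb*), so arbitrary layouts are supported.
static __global__ void concat_non_cont_dim1_sketch(
        const char * src0, const char * src1, float * dst,
        int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3,       // dst extents
        int64_t ne01,                                             // src0 extent along dim 1
        int64_t nb00, int64_t nb01, int64_t nb02, int64_t nb03,   // src0 byte strides
        int64_t nb10, int64_t nb11, int64_t nb12, int64_t nb13) { // src1 byte strides
    const int64_t i = blockIdx.x*(int64_t)blockDim.x + threadIdx.x;
    if (i >= ne0*ne1*ne2*ne3) {
        return;
    }

    // unravel the linear index into the 4D coordinates of dst
    const int64_t i3 =  i / (ne0*ne1*ne2);
    const int64_t i2 = (i / (ne0*ne1)) % ne2;
    const int64_t i1 = (i /  ne0)      % ne1;
    const int64_t i0 =  i              % ne0;

    const float * x;
    if (i1 < ne01) {
        // element comes from src0
        x = (const float *)(src0 + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
    } else {
        // element comes from src1, shifted back by src0's extent along dim 1
        x = (const float *)(src1 + i3*nb13 + i2*nb12 + (i1 - ne01)*nb11 + i0*nb10);
    }
    dst[i] = *x;
}
```

The PR additionally makes the concat dimension a compile-time template parameter, which is what the review comments above about static_assert and if constexpr refer to.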
Performance experiment:
test-backend-ops:
test case
test_cases.emplace_back(new test_concat(GGML_TYPE_F32, {512, 1024, 5, 1}, 1024, 0, 1));
my implementation: (screenshot)
master: (screenshot)
llama-bench:
command:
~/program/forked/llama.cpp/build/bin$ ./llama-bench -m ~/program/forked/DeepSeek-V2-Lite-Chat.IQ1_S.gguf
the first uncontiguous concat kernel:
my implementation: (screenshot)
master: (screenshot)
the overall result:
my implementation: (screenshot)
master: (screenshot)