Faster ssm scan #10558
base: master
Conversation
I tested this PR today. I can confirm (without understanding the implementation) that it is good.

The reformatting of the entire ssm_scan.cu is a bit unfortunate, as it makes the modifications versus the latest version from #9186 hard to follow; but the source code seems to have changed in sufficiently many places to warrant it. So, unless the author wishes additional review, I would advise disregarding PR #9186 and merging this PR instead, to finally get Mamba CUDA support into the official llama.cpp releases.
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)
```cpp
case GGML_OP_SSM_CONV:
    ggml_cuda_op_ssm_conv(ctx, dst);
    break;
case GGML_OP_SSM_SCAN:
    ggml_cuda_op_ssm_scan(ctx, dst);
    break;
```
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)
```cpp
    return true;
case GGML_OP_SSM_SCAN:
case GGML_OP_SSM_CONV:
    return true;
```
I can provide some extra explanation of the CUDA code. The biggest performance improvement comes from further partitioning: d_inner is divided across different blocks, and the number of threads per block is increased to 128. In addition, since A is repeatedly read and the hidden state is repeatedly read and written, I placed them in shared memory to reduce off-chip memory traffic. (During inference L is usually 1, so this improvement is not significant there; in the perplexity test L was 512, where it may help more.)

As for the thread partitioning within a block, the index calculations in the code may look quite complicated, but in fact each thread computes one row of the partitioned hidden state. There may be further optimization opportunities here, such as better memory coalescing or a better swizzle scheme to remove the extra padding column in shared memory.

I also used Nsight Compute to profile the perplexity test; the image shows a slice computed up to a certain point in time (there are too many kernels, and I don't have the time to run all of them). Most of the time is now spent in matrix multiplication, so the performance of the current scan kernel may be sufficient. (Or, on Ampere and Hopper architectures, the matrix-computation time may be reduced enough that this kernel becomes a bottleneck again? I don't have those cards, so I'm not sure.)

Finally, note that this kernel only guarantees correctness when d_inner % 128 == 0 && d_state == 16 holds. Since I am not sure whether other cases occur, you can add extra safety checks to enforce this; if other sizes really do need to be supported, please let me know the specific dimensions, and the code will need further modification.
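For readers following the code, below is a minimal sketch of the partitioning scheme described above — not the PR's actual kernel. The recurrence it implements is the usual SSM scan, h_t = exp(dt·A) ∘ h_(t-1) + (dt·x_t)·B_t with y_t = C_t·h_t. Assumptions not taken from the PR: a single sequence, contiguous row-major layouts, softplus already applied to dt, and the shared-memory padding column omitted for clarity.

```cuda
#include <cuda_runtime.h>

constexpr int kDState  = 16;   // d_state, fixed by this kernel
constexpr int kThreads = 128;  // rows of d_inner handled per block

// One block owns a 128-row slice of d_inner; each thread owns one row.
// A (128 x 16) and the running hidden state (128 x 16) are staged in
// shared memory, so the sequential loop over L re-reads and re-writes
// them on-chip instead of in global memory.
__global__ void ssm_scan_sketch(
        const float* __restrict__ s0,  // initial state, [d_inner, d_state]
        const float* __restrict__ x,   // inputs,        [L, d_inner]
        const float* __restrict__ dt,  // step sizes,    [L, d_inner]
        const float* __restrict__ A,   // decay matrix,  [d_inner, d_state]
        const float* __restrict__ B,   // input proj,    [L, d_state]
        const float* __restrict__ C,   // output proj,   [L, d_state]
        float* __restrict__ y,         // outputs,       [L, d_inner]
        float* __restrict__ s,         // final state,   [d_inner, d_state]
        const int d_inner, const int L) {
    const int row = blockIdx.x * kThreads + threadIdx.x;  // row of d_inner

    // No __syncthreads() is needed: each thread touches only its own
    // shared-memory row, so shared memory serves purely as fast scratch.
    __shared__ float sA[kThreads][kDState];
    __shared__ float sh[kThreads][kDState];

    for (int j = 0; j < kDState; ++j) {
        sA[threadIdx.x][j] = A [row*kDState + j];
        sh[threadIdx.x][j] = s0[row*kDState + j];
    }

    for (int t = 0; t < L; ++t) {
        const float dt_t = dt[t*d_inner + row];
        const float x_dt = x [t*d_inner + row] * dt_t;
        float sumf = 0.0f;
        #pragma unroll
        for (int j = 0; j < kDState; ++j) {
            const float h = sh[threadIdx.x][j] * expf(dt_t * sA[threadIdx.x][j])
                          + B[t*kDState + j] * x_dt;
            sh[threadIdx.x][j] = h;        // h_t = exp(dt*A)*h_{t-1} + dt*B_t*x_t
            sumf += h * C[t*kDState + j];  // y_t = C_t . h_t
        }
        y[t*d_inner + row] = sumf;
    }

    for (int j = 0; j < kDState; ++j) {
        s[row*kDState + j] = sh[threadIdx.x][j];
    }
}
```

A launch would use d_inner / 128 blocks of 128 threads (e.g. `ssm_scan_sketch<<<d_inner/kThreads, kThreads>>>(...)`), which is exactly where the d_inner % 128 == 0 requirement comes from.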
This would be a good idea. Add
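As a purely hypothetical illustration of such a check (not code from this PR), the guard could be factored out as below; the assumption that the state tensor carries ne[0] == d_state and ne[1] == d_inner follows ggml_ssm_scan's argument layout:

```cpp
#include <cstdint>

// Hypothetical shape guard for the CUDA SSM scan kernel: the kernel is
// only guaranteed correct for d_state == 16 and d_inner divisible by 128.
static bool ssm_scan_shape_supported(int64_t d_state, int64_t d_inner) {
    return d_state == 16 && d_inner % 128 == 0;
}
```

In the backend's supports_op check, this would be fed from the state tensor's dimensions before claiming GGML_OP_SSM_SCAN support.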
A faster ssm_conv has also been implemented. Should I open a new PR, or just continue updating this one? @jploski
I think it makes sense to keep the entire "Mamba CUDA implementation" in a single PR. I believe the individual commits will be squashed together upon merging into master, but that's fine, as the history remains in the PR as a reference. Note that I'm not a committer for llama.cpp, so this is just an opinion.
Add faster ssm conv implementation.

cli: [benchmark screenshot]
I wrote a faster ssm_scan compared with PR #9186. @jploski
Since there is no SSM CUDA file in the master branch, I copied the PR #9186 code; the only file that needs attention is ssm_scan.cu.
The debug and release builds have passed the CI tests; since I cannot access Hugging Face, I didn't run the remaining parts of CI.
The following are the performance experiments:

perplexity
- my implementation: [benchmark screenshot]
- PR #9186: [benchmark screenshot]

cli
- my implementation: [benchmark screenshot]
- PR #9186: [benchmark screenshot]