Faster ssm scan #10558

Open

A3shTnT wants to merge 6 commits into master
Conversation

A3shTnT (Contributor) commented Nov 28, 2024

I wrote a faster ssm_scan compared with PR #9186. @jploski
Since there is no SSM CUDA file in the master branch, I copied the code from PR #9186; the only file that needs attention is ssm_scan.cu.
The debug and release builds have passed the CI tests; since I cannot access Hugging Face, I did not run the remaining part of the CI.
The following are the performance experiments:

perplexity

./llama-perplexity -m ~/program/mamba-130m-hf-f16.gguf -f ~/program/wikitext-2-raw/wiki.test.raw -ngl 99

my implementation:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =     931.91 ms
llama_perf_context_print: prompt eval time =  227378.97 ms / 286720 tokens (    0.79 ms per token,  1260.98 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  235974.70 ms / 286721 tokens

PR #9186:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =     859.59 ms
llama_perf_context_print: prompt eval time =  754629.16 ms / 286720 tokens (    2.63 ms per token,   379.95 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  770359.79 ms / 286721 tokens

cli

./llama-cli -m ~/program/mamba-130m-hf-f16.gguf -p "Once upon a time,"  --no-context-shift -n 128 -ngl 99

my implementation:

Once upon a time, when I was a child, and my grandmother was a young woman, a girl named Fannie, she had two children, and the one who was born after me was called Fannie. She was my first grandchild, and I was very fond of her. One day, I saw her little black eyes, and I said to her, "Fannie, you must be a very pretty girl. I have not seen you since I was a little girl." She said, "It is very strange, because I think I have seen a child of your age." I said, "Oh, you have been a very strange child,"

llama_perf_sampler_print:    sampling time =       6.25 ms /   133 runs   (    0.05 ms per token, 21276.60 tokens per second)
llama_perf_context_print:        load time =     831.55 ms
llama_perf_context_print: prompt eval time =      22.55 ms /     5 tokens (    4.51 ms per token,   221.73 tokens per second)
llama_perf_context_print:        eval time =    1140.36 ms /   127 runs   (    8.98 ms per token,   111.37 tokens per second)
llama_perf_context_print:       total time =    1186.22 ms /   132 tokens

PR #9186:

Once upon a time, there was a man who knew the secret of a great magic, and the secret of a great magic was to be born into the world, and the only way to make the magic was to use magic himself.

"The man who knew the secret of a great magic, and the secret of a great magic was to be born into the world, and the only way to make the magic was to use magic himself."

Now I'm getting into the mind-set of the "I think." And I think I've had this in my head since I was a kid. I've got it in my head that I'll

llama_perf_sampler_print:    sampling time =       7.16 ms /   133 runs   (    0.05 ms per token, 18575.42 tokens per second)
llama_perf_context_print:        load time =    1027.66 ms
llama_perf_context_print: prompt eval time =      35.86 ms /     5 tokens (    7.17 ms per token,   139.42 tokens per second)
llama_perf_context_print:        eval time =    1383.74 ms /   127 runs   (   10.90 ms per token,    91.78 tokens per second)
llama_perf_context_print:       total time =    1445.46 ms /   132 tokens

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 28, 2024
A3shTnT closed this Nov 28, 2024
A3shTnT reopened this Nov 28, 2024
jploski (Contributor) commented Dec 2, 2024

I tested this PR today. I can confirm (without understanding the implementation) that it is good.

The reformatting of the entire ssm_scan.cu is a bit unfortunate, as it makes the modifications vs. the latest version from #9186 hard to follow; but the source code seems to have changed in enough places to warrant it.

So, unless the author wishes additional review, I would advise disregarding PR #9186 and merging this PR instead, to finally get Mamba CUDA support into the official llama.cpp releases.

Comment on lines 2160 to 2165

        case GGML_OP_SSM_CONV:
            ggml_cuda_op_ssm_conv(ctx, dst);
            break;
        case GGML_OP_SSM_SCAN:
            ggml_cuda_op_ssm_scan(ctx, dst);
            break;

Owner

Suggested change:

        case GGML_OP_SSM_CONV:
            ggml_cuda_op_ssm_conv(ctx, dst);
            break;
        case GGML_OP_SSM_SCAN:
            ggml_cuda_op_ssm_scan(ctx, dst);
            break;

            return true;
        case GGML_OP_SSM_SCAN:
        case GGML_OP_SSM_CONV:
            return true;

Owner

Suggested change:

            return true;
A3shTnT (Contributor, Author) commented Dec 3, 2024

> I tested this PR today. I can confirm (without understanding the implementation) that it is good.
>
> The reformatting of the entire ssm_scan.cu is a bit unfortunate, as it makes the modifications vs. the latest version from #9186 hard to follow; but the source code seems to have changed in enough places to warrant it.
>
> So, unless the author wishes additional review, I would advise disregarding PR #9186 and merging this PR instead, to finally get Mamba CUDA support into the official llama.cpp releases.

I can provide some extra explanation of the CUDA code. The biggest performance improvement comes from further partitioning: d_inner is split across blocks, and the number of threads per block is increased to 128. In addition, since A is read repeatedly and the hidden state is read and written repeatedly, I placed them in shared memory to reduce off-chip memory traffic. (During inference L is usually 1, so this improvement is not significant; in the perplexity test L was 512, where it should give some improvement.)

As for the thread partitioning within a block, the index calculation in the code may look complicated, but in essence each thread computes one row of the partitioned hidden state. There may be further optimizations here, such as better memory coalescing or a better swizzle scheme that removes the extra padding column in shared memory; a minimal sketch of the partitioning idea follows below.
Could the d_inner partitioning also be done on the CPU? I am not familiar with the CPU implementation.
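To make the partitioning idea concrete, here is a minimal, hypothetical sketch of such a kernel; it is not the code from this PR. It assumes d_state == 16 and d_inner % 128 == 0, uses a simplified recurrence (the real ggml op includes additional details such as the softplus applied to dt), and all names and tensor layouts below are illustrative only.

```cuda
// Hypothetical sketch of the partitioning scheme described above (not the PR's
// actual kernel): one block per 128-row slice of d_inner, one thread per row,
// with A and the running hidden state cached in shared memory.
template <int d_state = 16, int threads = 128>
__global__ void ssm_scan_sketch(
        const float * __restrict__ s0,  // initial state, [d_inner, d_state]
        const float * __restrict__ x,   // input,         [L, d_inner]
        const float * __restrict__ dt,  // step sizes,    [L, d_inner]
        const float * __restrict__ A,   // decay matrix,  [d_inner, d_state]
        const float * __restrict__ B,   // input proj,    [L, d_state]
        const float * __restrict__ C,   // output proj,   [L, d_state]
        float       * __restrict__ y,   // output,        [L, d_inner]
        const int d_inner, const int L) {
    const int row = blockIdx.x * threads + threadIdx.x; // this thread's d_inner row

    // A is re-read and the hidden state is re-read/re-written at every timestep,
    // so keep the block's slice of both in shared memory; the +1 column of
    // padding avoids shared-memory bank conflicts.
    __shared__ float sA[threads][d_state + 1];
    __shared__ float sh[threads][d_state + 1];

    for (int n = 0; n < d_state; ++n) {
        sA[threadIdx.x][n] = A [row*d_state + n];
        sh[threadIdx.x][n] = s0[row*d_state + n];
    }

    // sequential scan over the L timesteps; only the per-timestep inputs
    // (x, dt, B, C) are streamed from global memory inside the loop
    for (int t = 0; t < L; ++t) {
        const float xt  = x [t*d_inner + row];
        const float dtt = dt[t*d_inner + row]; // simplified: no softplus here
        float yt = 0.0f;
        for (int n = 0; n < d_state; ++n) {
            const float h = expf(dtt * sA[threadIdx.x][n]) * sh[threadIdx.x][n]
                          + dtt * B[t*d_state + n] * xt;
            sh[threadIdx.x][n] = h;
            yt += C[t*d_state + n] * h;
        }
        y[t*d_inner + row] = yt;
    }
}
```

With this layout, a launch like `ssm_scan_sketch<16, 128><<<d_inner / 128, 128>>>(...)` gives each block one 128-row slice of d_inner, which is also where the `d_inner % 128 == 0` requirement mentioned below comes from.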

In addition, I used Nsight Compute to profile the perplexity test; the screenshot below shows a slice computed up to a certain point in time (there are too many kernels and I did not have time to run all of them). Most of the time is now spent in the matrix multiplication kernels, so the performance of the current scan kernel may be sufficient. (Or, on Ampere and Hopper architectures, could the matrix computation time be reduced further so that this kernel becomes a bottleneck again? I don't have those GPUs, so I'm not sure.)
[Nsight Compute profiling screenshot]

Finally, I would like to point out that this kernel is only guaranteed to be correct when d_inner % 128 == 0 && d_state == 16. Since I am not sure whether there are other cases, you could add extra safety checks to enforce this; or, if other sizes really need to be supported, please let me know the specific sizes and I will modify the code further.

slaren (Collaborator) commented Dec 3, 2024

> Finally, I would like to point out that this kernel is only guaranteed to be correct when d_inner % 128 == 0 && d_state == 16. Since I am not sure whether there are other cases, you could add extra safety checks to enforce this; or, if other sizes really need to be supported, please let me know the specific sizes and I will modify the code further.

This would be a good idea. Add GGML_ASSERT for any condition required by the kernel, and possibly to the supports_op function as well.
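As a rough illustration of what that could look like (a sketch only, not the final code; which src tensor and which ne[] entries hold d_state and d_inner are assumptions here), the launch wrapper could assert the shape requirements, and the same condition could be mirrored in the backend's supports_op handling so that unsupported shapes fall back to the CPU instead of asserting at runtime:

```cuda
#include "common.cuh"   // ggml-cuda common header: GGML_ASSERT, ggml_backend_cuda_context

// Sketch only: the tensor/dimension indices below are illustrative.
void ggml_cuda_op_ssm_scan(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    const ggml_tensor * src0 = dst->src[0];   // assumed: the state tensor
    const int64_t d_state = src0->ne[0];
    const int64_t d_inner = src0->ne[1];

    GGML_ASSERT(d_state == 16);       // kernel is only correct for d_state == 16
    GGML_ASSERT(d_inner % 128 == 0);  // one block per 128 rows of d_inner

    // ... kernel launch as before ...
}

// ... and in the CUDA backend's supports_op switch, instead of returning true
// unconditionally for GGML_OP_SSM_SCAN:
//     case GGML_OP_SSM_SCAN:
//         return op->src[0]->ne[0] == 16 && op->src[0]->ne[1] % 128 == 0;
```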

A3shTnT (Contributor, Author) commented Dec 5, 2024

Faster ssm_conv has also been implemented. Should I open a new PR or just continue updating in this PR? @jploski

jploski (Contributor) commented Dec 5, 2024

> Faster ssm_conv has also been implemented. Should I open a new PR or just continue updating in this PR? @jploski

I think it makes sense to keep the entire "Mamba CUDA implementation" in a single PR. I believe the individual commits will be squashed together upon merging into master, but that's fine, as the history remains in the PR for reference. Note that I'm not a committer for llama.cpp, so this is just an opinion.

A3shTnT (Contributor, Author) commented Dec 5, 2024

Added a faster ssm_conv implementation.
Here are some performance experiments:
perplexity:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =    1146.53 ms
llama_perf_context_print: prompt eval time =  181629.44 ms / 286720 tokens (    0.63 ms per token,  1578.60 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  187387.06 ms / 286721 tokens

cli

Once upon a time, there was a princess named Alma, and the princesses were so fond of her that they would give her a new dress if she would let them do so. The princesses wanted to marry her but she said no, they would just have to get used to having her. Eventually they married her and they had a child named Alma. The princesses were so fond of her that they would give her a new dress if they would let them do so. The princesses wanted to marry her but she said no, they would just have to get used to having her. Eventually they married her and she had a child named Alma

llama_perf_sampler_print:    sampling time =       4.94 ms /   133 runs   (    0.04 ms per token, 26950.35 tokens per second)
llama_perf_context_print:        load time =     969.24 ms
llama_perf_context_print: prompt eval time =      21.84 ms /     5 tokens (    4.37 ms per token,   228.99 tokens per second)
llama_perf_context_print:        eval time =    1013.43 ms /   127 runs   (    7.98 ms per token,   125.32 tokens per second)
llama_perf_context_print:       total time =    1052.83 ms /   132 tokens
