Faster ssm scan #10558

Open

A3shTnT wants to merge 6 commits into master
Conversation

A3shTnT (Contributor) commented Nov 28, 2024

I wrote a faster ssm_scan compared with PR #9186. @jploski
Since there is no SSM CUDA file in the master branch, I copied the code from PR #9186; the only file that needs attention is ssm_scan.cu.
The debug and release builds have passed the CI tests; since I cannot access Hugging Face, I did not run the remaining part of the CI.
The following are the performance experiments:

perplexity

./llama-perplexity -m ~/program/mamba-130m-hf-f16.gguf -f ~/program/wikitext-2-raw/wiki.test.raw -ngl 99

my implementation:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =     931.91 ms
llama_perf_context_print: prompt eval time =  227378.97 ms / 286720 tokens (    0.79 ms per token,  1260.98 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  235974.70 ms / 286721 tokens

PR #9186:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =     859.59 ms
llama_perf_context_print: prompt eval time =  754629.16 ms / 286720 tokens (    2.63 ms per token,   379.95 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  770359.79 ms / 286721 tokens

cli

./llama-cli -m ~/program/mamba-130m-hf-f16.gguf -p "Once upon a time,"  --no-context-shift -n 128 -ngl 99

my implementation:

Once upon a time, when I was a child, and my grandmother was a young woman, a girl named Fannie, she had two children, and the one who was born after me was called Fannie. She was my first grandchild, and I was very fond of her. One day, I saw her little black eyes, and I said to her, "Fannie, you must be a very pretty girl. I have not seen you since I was a little girl." She said, "It is very strange, because I think I have seen a child of your age." I said, "Oh, you have been a very strange child,"

llama_perf_sampler_print:    sampling time =       6.25 ms /   133 runs   (    0.05 ms per token, 21276.60 tokens per second)
llama_perf_context_print:        load time =     831.55 ms
llama_perf_context_print: prompt eval time =      22.55 ms /     5 tokens (    4.51 ms per token,   221.73 tokens per second)
llama_perf_context_print:        eval time =    1140.36 ms /   127 runs   (    8.98 ms per token,   111.37 tokens per second)
llama_perf_context_print:       total time =    1186.22 ms /   132 tokens

PR #9186:

Once upon a time, there was a man who knew the secret of a great magic, and the secret of a great magic was to be born into the world, and the only way to make the magic was to use magic himself.

"The man who knew the secret of a great magic, and the secret of a great magic was to be born into the world, and the only way to make the magic was to use magic himself."

Now I'm getting into the mind-set of the "I think." And I think I've had this in my head since I was a kid. I've got it in my head that I'll

llama_perf_sampler_print:    sampling time =       7.16 ms /   133 runs   (    0.05 ms per token, 18575.42 tokens per second)
llama_perf_context_print:        load time =    1027.66 ms
llama_perf_context_print: prompt eval time =      35.86 ms /     5 tokens (    7.17 ms per token,   139.42 tokens per second)
llama_perf_context_print:        eval time =    1383.74 ms /   127 runs   (   10.90 ms per token,    91.78 tokens per second)
llama_perf_context_print:       total time =    1445.46 ms /   132 tokens

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 28, 2024
A3shTnT closed this Nov 28, 2024
A3shTnT reopened this Nov 28, 2024
jploski (Contributor) commented Dec 2, 2024

I tested this PR today. I can confirm (without understanding the implementation) that it is good.

The reformatting of the entire ssm_scan.cu is a bit unfortunate, as it makes the modifications vs. the latest version from #9186 hard to follow; but the source code seems to have changed in enough places to warrant it.

So, unless the author wishes additional review, I would advise disregarding PR #9186 and merging this PR instead, to finally get Mamba CUDA support into the official llama.cpp releases.

Comment on lines 2160 to 2165

        case GGML_OP_SSM_CONV:
            ggml_cuda_op_ssm_conv(ctx, dst);
            break;
        case GGML_OP_SSM_SCAN:
            ggml_cuda_op_ssm_scan(ctx, dst);
            break;

Owner

Suggested change:

        case GGML_OP_SSM_CONV:
            ggml_cuda_op_ssm_conv(ctx, dst);
            break;
        case GGML_OP_SSM_SCAN:
            ggml_cuda_op_ssm_scan(ctx, dst);
            break;

            return true;
        case GGML_OP_SSM_SCAN:
        case GGML_OP_SSM_CONV:
            return true;

Owner

Suggested change:

            return true;
A3shTnT (Contributor, Author) commented Dec 3, 2024

> I tested this PR today. I can confirm (without understanding the implementation) that it is good.
>
> The reformatting of the entire ssm_scan.cu is a bit unfortunate, as it makes the modifications vs. the latest version from #9186 hard to follow; but the source code seems to have changed in enough places to warrant it.
>
> So, unless the author wishes additional review, I would advise disregarding PR #9186 and merging this PR instead, to finally get Mamba CUDA support into the official llama.cpp releases.

I can provide some extra explanation of the CUDA code. The biggest performance improvement comes from further partitioning: d_inner is split across blocks, and the number of threads per block is increased to 128. In addition, since A is read repeatedly and the hidden state is read and written repeatedly, I placed them in shared memory to reduce off-chip memory traffic. (During inference L is usually 1, so this improvement is not significant; in the perplexity test L was 512, where it should give some improvement.)

As for the thread partitioning within a block, the index calculation in the code may look complicated, but in essence each thread computes one row of the partitioned hidden state. There may be further optimizations here, such as better memory coalescing or a better swizzle scheme that removes the extra padding column in shared memory; a minimal sketch of the partitioning idea follows below.
Could the d_inner partitioning also be done on the CPU? I am not familiar with the CPU implementation.
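To make the partitioning idea concrete, here is a minimal, hypothetical sketch of such a kernel; it is not the code from this PR. It assumes d_state == 16 and d_inner % 128 == 0, uses a simplified recurrence (the real ggml op includes additional details such as the softplus applied to dt), and all names and tensor layouts below are illustrative only.

```cuda
// Hypothetical sketch of the partitioning scheme described above (not the PR's
// actual kernel): one block per 128-row slice of d_inner, one thread per row,
// with A and the running hidden state cached in shared memory.
template <int d_state = 16, int threads = 128>
__global__ void ssm_scan_sketch(
        const float * __restrict__ s0,  // initial state, [d_inner, d_state]
        const float * __restrict__ x,   // input,         [L, d_inner]
        const float * __restrict__ dt,  // step sizes,    [L, d_inner]
        const float * __restrict__ A,   // decay matrix,  [d_inner, d_state]
        const float * __restrict__ B,   // input proj,    [L, d_state]
        const float * __restrict__ C,   // output proj,   [L, d_state]
        float       * __restrict__ y,   // output,        [L, d_inner]
        const int d_inner, const int L) {
    const int row = blockIdx.x * threads + threadIdx.x; // this thread's d_inner row

    // A is re-read and the hidden state is re-read/re-written at every timestep,
    // so keep the block's slice of both in shared memory; the +1 column of
    // padding avoids shared-memory bank conflicts.
    __shared__ float sA[threads][d_state + 1];
    __shared__ float sh[threads][d_state + 1];

    for (int n = 0; n < d_state; ++n) {
        sA[threadIdx.x][n] = A [row*d_state + n];
        sh[threadIdx.x][n] = s0[row*d_state + n];
    }

    // sequential scan over the L timesteps; only the per-timestep inputs
    // (x, dt, B, C) are streamed from global memory inside the loop
    for (int t = 0; t < L; ++t) {
        const float xt  = x [t*d_inner + row];
        const float dtt = dt[t*d_inner + row]; // simplified: no softplus here
        float yt = 0.0f;
        for (int n = 0; n < d_state; ++n) {
            const float h = expf(dtt * sA[threadIdx.x][n]) * sh[threadIdx.x][n]
                          + dtt * B[t*d_state + n] * xt;
            sh[threadIdx.x][n] = h;
            yt += C[t*d_state + n] * h;
        }
        y[t*d_inner + row] = yt;
    }
}
```

With this layout, a launch like `ssm_scan_sketch<16, 128><<<d_inner / 128, 128>>>(...)` gives each block one 128-row slice of d_inner, which is also where the `d_inner % 128 == 0` requirement mentioned below comes from.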

In addition, I used Nsight Compute to profile the perplexity test; the screenshot below shows a slice computed up to a certain point in time (there are too many kernels and I did not have time to run all of them). Most of the time is now spent in the matrix multiplication kernels, so the performance of the current scan kernel may be sufficient. (Or, on Ampere and Hopper architectures, could the matrix computation time be reduced further so that this kernel becomes a bottleneck again? I don't have those GPUs, so I'm not sure.)
[Nsight Compute profiling screenshot]

Finally, I would like to point out that this kernel is only guaranteed to be correct when d_inner % 128 == 0 && d_state == 16. Since I am not sure whether there are other cases, you could add extra safety checks to enforce this; or, if other sizes really need to be supported, please let me know the specific sizes and I will modify the code further.

slaren (Collaborator) commented Dec 3, 2024

> Finally, I would like to point out that this kernel is only guaranteed to be correct when d_inner % 128 == 0 && d_state == 16. Since I am not sure whether there are other cases, you could add extra safety checks to enforce this; or, if other sizes really need to be supported, please let me know the specific sizes and I will modify the code further.

This would be a good idea. Add GGML_ASSERT for any condition required by the kernel, and possibly to the supports_op function as well.
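As a rough illustration of what that could look like (a sketch only, not the final code; which src tensor and which ne[] entries hold d_state and d_inner are assumptions here), the launch wrapper could assert the shape requirements, and the same condition could be mirrored in the backend's supports_op handling so that unsupported shapes fall back to the CPU instead of asserting at runtime:

```cuda
#include "common.cuh"   // ggml-cuda common header: GGML_ASSERT, ggml_backend_cuda_context

// Sketch only: the tensor/dimension indices below are illustrative.
void ggml_cuda_op_ssm_scan(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
    const ggml_tensor * src0 = dst->src[0];   // assumed: the state tensor
    const int64_t d_state = src0->ne[0];
    const int64_t d_inner = src0->ne[1];

    GGML_ASSERT(d_state == 16);       // kernel is only correct for d_state == 16
    GGML_ASSERT(d_inner % 128 == 0);  // one block per 128 rows of d_inner

    // ... kernel launch as before ...
}

// ... and in the CUDA backend's supports_op switch, instead of returning true
// unconditionally for GGML_OP_SSM_SCAN:
//     case GGML_OP_SSM_SCAN:
//         return op->src[0]->ne[0] == 16 && op->src[0]->ne[1] % 128 == 0;
```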

A3shTnT (Contributor, Author) commented Dec 5, 2024

Faster ssm_conv has also been implemented. Should I open a new PR or just continue updating in this PR? @jploski

jploski (Contributor) commented Dec 5, 2024

> Faster ssm_conv has also been implemented. Should I open a new PR or just continue updating in this PR? @jploski

I think it makes sense to keep the entire "Mamba CUDA implementation" in a single PR. I believe the individual commits will be squashed together upon merging into master, but that's fine, as the history remains in the PR for reference. Note that I'm not a committer for llama.cpp, so this is just an opinion.

A3shTnT (Contributor, Author) commented Dec 5, 2024

Added a faster ssm_conv implementation.
Here are some performance experiments:
perplexity:

Final estimate: PPL = 22.6059 +/- 0.17895

llama_perf_context_print:        load time =    1146.53 ms
llama_perf_context_print: prompt eval time =  181629.44 ms / 286720 tokens (    0.63 ms per token,  1578.60 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  187387.06 ms / 286721 tokens

cli

Once upon a time, there was a princess named Alma, and the princesses were so fond of her that they would give her a new dress if she would let them do so. The princesses wanted to marry her but she said no, they would just have to get used to having her. Eventually they married her and they had a child named Alma. The princesses were so fond of her that they would give her a new dress if they would let them do so. The princesses wanted to marry her but she said no, they would just have to get used to having her. Eventually they married her and she had a child named Alma

llama_perf_sampler_print:    sampling time =       4.94 ms /   133 runs   (    0.04 ms per token, 26950.35 tokens per second)
llama_perf_context_print:        load time =     969.24 ms
llama_perf_context_print: prompt eval time =      21.84 ms /     5 tokens (    4.37 ms per token,   228.99 tokens per second)
llama_perf_context_print:        eval time =    1013.43 ms /   127 runs   (    7.98 ms per token,   125.32 tokens per second)
llama_perf_context_print:       total time =    1052.83 ms /   132 tokens
