metal : gemma2 flash attention support #9159

slaren · 2024-08-24T20:29:22Z

Performance looks unchanged despite the new parameter, but I have only tested this with test-backend-ops.

ggml-ci

ggerganov · 2024-08-26T08:47:05Z

The following patch seems to fix the issue from #8542 on my Mac:

diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index ab2de69c..aba0b9a0 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -2149,8 +2149,8 @@ kernel void kernel_flash_attn_ext_f16(
                     ss[8*cc + ty*TF + 2*tx + 1] *= scale;
 
                     if (logit_softcap != 0.0f) {
-                        ss[8*cc + ty*TF + 2*tx + 0] = logit_softcap*tanh(ss[8*cc + ty*TF + 2*tx + 0]);
-                        ss[8*cc + ty*TF + 2*tx + 1] = logit_softcap*tanh(ss[8*cc + ty*TF + 2*tx + 1]);
+                        ss[8*cc + ty*TF + 2*tx + 0] = logit_softcap*precise::tanh(ss[8*cc + ty*TF + 2*tx + 0]);
+                        ss[8*cc + ty*TF + 2*tx + 1] = logit_softcap*precise::tanh(ss[8*cc + ty*TF + 2*tx + 1]);
                     }
 
                     if (mask != q) {
@@ -2490,7 +2490,7 @@ kernel void kernel_flash_attn_ext_vec_f16(
                         mqk *= scale;
 
                         if (logit_softcap != 0.0f) {
-                            mqk = logit_softcap*tanh(mqk);
+                            mqk = logit_softcap*precise::tanh(mqk);
                         }
 
                         mqk += (mask != q) ? ((float4) mp4[ic/4 + cc])*slope : (float4) 0.0f;
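
For readers unfamiliar with the step being patched, here is a minimal host-side C++ sketch of the scale-then-softcap operation visible in the diff. It is not code from ggml: the helper name `scale_and_softcap` and the values in the usage note are hypothetical, and `std::tanh` only stands in for Metal's `precise::tanh`. The sketch shows only the arithmetic; it does not reproduce the difference between the default and precise Metal variants.

```cpp
#include <cmath>
#include <vector>

// Hypothetical CPU-side helper (not part of ggml): applies the same
// scale-then-softcap step that the diff shows inside the Metal kernels.
// As in the kernels, logit_softcap == 0.0f means softcapping is disabled.
static void scale_and_softcap(std::vector<float> & scores, float scale, float logit_softcap) {
    for (float & s : scores) {
        s *= scale;
        if (logit_softcap != 0.0f) {
            // tanh saturates to +/-1 for large |s|, so the capped score
            // stays within [-logit_softcap, +logit_softcap].
            s = logit_softcap * std::tanh(s);
        }
    }
}
```

Calling `scale_and_softcap(scores, 1.0f, 50.0f)` on arbitrarily large inputs leaves every score within ±50, while passing `0.0f` as the cap applies only the scaling.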

slaren · 2024-08-26T08:51:36Z

Yep, also fixes it for me.

ggerganov

I don't observe any performance changes either; should be good to merge.

metal : gemma2 flash attention support

054203a

ggml-ci

github-actions bot added the testing Everything test related label Aug 24, 2024

slaren mentioned this pull request Aug 25, 2024

CPU/CUDA: Gemma 2 FlashAttention support #8542

Merged

slaren marked this pull request as draft August 25, 2024 20:16

use precise::tanh

edc2e27

slaren marked this pull request as ready for review August 26, 2024 08:51

ggerganov approved these changes Aug 26, 2024

slaren merged commit 0c41e03 into master Aug 26, 2024
49 of 52 checks passed

slaren deleted the sl/metal-logit-softcap branch August 26, 2024 09:09

ggerganov mentioned this pull request Aug 26, 2024

metal : fix fa kernel #9187

Closed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

metal : gemma2 flash attention support (ggerganov#9159)

e9f40aa

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

metal : gemma2 flash attention support (ggerganov#9159)

a7f67f0
