Bug: Gemma 2 slower with FA #9243

Azirine · 2024-08-29T16:39:59Z

What happened?

Gemma 2 is slower with FA on Apple Silicon (M3 Max).

Name and Version

version: 3642 (1d1ccce)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0

What operating system are you seeing the problem on?

Mac

Relevant log output

| model                          |       size |     params | backend    | ngl | fa | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | ------------: | ---------------: |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  0 |    0 |         pp512 |   2360.42 ± 3.71 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  0 |    0 |          tg64 |     85.54 ± 0.05 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  1 |    0 |         pp512 |   1487.45 ± 3.27 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  1 |    0 |          tg64 |     50.99 ± 0.17 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  0 |    0 |         pp512 |    608.84 ± 0.96 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  0 |    0 |          tg64 |     30.29 ± 0.04 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  1 |    0 |         pp512 |   397.25 ± 23.27 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  1 |    0 |          tg64 |     21.33 ± 0.01 |

build: 1d1ccce6 (3642)
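
To put a number on the regression, the relative slowdown can be computed directly from the table above. This is a minimal sketch: the throughput values are copied from the llama-bench output, while the names `results`, `slowdown_percent`, and `slowdowns` are made up for illustration and are not part of llama.cpp.

```python
# Throughput in tokens/s, copied from the llama-bench table above.
# Key: (model, test) -> (t/s with fa=0, t/s with fa=1)
results = {
    ("gemma2 2B Q8_0", "pp512"): (2360.42, 1487.45),
    ("gemma2 2B Q8_0", "tg64"): (85.54, 50.99),
    ("gemma2 9B Q8_0", "pp512"): (608.84, 397.25),
    ("gemma2 9B Q8_0", "tg64"): (30.29, 21.33),
}


def slowdown_percent(fa_off: float, fa_on: float) -> float:
    """Relative throughput loss (in percent) when flash attention is enabled."""
    return (1.0 - fa_on / fa_off) * 100.0


# Hypothetical helper dict: slowdown per (model, test) pair.
slowdowns = {key: slowdown_percent(*tps) for key, tps in results.items()}

for (model, test), pct in slowdowns.items():
    print(f"{model:16s} {test:6s} {pct:5.1f}% slower with fa=1")
```

For these measurements, throughput with `fa=1` is between about 30% and 40% lower than with `fa=0`, for both prompt processing and token generation.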

ggerganov · 2024-08-30T08:54:08Z

For head size = 256, which is the case for Gemma2-2B, the Metal flash attention kernel is slow (#7261), so it is disabled (#7556). As a result, the attention operation runs on the CPU.
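
The fallback described here can be pictured with a toy model. This is only a sketch under stated assumptions, not llama.cpp's actual code: the function names `gpu_supports_flash_attn` and `assign_backend` are hypothetical, and treating every head size other than 256 as supported is a simplification.

```python
# Toy model of backend fallback; NOT llama.cpp's real scheduler code.
# Assumption from the comment above: the Metal FA kernel is disabled
# for head size 256. All other head sizes are treated as supported
# here, which is a simplification for illustration.
DISABLED_HEAD_SIZES = {256}


def gpu_supports_flash_attn(head_size: int) -> bool:
    """Hypothetical check: does the GPU backend accept the FA op?"""
    return head_size not in DISABLED_HEAD_SIZES


def assign_backend(head_size: int, flash_attn: bool) -> str:
    """Return where the attention op would run in this toy model."""
    if flash_attn and not gpu_supports_flash_attn(head_size):
        # The GPU backend rejects the op, so it falls back to the CPU.
        return "CPU"
    return "GPU"


for hs in (128, 256):
    print(f"head size {hs}: fa=0 -> {assign_backend(hs, False)}, "
          f"fa=1 -> {assign_backend(hs, True)}")
```

In this model, enabling FA with head size 256 moves attention off the GPU, which is consistent with the slowdown in the benchmark above.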

Dampfinchen · 2024-09-03T21:30:09Z

This also happens on CUDA. Benchmarks are in #8542 (comment).

sais-github · 2024-09-23T14:25:19Z

When I use -ctk and -ctv, the context is processed by the CPU instead of the GPU. I'd say that's what is causing the slowdown with CUDA 😸

github-actions · 2024-11-08T01:07:24Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

Azirine added the labels bug-unconfirmed and medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable) Aug 29, 2024

JohannesGaessler added the Apple Metal label (https://en.wikipedia.org/wiki/Metal_(API)) Aug 29, 2024

github-actions bot added the stale label Oct 24, 2024

github-actions bot closed this as completed Nov 8, 2024
