Bug: Gemma 2 slower with FA #9243

Closed
Azirine opened this issue Aug 29, 2024 · 4 comments
Labels
Apple Metal, bug-unconfirmed, medium severity, stale

Comments


Azirine commented Aug 29, 2024

What happened?

Gemma 2 is slower with FA on Apple Silicon (M3 Max).

Name and Version

version: 3642 (1d1ccce)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0

What operating system are you seeing the problem on?

Mac

Relevant log output

| model                          |       size |     params | backend    | ngl | fa | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | ------------: | ---------------: |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  0 |    0 |         pp512 |   2360.42 ± 3.71 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  0 |    0 |          tg64 |     85.54 ± 0.05 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  1 |    0 |         pp512 |   1487.45 ± 3.27 |
| gemma2 2B Q8_0                 |   3.17 GiB |     3.20 B | Metal      |  99 |  1 |    0 |          tg64 |     50.99 ± 0.17 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  0 |    0 |         pp512 |    608.84 ± 0.96 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  0 |    0 |          tg64 |     30.29 ± 0.04 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  1 |    0 |         pp512 |   397.25 ± 23.27 |
| gemma2 9B Q8_0                 |  10.05 GiB |    10.16 B | Metal      |  99 |  1 |    0 |          tg64 |     21.33 ± 0.01 |

build: 1d1ccce6 (3642)
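
To put a number on the slowdown, the throughput with FA can be divided by the throughput without it for each row pair. This is only a small sketch: the figures are copied from the table above, and the names `results` and `fa_ratio` are made up for illustration.

```python
# Throughput in tokens/s copied from the llama-bench table above.
# Key: (model, test); value: (fa=0, fa=1).
results = {
    ("gemma2 2B Q8_0", "pp512"): (2360.42, 1487.45),
    ("gemma2 2B Q8_0", "tg64"): (85.54, 50.99),
    ("gemma2 9B Q8_0", "pp512"): (608.84, 397.25),
    ("gemma2 9B Q8_0", "tg64"): (30.29, 21.33),
}


def fa_ratio(no_fa: float, with_fa: float) -> float:
    """Return FA throughput as a fraction of non-FA throughput."""
    return with_fa / no_fa


ratios = {key: fa_ratio(*pair) for key, pair in results.items()}

for (model, test), ratio in ratios.items():
    slowdown_pct = (1 - ratio) * 100
    print(f"{model:15s} {test:6s} FA/no-FA = {ratio:.3f} ({slowdown_pct:.1f}% slower)")
```

With these numbers, FA lands at roughly 60 to 70% of the non-FA throughput in all four rows.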
Azirine added the bug-unconfirmed and medium severity labels Aug 29, 2024
JohannesGaessler added the Apple Metal label Aug 29, 2024
Owner

ggerganov commented Aug 30, 2024

For head size = 256, which is the case for Gemma2-2B, the Metal flash attention kernel is slow (#7261), so it is disabled (#7556). As a result, the attention operation runs on the CPU.
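
To re-check this on a newer build, the comparison from the table can be repeated with flash attention toggled in a single run. The sketch below only assembles the command line; it assumes `llama-bench` is on `PATH`, the model path is hypothetical, and the flag values are inferred from the table columns (ngl 99, mmap 0, pp512, tg64) rather than taken from the original invocation.

```python
import shlex


def build_bench_cmd(model_path, fa_values=(0, 1), n_prompt=512, n_gen=64,
                    ngl=99, mmap=0):
    """Build a llama-bench argument list mirroring the settings in the table."""
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", str(ngl),                            # offload all layers to the GPU
        "-fa", ",".join(str(v) for v in fa_values),  # run once per FA setting
        "-mmp", str(mmap),                           # mmap off, as in the table
        "-p", str(n_prompt),                         # prompt processing test (pp512)
        "-n", str(n_gen),                            # text generation test (tg64)
    ]


# Hypothetical model path; substitute the GGUF file actually being tested.
cmd = build_bench_cmd("models/gemma-2-2b-Q8_0.gguf")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` as-is if running it from Python is preferred over pasting the printed line into a shell.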

@Dampfinchen

This also happens on CUDA. Benchmarks are in #8542 (comment).

@sais-github

When I use -ctk and -ctv, the context is processed by the CPU instead of the GPU. I'd say that's what is causing the slowdown with CUDA 😸

github-actions bot added the stale label Oct 24, 2024
Contributor

github-actions bot commented Nov 8, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Nov 8, 2024
5 participants