(New Feature) Enabling Flash Attention in llama.cpp for faster inference #13
ymcui announced in Announcements
llama.cpp introduced Flash Attention in commit ggerganov/llama.cpp@9c67c2773d4b706cf71d70ecf4aa180b62501960, which can further speed up inference.
How to enable: simply add the `-fa` flag when running `./main`; a minimal sketch follows.
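For reference, here is a hedged example invocation. The model path, prompt, and the `-n`/`-ngl` values are placeholders for illustration and are not from the original post; only the `-fa` flag is the point of this feature.

```bash
# Hypothetical example: run inference with Flash Attention enabled.
# -m    path to a GGUF model (placeholder path)
# -n    number of tokens to generate
# -ngl  offload layers to the GPU (Metal on Apple Silicon)
# -fa   enable Flash Attention (requires the commit above or newer)
./main -m models/llama-3-chinese-8b-instruct-q8_0.gguf \
       -p "你好,请介绍一下你自己。" \
       -n 128 -ngl 99 -fa
```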
Below are test results for Llama-3-Chinese-8B-Instruct on an Apple M3 Max; the last column is the speed (tokens/s, higher is faster).
Without Flash Attention:
With Flash Attention:
build: a68a1e7e (2772)
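A comparison like the one above can be reproduced with `llama-bench`, which also gained the flag in the same commit; the sketch below assumes it accepts a comma-separated sweep for `-fa`, and the model path is a placeholder.

```bash
# Hypothetical llama-bench sweep: benchmark with Flash Attention off and on
# in a single run (the -fa column in the output distinguishes the two).
./llama-bench -m models/llama-3-chinese-8b-instruct-q8_0.gguf -fa 0,1
```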
No significant difference was observed in PPL tests. Below are the Q8_0 results.
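For completeness, such a PPL check can be run with llama.cpp's `perplexity` tool; the sketch below assumes a local evaluation text file, and both the file name and model path are placeholders. Comparing the same command with and without `-fa` verifies that Flash Attention does not change perplexity.

```bash
# Hypothetical perplexity run with Flash Attention enabled; rerun without
# -fa and compare the reported PPL values.
./perplexity -m models/llama-3-chinese-8b-instruct-q8_0.gguf \
             -f wiki.test.raw -ngl 99 -fa
```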