(New Feature) Enabling Flash Attention in llama.cpp for faster inference #13
ymcui announced in Announcements
llama.cpp introduced Flash Attention in commit ggerganov/llama.cpp@9c67c2773d4b706cf71d70ecf4aa180b62501960, which can further speed up inference.
How to enable: simply add the `-fa` flag when running `./main`; a minimal sketch follows.
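For reference, here is a hedged example invocation. The model path, prompt, and the `-n`/`-ngl` values are placeholders for illustration and are not from the original post; only the `-fa` flag is the point of this feature.

```bash
# Hypothetical example: run inference with Flash Attention enabled.
# -m    path to a GGUF model (placeholder path)
# -n    number of tokens to generate
# -ngl  offload layers to the GPU (Metal on Apple Silicon)
# -fa   enable Flash Attention (requires the commit above or newer)
./main -m models/llama-3-chinese-8b-instruct-q8_0.gguf \
       -p "你好,请介绍一下你自己。" \
       -n 128 -ngl 99 -fa
```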
Below are test results for Llama-3-Chinese-8B-Instruct on an Apple M3 Max; the last column is the speed (tokens/s, higher is faster).
Without Flash Attention:
With Flash Attention:
build: a68a1e7e (2772)
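A comparison like the one above can be reproduced with `llama-bench`, which also gained the flag in the same commit; the sketch below assumes it accepts a comma-separated sweep for `-fa`, and the model path is a placeholder.

```bash
# Hypothetical llama-bench sweep: benchmark with Flash Attention off and on
# in a single run (the -fa column in the output distinguishes the two).
./llama-bench -m models/llama-3-chinese-8b-instruct-q8_0.gguf -fa 0,1
```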
No significant difference was observed in PPL tests. Below are the Q8_0 results.
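For completeness, such a PPL check can be run with llama.cpp's `perplexity` tool; the sketch below assumes a local evaluation text file, and both the file name and model path are placeholders. Comparing the same command with and without `-fa` verifies that Flash Attention does not change perplexity.

```bash
# Hypothetical perplexity run with Flash Attention enabled; rerun without
# -fa and compare the reported PPL values.
./perplexity -m models/llama-3-chinese-8b-instruct-q8_0.gguf \
             -f wiki.test.raw -ngl 99 -fa
```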