llama-bench : use random tokens to improve accuracy with mixtral #6069

slaren · 2024-03-14T23:15:31Z

llama-bench currently does not produce accurate results with mixtral because it uses the same token for the entire prompt (bos). This results in the same experts being chosen repeatedly, which is not what happens during real usage. With this change llama-bench uses random tokens instead.

Current llama-bench results in master:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	0	pp 512	189.62 ± 0.97
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	0	pp 1024	182.17 ± 0.54
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	99	pp 512	613.36 ± 0.81
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	99	pp 1024	607.84 ± 0.48

build: 4755afd (2431)

Using main with a large representative prompt (extracted from the frankenstein book text) produces these values instead:

With -ngl 0:

llama_print_timings: prompt eval time =    8695.65 ms /   512 tokens (   16.98 ms per token,    58.88 tokens per second)
llama_print_timings: prompt eval time =   17340.24 ms /  1024 tokens (   16.93 ms per token,    59.05 tokens per second)

With -ngl 99:

llama_print_timings: prompt eval time =    1411.63 ms /   512 tokens (    2.76 ms per token,   362.70 tokens per second)
llama_print_timings: prompt eval time =    2811.67 ms /  1024 tokens (    2.75 ms per token,   364.20 tokens per second)

llama-bench after this PR:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	0	pp 512	61.87 ± 0.26
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	0	pp 1024	61.56 ± 0.11
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	99	pp 512	378.69 ± 0.87
llama 7B Q3_K - Large	19.03 GiB	46.70 B	CUDA	99	pp 1024	377.54 ± 1.76

The small difference is probably due to the warmup run performed by llama-bench.

Why is this important: a future change will cause all experts to be copied to VRAM during prompt processing regardless of if they are actually used, while currently only the experts used are copied. This change is important to understand the performance impact of doing that.

ggerganov

M2 Ultra

master

model	size	params	backend	ngl	test	t/s
llama 7B F16	86.99 GiB	46.70 B	Metal	99	pp 512	302.32 ± 0.54
llama 7B F16	86.99 GiB	46.70 B	Metal	99	pp 1024	301.49 ± 0.12

build: 4755afd (2431)

PR

model	size	params	backend	ngl	test	t/s
llama 7B F16	86.99 GiB	46.70 B	Metal	99	pp 512	275.43 ± 1.19
llama 7B F16	86.99 GiB	46.70 B	Metal	99	pp 1024	279.04 ± 0.67

build: 8281389 (2432)

…rganov#6069)

llama-bench : use random tokens to improve accuracy with mixtral

8281389

ggerganov approved these changes Mar 15, 2024

View reviewed changes

ggerganov merged commit b0bc9f4 into master Mar 15, 2024
58 of 63 checks passed

slaren deleted the sl/bench-random-tokens branch March 15, 2024 10:46

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

llama-bench : use random tokens to improve accuracy with mixtral (gge…

7b6be7b

…rganov#6069)

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024

llama-bench : use random tokens to improve accuracy with mixtral (gge…

b7c905e

…rganov#6069)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama-bench : use random tokens to improve accuracy with mixtral #6069

llama-bench : use random tokens to improve accuracy with mixtral #6069

slaren commented Mar 14, 2024 •

edited

Loading

ggerganov left a comment

llama-bench : use random tokens to improve accuracy with mixtral #6069

llama-bench : use random tokens to improve accuracy with mixtral #6069

Conversation

slaren commented Mar 14, 2024 • edited Loading

ggerganov left a comment

Choose a reason for hiding this comment

slaren commented Mar 14, 2024 •

edited

Loading