LLaMA-3.2 quantization evaluation #63
ikawrakow started this conversation in Show and tell

Replies: 2 comments
- Here are some performance numbers for the 1B model on a Ryzen-7950X CPU
- Here are some performance numbers for the 3B model on a Ryzen-7950X CPU

LLaMA-3.2 is out. `llama.cpp` does not yet support the vision models, so this post focuses on the 1B and 3B text models that could be very handy for local usage on low-end devices. The models are small enough even with full precision (`bf16`), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.

To reproduce the results reported here
Perplexity
Perplexity (`PPL` in what follows) is not the best measure to compare different models, but it is extremely useful when comparing a quantized version of a model to the same full precision model. In the graphs below I use the quantization error defined as

$$\text{quantization error} = \frac{\text{PPL}(Q)}{\text{PPL}(\text{bf16})} - 1$$

where `PPL(Q)` is the perplexity of quantization `Q` and `PPL(bf16)` is the perplexity of the full model (the 3.2 models are released as `bf16`, so I use `bf16` throughout as `bf16` support has been added here in PRs #39, #40, #41, #56).
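For concreteness, here is a minimal sketch of this definition in Python; the perplexity values in the usage line are placeholders, not measured numbers.

```python
# Quantization error as defined above: PPL(Q)/PPL(bf16) - 1, reported in percent.
def quantization_error(ppl_q: float, ppl_bf16: float) -> float:
    """Relative perplexity increase of quantization Q over the bf16 baseline."""
    return ppl_q / ppl_bf16 - 1.0

# Placeholder perplexities, only to show the usage:
print(f"{100.0 * quantization_error(10.43, 10.00):.2f}%")  # -> 4.30%
```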
The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw that includes the `token_embedding.weight` tensor, which is quantized with more bits (typically `Q6_K`), and this has a significant impact on the overall bpw balance as this tensor represents a significant fraction of the overall model size. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants `IQ2_K`, `IQ3_K`, `IQ4_K`, `IQ5_K` and `IQ6_K` that are not available in mainline `llama.cpp`. The black symbols are for i-quants, the red for k-quants, and the blue symbols are for legacy quants (`Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`).

The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with the quantization error being slightly lower.
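A note on the effective bpw quoted above: it is just the total number of bits spent on all weights divided by the total parameter count. A back-of-the-envelope sketch, where the tensor sizes are round illustrative numbers (not the exact LLaMA-3.2-3B shapes) and only `token_embedding.weight` is assumed to stay at `Q6_K`:

```python
# Effective bits-per-weight: total bits spent on all tensors divided by the
# total parameter count. The parameter counts below are round illustrative
# numbers, not the exact LLaMA-3.2-3B tensor sizes.
def effective_bpw(tensors: list[tuple[int, float]]) -> float:
    """tensors: (parameter_count, bits_per_weight) pairs."""
    total_bits = sum(n * bpw for n, bpw in tensors)
    total_params = sum(n for n, _ in tensors)
    return total_bits / total_params

embedding = (400_000_000, 6.5625)   # token_embedding.weight kept at Q6_K (6.5625 bpw)
rest = (2_800_000_000, 3.4375)      # everything else at an ~3.4 bpw quant (assumed)
print(f"effective bpw: {effective_bpw([embedding, rest]):.2f}")  # ~3.83
```

This also shows why the embedding tensor matters: a ~3.4 bpw main quant ends up closer to 3.8 bpw effective once the embedding is kept at `Q6_K`.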

My conclusion from these two graphs is that `IQ4_K` and `IQ5_K` are significantly better than k- or legacy quants in this bpw range.

The next graph is for the base LLaMA-3.2-1B model.
Here the quantization error is significantly larger, going below 2% only for 5+ bpw. At about 4.95 bpw, `IQ4_K` has a quantization error of 3%, `Q4_K_S` is at 4.3%, and `Q4_0` at 12.5% (!), nearly the same as `IQ3_K` at 3.68 bpw.
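Since the main appeal of these models is local usage on low-end devices, it can be useful to translate effective bpw into an approximate file size. A rough sketch, assuming about 1.2 billion parameters for the 1B model and using the bpw values quoted above:

```python
# Rough model size from parameter count and effective bits-per-weight.
# The 1.2e9 parameter count is an approximation for LLaMA-3.2-1B; the bpw
# values are the ones quoted above, plus bf16 at 16 bits for reference.
def approx_size_mib(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 2**20

for name, bpw in [("bf16", 16.0), ("IQ4_K", 4.95), ("IQ3_K", 3.68)]:
    print(f"{name:6s} ~{approx_size_mib(1.2e9, bpw):5.0f} MiB")
```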
HellaSwag

The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows the HellaSwag score as a function of bpw for LLaMA-3.2-3B.

As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.

We see that `IQ4_K`, `IQ5_K`, `IQ6_K` and `Q6_K` are basically indistinguishable from the `bf16` model for the HellaSwag metric. But at less than 2 points below `bf16`, even `IQ3_K` and `IQ3_S` could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.
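To make the "indistinguishable from `bf16`" reading a bit more concrete, here is a small sketch that flags quantizations whose score stays within a chosen margin of the `bf16` score; apart from the 74.34 `bf16` value quoted above, the per-quant scores are placeholders.

```python
# Flag quantizations whose HellaSwag score is within `margin` points of bf16.
# The per-quant scores below are placeholders, not measured values.
def within_margin(scores: dict[str, float], bf16_score: float, margin: float = 2.0) -> list[str]:
    return [q for q, s in scores.items() if bf16_score - s <= margin]

scores = {"Q6_K": 74.3, "IQ4_K": 74.1, "IQ3_K": 72.9, "IQ2_K": 66.0}  # placeholders
print(within_margin(scores, bf16_score=74.34))  # quants close enough to bf16 for this metric
```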
MMLU

Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph. All quantizations above `IQ3_K` (3.6 bpw) are (nearly) indistinguishable from the full `bf16` model according to this metric.