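Take a multiple-choice question as an example. Question: What kind of animal is Xiaobai? A. Mouse B. Ox C. Tiger D. Rabbit. ppl (perplexity) means giving the model 4 sentences:
Both gen and ppl ultimately produce one of A / B / C / D, which is compared with the reference answer to decide whether the model scores. The above is background; the question follows. English description follows: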
Replies: 2 comments
ppl reflects the model's ability on multiple-choice questions, on this dataset, under this mode of use. Whether that ability extrapolates to other capabilities, or can be used to judge whether a model is good or bad in general, depends on what you value.
In theory, the ppl and gen scores on multiple-choice questions are not necessarily the same. An LM performs next-token prediction: in ppl mode the candidates for the next token are restricted to A / B / C / D, whereas in gen mode the next token can be anything in the vocabulary.
When the model's instruction-following ability is weak, it may not output A / B / C / D at all; or, if the model was fine-tuned in an unusual way, it might output a long explanation first and only then A / B / C / D. These factors can change what gets extracted from the output, and therefore the accuracy.
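
To make the difference concrete, here is a minimal Python sketch. It follows the simplified next-token framing used in this reply, and everything in it is an assumption for illustration: the probabilities are made up, and `ppl_style_pick`, `gen_style_first_token`, and `extract_choice` are hypothetical helpers, not any framework's actual API.

```python
import re

# Toy next-token distribution after a prompt ending in "Answer:".
# The numbers are invented for illustration, not taken from a real model.
next_token_probs = {
    "The": 0.40,
    "D": 0.25,
    "A": 0.15,
    "B": 0.10,
    "C": 0.05,
    "I": 0.05,
}

CHOICES = ["A", "B", "C", "D"]


def ppl_style_pick(probs):
    """Compare only the four option tokens, ignoring the rest of the vocabulary."""
    return max(CHOICES, key=lambda c: probs.get(c, 0.0))


def gen_style_first_token(probs):
    """Greedy decoding over the whole vocabulary: take the most likely token."""
    return max(probs, key=probs.get)


def extract_choice(text):
    """Hypothetical post-processor: first standalone A/B/C/D, or None if absent."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


print(ppl_style_pick(next_token_probs))           # restricted comparison
print(gen_style_first_token(next_token_probs))    # unrestricted greedy token
print(extract_choice("The answer is D."))         # extraction succeeds
print(extract_choice("Xiaobai is a rabbit."))     # nothing to extract
```

With these toy numbers the restricted comparison picks D, while greedy generation starts with "The". Whether gen then scores depends on how the rest of the output reads and on the extraction rule, which is where the two modes can diverge.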