Hi all, thanks for the great work! I have a question about evaluating a chat-based model (qwen-72b-chat). As the OpenCompass leaderboard shows (https://opencompass.org.cn/leaderboard-llm), qwen-14b-chat got 71.7 acc on the C-Eval dataset. I've checked that its config file is "cmmlu_gen_c13365.py", which uses a 5-shot prompt. When I run my own evaluation of qwen-72b-chat with the same config file, the output is not as expected: sometimes it contains answers for all five few-shot sample questions plus the real question. Is this normal? How should the output be post-processed in this case? Did you filter out the last answer (A, B, C, or D) as the prediction label?
It seems that you are using the wrong qwen-72b-chat model config, as the user & bot prompts are missing. Please try this one: https://github.com/open-compass/opencompass/blob/8798336b8593ce059ff0e54b2f2faf78c328bccc/configs/models/qwen/hf_qwen_72b_chat.py
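For context, here is a rough sketch of what a chat-model config with the user & bot prompts (the `meta_template`) looks like in OpenCompass. This is illustrative, not a copy of the linked file: the exact paths, batch size, GPU count, and other field values are assumptions, so please refer to the config at the URL above for the authoritative version. The key point is that the `meta_template` wraps every turn in Qwen's ChatML tags, which is what keeps the chat model from rambling past the question.

```python
from opencompass.models import HuggingFaceCausalLM

# Chat prompt template: each user turn and bot turn is wrapped in Qwen's
# ChatML markers. Without this, the few-shot prompt is fed as raw text and
# the model may answer all of the in-context examples as well.
_meta_template = dict(
    round=[
        dict(role='HUMAN', begin='\n<|im_start|>user\n', end='<|im_end|>'),
        dict(role='BOT', begin='\n<|im_start|>assistant\n', end='<|im_end|>', generate=True),
    ],
)

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='qwen-72b-chat-hf',
        path='Qwen/Qwen-72B-Chat',           # assumed HF path; check the linked config
        tokenizer_path='Qwen/Qwen-72B-Chat',
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        meta_template=_meta_template,         # this is what the base-model config lacks
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,                         # illustrative values
        run_cfg=dict(num_gpus=4, num_procs=1),
        end_str='<|im_end|>',                 # stop generation at the end of the bot turn
    )
]
```

With a config like this, the model should emit a single answer per question, and the dataset's usual post-processor can pick out the option letter from it.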