Issue about reproducing results in some datasets #2

ToheartZhang · 2024-05-13T03:57:21Z

Thanks for your great work! I cloned the math_eval directory and ran run_7B_plus.sh directly, but I see performance gaps on several datasets.

| Model | TheoremQA | GPQA | MMLU STEM | BBH | ARC-C | MATH | GSM8k |
|---|---|---|---|---|---|---|---|
| MAmmoTH2-7B-Plus (reported) | 29.2 | 36.8 | 65.7 | 63.1 | 83 | 45 | 84.7 |
| MAmmoTH2-7B-Plus (reproduced) | 26.75 | 31.31 | 64.29 | 63.6 | 83.02 | 44.32 | 83.4 |
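To make the size of each gap easier to read, here is a minimal sketch that subtracts the reported score from the reproduced one for every dataset. The numbers are copied from the table; the function name `score_gaps` is only illustrative and is not part of the math_eval code.

```python
# Scores copied from the table (accuracy, in percent).
reported = {
    "TheoremQA": 29.2, "GPQA": 36.8, "MMLU STEM": 65.7, "BBH": 63.1,
    "ARC-C": 83.0, "MATH": 45.0, "GSM8k": 84.7,
}
reproduced = {
    "TheoremQA": 26.75, "GPQA": 31.31, "MMLU STEM": 64.29, "BBH": 63.6,
    "ARC-C": 83.02, "MATH": 44.32, "GSM8k": 83.4,
}


def score_gaps(reported, reproduced):
    """Return reproduced minus reported for each dataset, rounded to 2 decimals."""
    return {name: round(reproduced[name] - reported[name], 2) for name in reported}


gaps = score_gaps(reported, reproduced)

# Print the datasets from the largest shortfall to the largest gain.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:10s} {gap:+.2f}")
```

The largest shortfalls are on GPQA and TheoremQA; the remaining datasets are within 1.5 points of the reported scores.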

My environment is:

vllm                      0.2.6
torch                     2.1.2
transformers              4.40.0
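For anyone comparing environments, a small sketch like the following prints the same three versions from inside the Python environment used for evaluation. It relies only on the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("vllm", "torch", "transformers")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:25s} {ver or 'not installed'}")
```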

Am I missing something? Thanks for your help!


wenhuchen · 2024-05-13T12:47:37Z

wenhuchen · 2024-05-13T15:15:32Z

This seems to be mainly due to your older vllm version. Please try upgrading vllm and rerunning the evaluation to reproduce the results. Thanks!

ToheartZhang · 2024-05-21T03:46:13Z

Thanks for your help! Here are my updated results with the new vllm version. I think the GPQA dataset is a little unstable.

| Model | TheoremQA | GPQA | MMLU STEM | BBH | ARC-C | MATH | GSM8k |
|---|---|---|---|---|---|---|---|
| MAmmoTH2-7B-Plus (reported) | 29.2 | 36.8 | 65.7 | 63.1 | 83 | 45 | 84.7 |
| MAmmoTH2-7B-Plus (reproduced) | 28.88 | 30.81 | 64.58 | 63.05 | 82.68 | 44.42 | 85.06 |
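One way to gauge how unstable GPQA might be is to estimate the sampling noise of an accuracy measured on a small test set. The sketch below uses the binomial standard error. The question count of 198 is an assumption (the size of the GPQA Diamond subset); the thread does not say which subset was evaluated, although the 0.5-point change between the two reproduced runs is consistent with a single question out of 198.

```python
import math


def accuracy_standard_error(accuracy_pct, n_questions):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


N_QUESTIONS = 198  # assumption: GPQA Diamond size; not stated in the thread

reported_gpqa = 36.8     # from the table
reproduced_gpqa = 30.81  # from the table

se = accuracy_standard_error(reproduced_gpqa, N_QUESTIONS)
gap_in_se = (reported_gpqa - reproduced_gpqa) / se

print(f"standard error: {se:.2f} points")
print(f"gap to reported score: {gap_in_se:.2f} standard errors")
```

Under that assumption, one standard error is a little over 3 points, so the roughly 6-point GPQA gap is within about two standard errors. This only accounts for test-set size, not for differences in decoding or library versions.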

wenhuchen · 2024-05-21T04:07:02Z

Thanks! Would you mind trying our updated checkpoint? It gets better results. Please refer to https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus.

wenhuchen closed this as completed May 19, 2024

wenhuchen reopened this May 21, 2024
