Code models score low on humaneval dataset #1067

PJY-coder · 2024-04-22T09:42:38Z

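I used opencompass to test code LLMs such as starcoder2, wizardcoder, and codellama, and their humaneval scores are very low (below 10). Why is the same preprocessing used for all of them? The pre- and post-processing for different code LLMs seems to differ quite a lot, doesn't it?

Has anyone else run into this?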

jingmingzhuo · 2024-04-22T10:32:03Z

Could you provide the configs of models and datasets for testing? Thanks!

7 replies

PJY-coder Apr 26, 2024
Author

Thanks, I'll try starcoder2 again.
How about wizardcoder-15b?

jingmingzhuo Apr 27, 2024

Sorry for taking so long to respond.

I checked the wizardcoder config and found that it lacks a meta template. Here is a modified version:

from opencompass.models import HuggingFaceCausalLM

_meta_template = dict(
    round=[
        dict(role="HUMAN", begin='### Instruction:\n', end='\n\n'),
        dict(role="BOT", begin="### Response:\n", end='</s>', generate=True),
    ],
)

models = [
    # WizardCoder 15B
    dict(
        type=HuggingFaceCausalLM,
        abbr='WizardCoder-15B-V1.0',
        path="WizardLM/WizardCoder-15B-V1.0",
        tokenizer_path='WizardLM/WizardCoder-15B-V1.0',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        run_cfg=dict(num_gpus=2, num_procs=1),
    ),
]

In my testing, its pass@1 score under humaneval_gen is 47.56.
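
To make the effect of the meta template above concrete, here is a minimal sketch of how such a template could wrap a question before it reaches the model. `render_prompt` is a hypothetical helper written for illustration, not OpenCompass's actual implementation, and the question text is made up.

```python
# Illustrative only: a simplified renderer, NOT OpenCompass's real prompt builder.
_meta_template = dict(
    round=[
        dict(role="HUMAN", begin='### Instruction:\n', end='\n\n'),
        dict(role="BOT", begin="### Response:\n", end='</s>', generate=True),
    ],
)


def render_prompt(meta_template: dict, human_text: str) -> str:
    """Concatenate begin/text/end per role, stopping where the model generates."""
    parts = []
    for item in meta_template["round"]:
        parts.append(item["begin"])
        if item.get("generate"):
            # The model continues from here, so `end` is not part of the input.
            break
        parts.append(human_text)
        parts.append(item["end"])
    return "".join(parts)


# Hypothetical question text, used only to show the wrapping.
prompt = render_prompt(_meta_template, "Complete this function:\ndef add(a, b):")
print(prompt)
```

Without a template like this, the model would receive the bare question with no `### Instruction:` / `### Response:` markers, which may explain part of a low score for an instruction-tuned model.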

PJY-coder Apr 28, 2024
Author

Thanks for answering, but I don't think the template you provided is correct.

According to here, the pass@1 score on humaneval should be around 55. I also found that humaneval_gen_4a6eef works well, and it is the same as this.

In my testing, its pass@1 score under humaneval_gen is 54.27 with the vllm accelerator and 52.43 without it. The numbers are acceptable, but a gap remains.
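
For readers comparing these numbers, here is a minimal sketch of how a pass@k score is commonly computed, using the unbiased estimator from the original HumanEval paper. The function names and the sample data are assumptions for illustration. Whether this thread's runs used one greedy sample per problem is also an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, as a percentage; results holds (n, c)."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 164 problems, one sample each, 89 of them solved.
results = [(1, 1)] * 89 + [(1, 0)] * 75
print(round(benchmark_score(results), 2))  # 54.27
```

With one sample per problem, pass@1 is simply the fraction of problems solved, so on a 164-problem benchmark each additional solved problem moves the score by about 0.61 points.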

Finally, here is what confuses me: 1. Why does opencompass name its humaneval templates in such a strange way? 2. It is hard and time-consuming for users to tell which template is correct (the default config file gives no details either). This is not user-friendly.

jingmingzhuo Apr 30, 2024

The prompt in humaneval_gen_4a6eef is the same as the one in the official wizardcoder repo, which may align with the training phase.
The suffix is the md5 hash of the prompt.
We apologize for the confusion and will document this more clearly.
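
As a rough illustration of the naming scheme described above, the sketch below derives a short suffix from a prompt config. `config_suffix` is a hypothetical helper. The hash function (md5, as stated in the reply), the JSON serialization, and the six-character length are assumptions here and may not match what OpenCompass actually does.

```python
import hashlib
import json


def config_suffix(prompt_cfg: dict, length: int = 6) -> str:
    """Hypothetical helper: short hex digest of a prompt config."""
    # sort_keys keeps the serialization, and so the digest, deterministic.
    serialized = json.dumps(prompt_cfg, sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()[:length]


# Two made-up prompt configs that differ only in their instruction text.
prompt_a = {"role": "HUMAN", "prompt": "Complete the following python code:\n{prompt}"}
prompt_b = {"role": "HUMAN", "prompt": "Create a Python script for this problem:\n{prompt}"}

print(config_suffix(prompt_a), config_suffix(prompt_b))
```

Under such a scheme, any change to the prompt text yields a different suffix. That is why two humaneval configs with different suffixes can produce noticeably different scores for the same model.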

Answer selected by PJY-coder

PJY-coder Apr 30, 2024
Author

Thanks very much! No further questions.
