
AlignBench: Benchmarking Chinese Alignment of Large Language Models

Read this in Chinese (中文版)

This repository contains information, data and code of AlignBench: a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese.

🔥 Updates

[2023.12.12] The AlignBench website is now officially online, and everyone is welcome to visit! You can use the Submit function on the website to run an evaluation with CritiqueLLM on AlignBench (results are returned in about 5 minutes).

📍 Introduction

Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effectively evaluating the alignment of emerging Chinese LLMs remains a significant challenge, calling for diverse, open-ended, challenging and automatic evaluation tailored to alignment. To address this, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a multi-dimensional, rule-calibrated LLM-as-Judge with Chain-of-Thought to generate an explanation and a final rating, ensuring high reliability and interpretability. Furthermore, we developed a dedicated companion evaluator LLM, CritiqueLLM, which recovers 95% of GPT-4's evaluation ability and will be provided via accessible APIs to researchers for convenient evaluation of Chinese alignment.

*(Figure: the overall framework of AlignBench)*

The overall framework of AlignBench is shown in the above image, including the data curation pipeline, the task taxonomy and the multi-dimensional rule-calibrated LLM-as-Judge evaluation method.

For a full description of AlignBench, please refer to the paper: AlignBench

For a full description of CritiqueLLM, please refer to the paper: CritiqueLLM


📦 Dataset

To perform a systematic evaluation, we framed a comprehensive taxonomy of LLM abilities based on real-user instructions. We inspected and summarized user queries into 8 main categories, namely Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The taxonomy and distribution of AlignBench are as follows.

| Category | 中文名 | #Samples |
| --- | --- | --- |
| Fundamental Language Ability | 基本任务 | 68 |
| Advanced Chinese Understanding | 中文理解 | 58 |
| Open-ended Questions | 综合问答 | 38 |
| Writing Ability | 文本写作 | 75 |
| Logical Reasoning | 逻辑推理 | 92 |
| Mathematics | 数学计算 | 112 |
| Task-oriented Role Play | 角色扮演 | 116 |
| Professional Knowledge | 专业能力 | 124 |

AlignBench contains 683 high-quality samples in total. Each sample in AlignBench consists of a task-oriented query, a high-quality reference answer, and the corresponding category in our taxonomy. The data is placed in data/data_release.jsonl, and each line contains one sample in JSON format.

The data format is as follows.

  • question_id (integer): A unique identifier for the question.
  • category (string): The primary category under which the question falls.
  • subcategory (string): The secondary category for further classification.
  • question (string): The actual user query.
  • reference (string): This provides a reference or standard answer to the question.

Here is an example from the Mathematics (数学计算) category.

```json
{
    "question_id": 1,
    "category": "数学计算",
    "subcategory": "初等数学",
    "question": "有一串彩珠,按“2红3绿4黄”的顺序依次排列。第600颗是什么颜色?",
    "reference": "一组\"2红3绿4黄\"共有9颗珠子。600除以9的商是66,余数是6。因此,第600颗珠子是在第67组的第6颗,即\"2红3绿4黄\"中的第6颗,也就是黄色。所以,第600颗珠子是黄色。"
}
```
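
For quick inspection, here is a minimal sketch (assuming only the Python standard library and the data/data_release.jsonl path described above) of how the dataset can be loaded line by line:

```python
import json

# Load AlignBench samples: each line of the .jsonl file is one JSON object.
samples = []
with open("data/data_release.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            samples.append(json.loads(line))

print(len(samples))                               # expected: 683 samples
print(samples[0]["category"], samples[0]["subcategory"])
```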

⚙️ Evaluation Pipeline

To effectively evaluate the quality of responses, AlignBench currently employs GPT-4-0613 to analyze and subsequently grade the responses. During the evaluation process, the input is the user query, the model's response, and a high-quality reference answer, and the output is a multi-dimensional analytical explanation and a final rating ranging from 1 to 10. To ensure reliability and interpretability, we implement the following methods; an example is shown in the figure below.

*(Figure: an example of the multi-dimensional, rule-calibrated evaluation process)*

  • Point-wise Grading. For each model answer, the evaluation methods will give a final rating ranging from 1 to 10.

  • Chain-of-Thought. As the task of grading involves complex reasoning, we have adopted the Chain-of-Thought method to augment both the reliability and interpretability. Specifically, the evaluator LLM is instructed to generate explanations from multiple dimensions before providing a final rating.

  • Rule-calibrated Referencing. For each question, we provide a high-quality reference answer. To guide the evaluator to compare the answer with the reference and to generate more controllable scores, we provide detailed grading rules elaborating the relationship between score intervals and the answer's quality relative to the reference. The rules are included in the prompt.

  • Multi-dimensional Analysis. Because tasks vary in nature and characteristics, applying the same evaluation criteria to all tasks would be unreasonable. As a solution, we employ a multi-dimensional scoring approach to evaluate LLMs' responses, tailoring the evaluation to the specific task at hand. Specifically, we set up different evaluation dimensions for different types of questions, and we instruct the GPT-4 evaluator to analyze the model answer from the specified dimensions and provide dimensional scores. The dimensions and their definitions are placed in config. A simplified sketch of how such a judge prompt can be assembled is shown below.
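
To make the procedure concrete, here is a minimal, hypothetical sketch of how a rule-calibrated, multi-dimensional judge prompt could be assembled and its final rating parsed. The template wording, the dimension list, and the [[rating]] output convention are illustrative assumptions; the actual prompts, grading rules and dimension definitions used by AlignBench are those in config.

```python
import re

# Illustrative template only: the real grading rules and dimensions are
# defined in config/multi-dimension.json, not here.
JUDGE_TEMPLATE = """[Question]
{question}

[Reference Answer]
{reference}

[Model Answer]
{answer}

Analyze the model answer along these dimensions: {dimensions}.
Grading rules (relative to the reference answer): 1-2 far worse, 3-4 worse,
5-6 roughly comparable, 7-8 better, 9-10 significantly better.
First explain your judgment dimension by dimension (Chain-of-Thought),
then give a final rating from 1 to 10 in the form [[rating]]."""


def build_judge_prompt(sample, model_answer, dimensions):
    """Assemble a rule-calibrated, multi-dimensional judge prompt for one sample."""
    return JUDGE_TEMPLATE.format(
        question=sample["question"],
        reference=sample["reference"],
        answer=model_answer,
        dimensions=", ".join(dimensions),
    )


def parse_rating(judgment):
    """Extract the final [[rating]] from the evaluator's output, if present."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
    return float(match.group(1)) if match else None
```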

🚀 Evaluation

The whole evaluation process consists of three steps: inference, LLM judgment, and results display. The corresponding scripts are saved in scripts.

  1. Step I: run inference on the target LLM and get its answers

    First, you need to deploy your target LLM (this part is not covered in this repository).

    Second, implement your own API calling class in inference/api_models; the do_nothing class serves as an example (note that the API class name must be the same as the file name). A hypothetical sketch is given at the end of this step.

    Third, modify and run the following script to get the answers of the target LLM.

    ```bash
    MODEL=do_nothing # TODO: set the model name (the same as your API calling class)

    python get_answers.py \
        --model $MODEL \
        --workers 2 \
        --question-file data/data_release.jsonl \
        --save-dir data/model_answer
    ```

    The answers will be saved in data/model_answer and ready for the LLM Judge process.
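
    For orientation only, a custom API calling class might look roughly like the sketch below. The file path, class name, method name and response schema here are hypothetical; mirror the structure of the provided do_nothing class rather than this sketch.

    ```python
    # inference/api_models/my_model.py  (hypothetical example)
    import requests  # assumes your target LLM is served behind an HTTP endpoint

    class my_model:  # the class name must match the file name (my_model.py)
        def __init__(self, endpoint="http://localhost:8000/generate"):
            self.endpoint = endpoint  # hypothetical URL of your deployed model

        def generate(self, prompt):
            """Send one query to the deployed model and return its text answer."""
            resp = requests.post(self.endpoint, json={"prompt": prompt}, timeout=120)
            resp.raise_for_status()
            return resp.json()["response"]  # adapt to your API's response schema
    ```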

  2. Step II: get the GPT-4 judgments

    First, fill in your GPT-4 API key in config/multi-dimension.json.

    Then, modify and run the following script to get the judgments of the target LLM.

    ```bash
    MODEL=do_nothing # TODO: set the model name (the same as your API calling class)

    python judge.py \
        --config-path config/multi-dimension.json \
        --model-name $MODEL \
        --parallel 2
    ```
    The judgments will be stored in data/judgment.

  3. Step III: display the results

    Run the following script to get the results of all the LLM judgments saved in data/judgment.

    ```bash
    python show_result.py \
        --input-dir data/judgment \
        --ques-file data/data_release.jsonl \
        --save-file data/results/results.xlsx
    ```

    The calculated results will be stored in data/results in xlsx format. A minimal sketch for loading them follows.
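
    If you want to inspect the aggregated scores programmatically, a minimal sketch (assuming pandas and openpyxl are installed, and the default --save-file path above) is:

    ```python
    # Load the aggregated results written by show_result.py.
    import pandas as pd

    results = pd.read_excel("data/results/results.xlsx")
    print(results.head())  # quick look at the per-model scores
    ```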


📂 Leaderboard

We report evaluation results for 17 Chinese-capable LLMs on AlignBench, judged by gpt-4-0613 and by CritiqueLLM. In the tables below, the Reasoning (中文推理) group consists of Math. (数学计算) and Logi. (逻辑推理), and the Language (中文语言) group consists of Fund. (基本任务), Chi. (中文理解), Open. (综合问答), Writ. (文本写作), Role. (角色扮演) and Pro. (专业能力); Avg. is the average within each group, and Overall (总分) is the overall score.

gpt-4-0613 judged results:

| model 模型 | Overall 总分 | Reasoning Avg. 推理总分 | Math. 数学计算 | Logi. 逻辑推理 | Language Avg. 语言总分 | Fund. 基本任务 | Chi. 中文理解 | Open. 综合问答 | Writ. 文本写作 | Role. 角色扮演 | Pro. 专业能力 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4-1106-preview | 8.01 | 7.73 | 7.8 | 7.66 | 8.29 | 7.99 | 7.33 | 8.61 | 8.67 | 8.47 | 8.65 |
| gpt-4-0613 | 7.53 | 7.47 | 7.56 | 7.37 | 7.59 | 7.81 | 6.93 | 7.42 | 7.93 | 7.51 | 7.94 |
| chatglm-turbo(智谱清言) | 6.24 | 5 | 4.74 | 5.26 | 7.49 | 6.82 | 7.17 | 8.16 | 7.77 | 7.76 | 7.24 |
| erniebot-3.0(文心一言) | 6.14 | 5.15 | 5.03 | 5.27 | 7.13 | 6.62 | 7.6 | 7.26 | 7.56 | 6.83 | 6.9 |
| gpt-3.5-turbo-0613 | 6.08 | 5.35 | 5.68 | 5.02 | 6.82 | 6.71 | 5.81 | 7.29 | 7.03 | 7.28 | 6.77 |
| chatglm-pro(智谱清言) | 5.83 | 4.65 | 4.54 | 4.75 | 7.01 | 6.51 | 6.76 | 7.47 | 7.07 | 7.34 | 6.89 |
| spark_desk_v2(讯飞星火) | 5.74 | 4.73 | 4.71 | 4.74 | 6.76 | 5.84 | 6.97 | 7.29 | 7.18 | 6.92 | 6.34 |
| qwen-14b-chat | 5.72 | 4.81 | 4.91 | 4.71 | 6.63 | 6.9 | 6.36 | 6.74 | 6.64 | 6.59 | 6.56 |
| baichuan2-13b-chat | 5.25 | 3.92 | 3.76 | 4.07 | 6.59 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43 |
| chatglm3-6b | 4.97 | 3.85 | 3.55 | 4.14 | 6.1 | 5.75 | 5.29 | 6.71 | 6.83 | 6.28 | 5.73 |
| baichuan2-7b-chat | 4.97 | 3.66 | 3.56 | 3.75 | 6.28 | 5.81 | 5.5 | 7.13 | 6.84 | 6.53 | 5.84 |
| internlm-20b | 4.96 | 3.66 | 3.39 | 3.92 | 6.26 | 5.96 | 5.5 | 7.18 | 6.19 | 6.49 | 6.22 |
| qwen-7b-chat | 4.91 | 3.73 | 3.62 | 3.83 | 6.09 | 6.4 | 5.74 | 6.26 | 6.31 | 6.19 | 5.66 |
| chatglm2-6b | 4.48 | 3.39 | 3.16 | 3.61 | 5.58 | 4.91 | 4.52 | 6.66 | 6.25 | 6.08 | 5.08 |
| internlm-chat-7b | 3.65 | 2.56 | 2.45 | 2.66 | 4.75 | 4.34 | 4.09 | 5.82 | 4.89 | 5.32 | 4.06 |
| Chinese-llama-2-7b-chat | 3.57 | 2.68 | 2.29 | 3.07 | 4.46 | 4.31 | 4.26 | 4.5 | 4.63 | 4.91 | 4.13 |
| llama-2-13b-Chinese-chat | 3.35 | 2.47 | 2.21 | 2.73 | 4.23 | 4.13 | 3.31 | 4.79 | 3.93 | 4.53 | 4.71 |

CritiqueLLM judged results:

| model 模型 | Overall 总分 | Reasoning Avg. 推理总分 | Math. 数学计算 | Logi. 逻辑推理 | Language Avg. 语言总分 | Fund. 基本任务 | Chi. 中文理解 | Open. 综合问答 | Writ. 文本写作 | Role. 角色扮演 | Pro. 专业能力 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4-1106-preview | 7.58 | 7.11 | 7.39 | 6.83 | 8.05 | 7.69 | 7.07 | 8.66 | 8.23 | 8.08 | 8.55 |
| gpt-4-0613 | 6.83 | 6.41 | 6.49 | 6.33 | 7.26 | 7.16 | 6.76 | 7.26 | 7.31 | 7.48 | 7.56 |
| chatglm-turbo(智谱清言) | 6.36 | 4.99 | 4.88 | 5.09 | 7.73 | 7.5 | 7.03 | 8.45 | 8.05 | 7.67 | 7.7 |
| erniebot-3.0(文心一言) | 5.91 | 4.75 | 4.34 | 5.15 | 7.07 | 6.46 | 7.21 | 7.29 | 7.73 | 7.03 | 6.72 |
| chatglm-pro(智谱清言) | 5.73 | 4.49 | 4.55 | 4.43 | 6.96 | 6.47 | 6.81 | 7.26 | 7.25 | 7.29 | 6.7 |
| gpt-3.5-turbo-0613 | 5.68 | 4.85 | 4.90 | 4.79 | 6.52 | 6.01 | 5.6 | 6.97 | 7.27 | 6.98 | 6.29 |
| spark_desk_v2(讯飞星火) | 5.51 | 4.58 | 4.53 | 4.62 | 6.44 | 5.76 | 6.29 | 6.37 | 7.25 | 7.03 | 5.96 |
| qwen-14b-chat | 5.41 | 4.52 | 4.54 | 4.50 | 6.31 | 6.46 | 5.84 | 6.71 | 6.47 | 6.38 | 5.98 |
| baichuan2-13b-chat | 5.26 | 3.96 | 3.83 | 4.08 | 6.56 | 5.74 | 6.19 | 7.03 | 7.21 | 6.72 | 6.49 |
| baichuan2-7b-chat | 5.05 | 3.68 | 3.23 | 4.13 | 6.42 | 5.72 | 5.71 | 7.08 | 7.41 | 6.86 | 5.73 |
| chatglm3-6b | 5.01 | 3.70 | 3.44 | 3.95 | 6.33 | 6.13 | 5.72 | 6.92 | 7.11 | 6.31 | 5.77 |
| internlm-20b | 4.97 | 3.67 | 3.46 | 3.87 | 6.27 | 5.65 | 5.52 | 6.71 | 6.77 | 6.35 | 6.61 |
| qwen-7b-chat | 4.74 | 3.66 | 3.51 | 3.80 | 5.83 | 6.01 | 5.52 | 5.89 | 6.28 | 6.16 | 5.12 |
| chatglm2-6b | 4.57 | 3.32 | 3.28 | 3.35 | 5.83 | 5.24 | 5.12 | 6.68 | 6.83 | 5.95 | 5.15 |
| Chinese-llama-2-7b-chat | 3.44 | 2.42 | 2.13 | 2.70 | 4.46 | 4.59 | 4.29 | 4.39 | 4.64 | 4.91 | 3.94 |
| internlm-chat-7b | 3.24 | 2.10 | 2.34 | 1.85 | 4.39 | 3.43 | 3.76 | 5.37 | 4.63 | 5.01 | 4.15 |
| llama-2-13b-Chinese-chat | 3.14 | 2.35 | 2.12 | 2.58 | 3.93 | 4.31 | 2.9 | 4.34 | 3.52 | 4.04 | 4.47 |

👏 Citation

```bibtex
@misc{liu2023alignbench,
      title={AlignBench: Benchmarking Chinese Alignment of Large Language Models},
      author={Xiao Liu and Xuanyu Lei and Shengyuan Wang and Yue Huang and Zhuoer Feng and Bosi Wen and Jiale Cheng and Pei Ke and Yifan Xu and Weng Lam Tam and Xiaohan Zhang and Lichao Sun and Hongning Wang and Jing Zhang and Minlie Huang and Yuxiao Dong and Jie Tang},
      year={2023},
      eprint={2311.18743},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```