
logo

🏆 Leaderboard | 🔥 Quick Start | 🐛 Issues | 📜 Citation

📣 About

Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

This repo provides easy-to-use scripts for both evaluating LLMs on existing StructEval benchmarks and generating new benchmarks based on the StructEval framework.

📰 Read our paper StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation for more details.

logo

🚀 News

  • [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models. More datasets and models are coming soon 🔥🔥🔥.

  • [2024.7.31] We regenerated the StructEval benchmark based on the latest Wikipedia pages (20240601) using the GPT-4o-mini model, which helps minimize the impact of data contamination. Please refer to the struct_benchmark folder for our evaluation data and scripts 🔥🔥🔥.

🔥 Quick Start

✏️ Evaluate models on StructEval benchmarks

To facilitate evaluation, we have adapted StructEval to OpenCompass 2.0, making it easy to quickly evaluate multiple models.

For instance, to evaluate the llama-3-8b-instruct model on the StructMMLU dataset, you only need to import the corresponding dataset and model configurations in struct_benchmark/eval_config/eval_struct_mmlu_v1_instruct.py:

from mmengine.config import read_base

# Import the dataset and model configurations shipped with the repo.
with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model

datasets = [*struct_mmlu_V1_datasets]
# Collect every imported model list (all variables whose names end with '_model').
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

Then run the following commands:

cd struct_benchmark
python run.py eval_config/eval_struct_mmlu_v1_instruct.py -w output/struct_mmlu_v1_instruct

The evaluation results will be saved in the struct_benchmark/output/struct_mmlu_v1_instruct directory.
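
To compare several models in one run, you can import additional model configurations in the same file; every variable whose name ends with _model is collected automatically by the models = sum(...) line. A minimal sketch (the commented-out import path is hypothetical; replace it with a config that actually exists under struct_benchmark/model_configs/):

from mmengine.config import read_base

with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model
    # Hypothetical second model -- adjust the path to a config present in your checkout.
    # from ..model_configs.hf_qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat_model

datasets = [*struct_mmlu_V1_datasets]
# All imported model lists (variables ending in '_model') are merged into one list.
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])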

Please refer to struct_benchmark/README.md for more detailed guidance.

🔨 Generate new benchmarks based on StructEval framework

The struct_generate folder provides the source code as well as a running example for benchmark construction based on StructEval. Specifically, StructEval consists of two modules that deepen and broaden the current evaluation, respectively. Given a seed instance, the first module identifies its underlying test objective and then generates multiple test instances around this objective, aligned with the six cognitive levels outlined in Bloom’s Taxonomy. Meanwhile, the second module extracts the key concepts that must be understood to answer the seed question and then develops a series of instances revolving around these concepts.
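
As an illustration only, a seed instance expanded by these two modules can be pictured as the structure below (the field names and the example question are hypothetical, not the repo's actual data schema; see struct_generate/README.md for the real formats):

# Hypothetical sketch of a structured instance derived from one seed question.
structured_instance = {
    "seed_question": "Which gas makes up most of Earth's atmosphere?",
    "test_objective": "composition of Earth's atmosphere",
    # Module 1 (deepen): items aligned with Bloom's six cognitive levels.
    "bloom_items": {
        "remember": "...",
        "understand": "...",
        "apply": "...",
        "analyze": "...",
        "evaluate": "...",
        "create": "...",
    },
    # Module 2 (broaden): items revolving around the key concepts behind the seed.
    "concept_items": {
        "atmosphere": ["..."],
        "nitrogen": ["..."],
    },
}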

You can construct a structured benchmark for LLM evaluation from a set of seed instances by executing the following commands:

cd struct_generate
bash scripts/run_bloom_generate.bash demo test
bash scripts/run_concept_generation.bash demo test
bash scripts/run_data_combine.bash demo test

Please refer to struct_generate/README.md for more detailed guidance.

📜 Citation

@misc{cao2024structevaldeepenbroadenlarge,
      title={StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation}, 
      author={Boxi Cao and Mengjie Ren and Hongyu Lin and Xianpei Han and Feng Zhang and Junfeng Zhan and Le Sun},
      year={2024},
      eprint={2408.03281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03281}, 
}