We introduce the Open-LLM-Leaderboard to track various LLMs' performance on open-style questions and reflect their true capability. You can use OSQ-bench questions and prompts to evaluate your models automatically with an LLM-based evaluator. The leaderboard is available for viewing on Hugging Face.
We provide pre-generated model answers and evaluations for a range of models. They can be downloaded from the Hugging Face dataset, or viewed on Google Drive.
```python
import datasets

gpt4_responses = datasets.load_dataset("Open-Style/Open-LLM-Benchmark", "gpt4")
```
Each data point is represented as follows:
```json
{
    "question": "What is the main function of photosynthetic cells within a plant?",
    "gold_answer": "to convert energy from sunlight into food energy",
    "os_answer": "The main function of photosynthetic cells ...",
    "os_eval": "Correct",
    "mcq_answer": "C",
    "mcq_eval": true,
    "dataset": "ARC"
}
```
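For instance, the pre-generated evaluations can be aggregated into per-dataset open-style accuracy. Below is a minimal sketch, assuming the `gpt4` subset loaded above, a `train` split (as used for the `questions` subset later in this README), and the field names shown in the example record:

```python
from collections import defaultdict

import datasets

# Pre-generated GPT-4 answers and their GPT-4-judged evaluations
gpt4_responses = datasets.load_dataset("Open-Style/Open-LLM-Benchmark", "gpt4")

correct, total = defaultdict(int), defaultdict(int)
for example in gpt4_responses["train"]:  # split name assumed
    name = example["dataset"]
    total[name] += 1
    # "os_eval" holds the judge's verdict on the open-style answer
    if example["os_eval"] == "Correct":
        correct[name] += 1

for name in sorted(total):
    print(f"{name}: {100 * correct[name] / total[name]:.2f}% open-style accuracy")
```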
OSQ-bench is a set of questions, drawn from the MMLU, ARC, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, and OpenbookQA datasets, that are suitable for open-style answering. To automate the evaluation process, we use LLMs like GPT-4 to act as evaluators and assess the quality of the models' responses.
To evaluate a model, you need to:
- Download the benchmark and generate the answers. You can download it from the Hugging Face dataset:
```python
import datasets
import json

# Download the benchmark questions
eval_set = datasets.load_dataset("Open-Style/Open-LLM-Benchmark", "questions")

grouped_responses = {}
for example in eval_set["train"]:
    # `generate` is a placeholder for your model's generation function
    response = {
        "Question": example["question"],
        "os_answer": generate(example["question"]),
        "dataset": example["dataset"],
    }
    dataset = example["dataset"]
    if dataset not in grouped_responses:
        grouped_responses[dataset] = []
    grouped_responses[dataset].append(response)
```
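The grouped answers then need to be written to disk so the evaluator can read them. Here is a minimal sketch, assuming one JSON file per dataset under a hypothetical `answers/{model}/` directory; adjust the path and format to whatever evaluate.py in this repository expects:

```python
import json
import os

def save_answers(grouped_responses: dict, model_name: str, out_root: str = "answers") -> None:
    """Write one JSON file per dataset with the model's open-style answers."""
    out_dir = os.path.join(out_root, model_name)  # assumed output layout
    os.makedirs(out_dir, exist_ok=True)
    for dataset, responses in grouped_responses.items():
        with open(os.path.join(out_dir, f"{dataset}.json"), "w") as f:
            json.dump(responses, f, indent=2)

# e.g., continuing from the snippet above:
# save_answers(grouped_responses, "my-model")
```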
Alternatively, lm-evaluation-harness can be used to generate the answers. First install it with `pip install lm-eval`, then run the following for the tasks in the lm-eval-tasks folder:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=[MODEL-NAME] \
    --tasks os_mmlu \
    --device cuda:0 \
    --num_fewshot 0 \
    --include_path ./ \
    --batch_size auto \
    --output_path mmlu.jsonl \
    --log_samples \
    --predict_only
```
In this step, we ask GPT-4 to grade the model's answer by comparing it to the correct answer from the benchmark. For each question, GPT-4 returns either 'Correct' or 'Incorrect', and we then compute the average score over all questions.
```bash
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python evaluate.py --model [MODEL-NAME] --parallel [num-concurrent-api-call]
```

e.g.,

```bash
python evaluate.py --model gpt4o --parallel 2
```
The evaluation will be saved to `evaluations/{model}/{dataset}.json`.
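Conceptually, the grading step sends each open-style answer, together with the gold answer, to GPT-4 and asks for a verdict. The sketch below only illustrates that loop; it is not the actual prompt or logic in evaluate.py, and the judge model name and prompt wording are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade(question: str, gold_answer: str, os_answer: str) -> str:
    """Ask GPT-4 whether an open-style answer matches the gold answer."""
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold_answer}\n"
        f"Model answer: {os_answer}\n"
        "Reply with exactly one word: Correct or Incorrect."
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

# Usage, with the example record shown earlier:
verdict = grade(
    "What is the main function of photosynthetic cells within a plant?",
    "to convert energy from sunlight into food energy",
    "The main function of photosynthetic cells is to convert sunlight into chemical energy.",
)
print(verdict)  # "Correct" or "Incorrect"
```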
We are accepting PRs for new models and will update the leaderboard as they are added. Please follow the steps in "Evaluate a model on OSQ-bench" to run inference and produce the model's outputs on the benchmark. You can either evaluate the model with GPT-4 yourself or submit the outputs via the following link.
Our leaderboards are based on OSQ-bench. We maintain two leaderboards, one for large-scale and one for small-scale models, depending on the scale of the LLMs.
Large-Scale LLMs leaderboard:
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 |
GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 |
Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 |
Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 |
GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 |
Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 |
Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 |
Small-Scale LLMs leaderboard:
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | Race | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
OLMo (1.2B) | 8.85 | 8.54 | 13.18 | 6.16 | 8.05 | 13.10 | 13.61 | 2.07 | 6.11 |
Pythia (1.4B) | 8.79 | 9.66 | 14.69 | 11.52 | 4.17 | 9.01 | 12.76 | 3.19 | 5.30 |
TinyLlama (1.1B) | 8.45 | 8.94 | 13.31 | 12.23 | 3.59 | 6.06 | 16.70 | 2.07 | 4.68 |
OPT (1.3B) | 7.89 | 7.40 | 11.83 | 12.47 | 4.48 | 7.61 | 13.61 | 1.25 | 4.48 |
GPT-Neo (1.3B) | 7.42 | 6.94 | 9.69 | 10.81 | 4.31 | 6.34 | 13.75 | 2.63 | 4.89 |
Cerebras-GPT (1.3B) | 4.86 | 5.37 | 4.43 | 9.31 | 2.16 | 6.20 | 6.90 | 1.04 | 3.46 |
```bibtex
@article{myrzakhan2024openllmleaderboard,
  title={Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena},
  author={Aidar Myrzakhan and Sondos Mahmoud Bsharat and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2406.07545},
  year={2024},
}
```
We extend our deepest gratitude to the authors and contributors of the following datasets: MMLU, ARC, WinoGrande, PIQA, CommonsenseQA, Race, MedMCQA, OpenbookQA, and Hellaswag.