This project provides a unified framework to test language models (GPT-2, GPT-3, GPT-Neo, etc.) and seq2seq models (T5, T0) via prompt evaluation.
As of now, all the prompts are provided via the `promptsource` eval-hackathon branch; all datasets are from HuggingFace `datasets`.
This fork is not backwards compatible with the original evaluation harness.
To install:

```bash
git clone https://github.com/bigscience-workshop/lm-evaluation-harness
cd lm-evaluation-harness
pip install "promptsource @ git+https://github.com/bigscience-workshop/promptsource@eval-hackathon"
pip install -e ".[dev]"
```
To evaluate a model (e.g. GPT-2) on NLP tasks such as SuperGLUE, you can run the following command:

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=gpt2 \
    --tasks wic,copa
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. For larger models supported by HuggingFace `transformers`, we provide parallelism and mixed-precision utilities through the `accelerate` package. These can be activated for `hf-causal`/`hf-seq2seq` by passing `use_accelerate=True` (parallelism) and `dtype=half` (mixed precision) to the `--model_args` flag. For finer-grained control over `accelerate` options, see the constructor docstrings for `HuggingFaceAutoLM` in `huggingface.py`.
```bash
python main.py \
    --model hf-causal \
    --model_args use_accelerate=True,pretrained=facebook/opt-13b \
    --tasks wnli
```
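Both utilities can be combined in one invocation. The sketch below is illustrative, passing `use_accelerate=True` together with `dtype=half`; verify the exact set of accepted values against the `HuggingFaceAutoLM` constructor docstrings:

```bash
# Illustrative: accelerate-based parallelism plus half-precision weights in
# one run. Check the HuggingFaceAutoLM docstrings for the accepted values.
python main.py \
    --model hf-causal \
    --model_args use_accelerate=True,dtype=half,pretrained=facebook/opt-13b \
    --tasks wnli
```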
If you have access to the OpenAI API, you can also evaluate GPT-3 engines:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model openai \
    --model_args engine=davinci \
    --tasks hans
```
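The harness can also be driven from Python. The upstream harness exposes `lm_eval.evaluator.simple_evaluate`; the minimal sketch below assumes this fork keeps that entry point and return shape, so check `lm_eval/evaluator.py` before relying on it:

```python
# Minimal sketch, assuming this fork retains the upstream
# lm_eval.evaluator.simple_evaluate entry point; verify the exact
# signature in lm_eval/evaluator.py.
import json

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=gpt2",
    tasks=["wic", "copa"],
)

# In the upstream harness the returned dict carries per-task scores under
# "results" and task versions under "versions".
print(json.dumps(results["results"], indent=2))
```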
When reporting results from the eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while ensuring that previously reported scores remain reproducible. See the Task Versioning section for more info.
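For example, assuming a run was saved to JSON (the upstream harness offers an `--output_path` flag for this; confirm it in this fork's `main.py`), the versions can be recovered for reporting:

```python
# Sketch for extracting task versions from a saved results file. Assumes the
# run was written with an --output_path flag as in the upstream harness.
import json

with open("results.json") as f:
    results = json.load(f)

# Report these version numbers alongside your scores.
for task, version in results["versions"].items():
    print(f"{task}: v{version}")
```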
Features:
- Growing number of tasks integrated with `promptsource` (20+).
- Support for HuggingFace causal language models, HuggingFace seq2seq models, and the OpenAI completions API (GPT-3), with a flexible, tokenization-agnostic interface.
- Task versioning to ensure reproducibility.
To implement a new task in the eval harness, follow the `PromptSourceTask` template.
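As a rough illustration, a new task might look like the sketch below. The import path, attributes, and methods mirror the upstream `Task` pattern; every name here (`lm_eval.api.task`, the example dataset) is an assumption, so defer to the `PromptSourceTask` template for the exact contract:

```python
# Hypothetical sketch of a new PromptSourceTask subclass. Import path,
# attribute names, and method set follow the upstream Task pattern;
# verify each against the actual PromptSourceTask template.
from lm_eval.api.task import PromptSourceTask


class MyNewTask(PromptSourceTask):
    VERSION = 0
    DATASET_PATH = "super_glue"  # HuggingFace datasets path (assumed example)
    DATASET_NAME = "wic"         # dataset config name, if any

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["validation"]
```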
Cite as:

```bibtex
@software{eval-harness,
  author    = {Gao, Leo and
               Tow, Jonathan and
               Biderman, Stella and
               Black, Sid and
               DiPofi, Anthony and
               Foster, Charles and
               Golding, Laurence and
               Hsu, Jeffrey and
               McDonell, Kyle and
               Muennighoff, Niklas and
               Phang, Jason and
               Reynolds, Laria and
               Tang, Eric and
               Thite, Anish and
               Wang, Ben and
               Wang, Kevin and
               Zou, Andy},
  title     = {A framework for few-shot language model evaluation},
  month     = sep,
  year      = 2021,
  publisher = {Zenodo},
  version   = {v0.0.1},
  doi       = {10.5281/zenodo.5371628},
  url       = {https://doi.org/10.5281/zenodo.5371628}
}
```