You can refer to the structure of the `src/eval_suite/benchs/exampleqa` folder, which serves as a minimal benchmark example. Additionally, you might want to check the `src/eval_suite/benchs/base_dataset.py` and `src/eval_suite/benchs/base_evaluator.py` files, as they provide the base classes for benchmarks.
- **Creating a Benchmark Folder**
  - Create a new folder under the `benchs` directory (an illustrative layout follows this list).
  - The folder should contain the dataset, evaluator, and any other necessary files.
  - The folder should include a `README.md` file to provide an overview of the benchmark and any specific instructions for running it.
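For orientation, a new benchmark folder might look like the layout below. The name `myqa` and the exact file set are hypothetical; they simply follow the conventions described in the remaining steps, and `src/eval_suite/benchs/exampleqa` is the authoritative reference.

```
src/eval_suite/benchs/myqa/
├── dataset.py       # dataset loading logic (see "Dataset" below)
├── eval_myqa.py     # evaluator, named eval_{benchmark_name}.py
└── README.md        # benchmark-specific documentation
```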
- **Dataset**
  - Ensure the benchmark folder includes a `dataset.py` file, which contains the dataset loading logic.
  - Implement a subclass that inherits from `BaseDataset`.
  - The subclass must implement the `load` method, which returns a `list[dict]`, where each element is an evaluation data sample and must contain a unique `id` field (see the sketch after this list).
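A minimal `dataset.py` sketch is shown below. The class name, the sample fields other than `id`, and the import path of `BaseDataset` are assumptions for illustration; check `base_dataset.py` and the `exampleqa` benchmark for the real interface.

```python
# dataset.py: minimal sketch; the import path and sample fields are assumptions.
from eval_suite.benchs.base_dataset import BaseDataset


class MyQADataset(BaseDataset):
    """Hypothetical question-answering dataset with hard-coded samples."""

    def load(self) -> list[dict]:
        # Each element is one evaluation sample and must carry a unique "id".
        return [
            {"id": "myqa-0001", "question": "What is 2 + 2?", "answer": "4"},
            {"id": "myqa-0002", "question": "What is the capital of France?", "answer": "Paris"},
        ]
```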
- **Evaluator**
  - The folder must include an evaluator script, typically named `eval_{benchmark_name}.py`, which implements the evaluation logic.
  - This script should define a subclass of `BaseEvaluator` that contains the benchmark's evaluation logic.
  - The subclass should implement `set_generation_configs` to determine the default token generation settings for the LLM during evaluation.
  - Implement `load_batched_dataset` to load the batched dataset from `dataset.py`.
  - Implement `scoring` to evaluate one data item from the dataset, returning the evaluation result in a dictionary format.
  - Implement `compute_overall` to aggregate the evaluation results into an overall assessment, returning a dictionary with the final evaluation results (see the sketch after this list).
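The sketch below illustrates these four methods. The method signatures, the `self.model` attribute, the batching logic, and the import paths are assumptions made for illustration; `base_evaluator.py` and the `exampleqa` evaluator define the actual interface.

```python
# eval_myqa.py: illustrative sketch; signatures and the self.model attribute
# are assumptions, not the framework's guaranteed API.
from eval_suite.benchs.base_evaluator import BaseEvaluator

from .dataset import MyQADataset


class MyQAEvaluator(BaseEvaluator):
    def set_generation_configs(self) -> None:
        # Default token-generation settings used when querying the LLM.
        self.generation_configs = {"max_tokens": 32, "temperature": 0.0}

    def load_batched_dataset(self, batch_size: int = 8) -> list[list[dict]]:
        # Load the samples from dataset.py and group them into batches.
        samples = MyQADataset().load()
        return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

    def scoring(self, data_point: dict) -> dict:
        # Evaluate a single sample; self.model.generate is a hypothetical call.
        prediction = self.model.generate(data_point["question"])
        return {"id": data_point["id"], "correct": prediction.strip() == data_point["answer"]}

    def compute_overall(self, results: list[dict]) -> dict:
        # Aggregate per-item results into the final benchmark metrics.
        return {
            "accuracy": sum(r["correct"] for r in results) / len(results),
            "num_samples": len(results),
        }
```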
- **Registering the Benchmark**
  - Add the benchmark to the `__init__.py` file under the `benchs` directory to ensure it is discoverable by the framework.
  - Import the benchmark class in the `__init__.py` file and add it to the `__all__` list (see the snippet after this list).
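For example, the addition to `benchs/__init__.py` might look roughly like the snippet below; the class and module names are the hypothetical ones from the earlier sketches, and the existing entries are only indicated by a comment.

```python
# benchs/__init__.py: illustrative registration of the hypothetical MyQA benchmark.
from .myqa.eval_myqa import MyQAEvaluator

__all__ = [
    # ... existing benchmark classes ...
    "MyQAEvaluator",
]
```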
- **Documentation**
  - Create a `README.md` file in the benchmark folder.
  - Include any specific instructions or requirements for running the benchmark.
  - Add one line to the `README.md` file in the root directory, under the `## Eval Suite` section, to introduce the new benchmark.
You can refer to the `src/eval_suite/llms/huggingface.py` and `src/eval_suite/llms/openai_api.py` files as examples for loading LLMs.
- **Language Model Loader**
  - Create a new file under the `llms` directory.
  - The file should contain the logic for loading the LLM from a specific source (e.g., Hugging Face, OpenAI API).
- **Implementation Steps**
  - Implement a subclass that inherits from `BaseLLM`.
  - The subclass must implement `update_generation_configs` to handle parameter conversions, as different LLM loaders may have varying parameter names (e.g., `max_tokens` vs. `max_new_tokens`).
  - The subclass must implement `_request`, which accepts a `str` as input and returns a `str` as the generated output (see the sketch after this list).
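A possible loader sketch is shown below. The module name, class name, the location of `BaseLLM`, and the method signatures are guesses for illustration; `huggingface.py` and `openai_api.py` show the real pattern.

```python
# my_backend.py: sketch of a new LLM loader; module, class, and client details
# are hypothetical placeholders.
from eval_suite.llms.base_llm import BaseLLM  # assumed location of the base class


class MyBackendLLM(BaseLLM):
    def update_generation_configs(self, configs: dict) -> dict:
        # Translate the suite's generic parameter names into this backend's names,
        # e.g. "max_tokens" -> "max_new_tokens".
        configs = dict(configs)
        if "max_tokens" in configs:
            configs["max_new_tokens"] = configs.pop("max_tokens")
        return configs

    def _request(self, prompt: str) -> str:
        # Send the prompt to the backend and return the generated text.
        # A real implementation would call the backend's client here; this is a stub.
        return f"(stubbed response to: {prompt!r})"
```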
- **Registering the LLM Loader**
  - Register the new LLM loader in the `__init__.py` file under the `llms` directory.
  - Import the new LLM loader in the `__init__.py` file and add it to the `__all__` list (see the snippet after this list).
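As with the benchmark registration, the change to `llms/__init__.py` would look roughly like this (the loader name is the hypothetical one from the sketch above):

```python
# llms/__init__.py: illustrative registration of the new loader.
from .my_backend import MyBackendLLM

__all__ = [
    # ... existing loaders from huggingface.py and openai_api.py ...
    "MyBackendLLM",
]
```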