docs: update README with more contextual eval info
Signed-off-by: Nathan Weinberg <[email protected]>
nathan-weinberg committed Sep 18, 2024
1 parent 83f9d95 commit 8967a20
Showing 2 changed files with 73 additions and 1 deletion.
5 changes: 5 additions & 0 deletions .spellcheck-en-custom.txt
@@ -4,15 +4,20 @@
# SPDX-License-Identifier: Apache-2.0
Backport
backported
benchmarking
codebase
dr
eval
gpt
hoc
instructlab
jsonl
justfile
MMLU
openai
SDG
Tatsu
tl
TODO
venv
vllm
69 changes: 68 additions & 1 deletion README.md
@@ -7,7 +7,74 @@

Python Library for Evaluation

## MT-Bench / MT-Bench-Branch Testing Steps
## What is Evaluation?

Evaluation is the step that allows us to gauge how a given model performs on a set of specific tasks, by running known and standardized benchmark tests against
the model. Running this step gives us numerical scores across these benchmarks, as well as logged excerpts/samples of the outputs the model produced during the
benchmarks. Using a combination of these artifacts as reference, along with manual smoke screening of the model, gives us the best idea of whether or not a model
has learned and improved on something we are trying to teach it. There are two stages in the InstructLab process where we perform model evaluation:

#### Inter-checkpoint Evaluation

README.md:17 MD001/heading-increment Heading levels should only increment by one level at a time [Expected: h3; Actual: h4] https://github.com/DavidAnson/markdownlint/blob/v0.35.0/doc/md001.md

This step occurs during multi-phase training. Each phase of training produces a number of different “checkpoints” of the model that are taken at various stages during
the phase. At the end of each phase, we need to evaluate all the checkpoints in order to find the one that provides the best results. This is done as part of the
[InstructLab Training](https://github.com/instructlab/training) library.
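
As a rough illustration of the selection step only (not the InstructLab Training implementation), the sketch below assumes each checkpoint has already been scored on some benchmark. The function name, checkpoint names, and scores are hypothetical.

```python
def pick_best_checkpoint(scores: dict[str, float]) -> str:
    """Return the name of the checkpoint with the highest score.

    `scores` maps a (hypothetical) checkpoint name to its benchmark score.
    """
    if not scores:
        raise ValueError("no checkpoints were evaluated")
    # max() over the dict keys, ranked by each key's score
    return max(scores, key=scores.get)


# Made-up scores for three checkpoints from one training phase
checkpoint_scores = {
    "samples_1000": 6.1,
    "samples_2000": 6.8,
    "samples_3000": 6.5,
}
best = pick_best_checkpoint(checkpoint_scores)
print(best)
```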

#### Full-scale final Evaluation

Once training is complete, and we have picked the best checkpoint from the output of the final phase, we can run the full-scale evaluation suite, which runs MT-Bench, MMLU,
MT-Bench Branch, and MMLU Branch.

### Methods of Evaluation

Below are more in-depth explanations of the suite of benchmarks we are using as methods for evaluation of models.

#### Multi-turn benchmark (MT-Bench)

**tl;dr** Full model evaluation of performance on **skills**

MT-Bench is a type of benchmarking that involves asking a model 80 multi-turn questions - i.e.

```text
<Question 1> → <model’s answer 1> → <Follow-up question> → <model’s answer 2>
```

and having a “judge” model review the given multi-turn question and the model’s answer, then rate the answer with a score out of 10. The scores are then averaged,
and the final score produced is the “MT-Bench score” for that model. This benchmark assumes no factual knowledge on the model’s part. The questions are static, but do
not become obsolete with time.

You can read more about MT-Bench [here](https://arxiv.org/abs/2306.05685).
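
The averaging step can be sketched as follows. This is a minimal illustration that assumes the judge's ratings are already available as numbers and that every rating is weighted equally. The record layout and the `mt_bench_score` name are hypothetical, not part of this library's API.

```python
from statistics import mean


def mt_bench_score(judgments: list[dict]) -> float:
    """Average the judge's 1-10 ratings over all judged answers.

    Each judgment is a dict with a numeric "score" key (hypothetical layout).
    """
    if not judgments:
        raise ValueError("no judgments to average")
    return mean(j["score"] for j in judgments)


# Made-up ratings for two questions, two turns each
judgments = [
    {"question_id": 1, "turn": 1, "score": 8.0},
    {"question_id": 1, "turn": 2, "score": 6.0},
    {"question_id": 2, "turn": 1, "score": 9.0},
    {"question_id": 2, "turn": 2, "score": 7.0},
]
overall = mt_bench_score(judgments)
print(overall)
```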

#### MT-Bench Branch

MT-Bench Branch is an adaptation of MT-Bench that is designed to test custom skills that are added/being added to the model via the InstructLab project. These new skills
come in the form of question/answer pairs in a Git branch of the [taxonomy](https://github.com/instructlab/taxonomy).

MT-Bench Branch has the candidate model generate answers to the user-supplied seed questions; those answers are then judged by the judge model.

#### Massive Multitask Language Understanding (MMLU)

**tl;dr** Full model evaluation of performance on **knowledge**

MMLU is a type of benchmarking that involves a series of fact-based multiple-choice questions, each with 4 options for answers. It tests whether a model is able to interpret
the questions and the answer options correctly, formulate its own answer, and pick the correct option out of the provided ones. The questions are organized as a set of 57
“tasks”, and each task has a given domain. The domains cover a number of topics ranging from Chemistry and Biology to US History and Math.

The model’s answers are then compared against the set of known correct answers for each question to determine how many the model got right. The final MMLU score is the
average of its scores across the tasks. This benchmark does not involve any reference/critic model, and is a completely objective benchmark. This benchmark does assume factual knowledge
on the model’s part. The questions are static; therefore, MMLU cannot be used to gauge the model’s knowledge of more recent topics.
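
The scoring described above can be sketched as below. This assumes the final score is an unweighted mean of per-task accuracies. The actual aggregation used by the underlying harness may differ, and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def mmlu_score(results):
    """Compute per-task accuracy and their unweighted mean.

    `results` is an iterable of (task, predicted_option, correct_option) tuples.
    """
    hits_by_task = defaultdict(list)
    for task, predicted, correct in results:
        hits_by_task[task].append(predicted == correct)
    if not hits_by_task:
        raise ValueError("no results to score")
    # Accuracy per task: fraction of questions answered correctly
    task_accuracy = {task: sum(hits) / len(hits) for task, hits in hits_by_task.items()}
    # Assumption: every task contributes equally to the final score
    overall = sum(task_accuracy.values()) / len(task_accuracy)
    return overall, task_accuracy


# Made-up answers for two tasks
results = [
    ("chemistry", "A", "A"),
    ("chemistry", "B", "C"),
    ("us_history", "D", "D"),
    ("us_history", "A", "A"),
]
overall, per_task = mmlu_score(results)
print(overall, per_task)
```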

InstructLab uses an implementation found [here](https://github.com/EleutherAI/lm-evaluation-harness) for running MMLU.

You can read more about MMLU [here](https://arxiv.org/abs/2009.03300).

#### MMLU Branch

MMLU Branch is an adaptation of MMLU that is designed to test custom knowledge that is being added to the model via a Git branch of the [taxonomy](https://github.com/instructlab/taxonomy).

To conduct MMLU Branch, we create new tasks that are specific to the topic covered in the pull request. The teacher model `Mixtral-8x-7b-instruct` is used to generate new multiple-choice questions based on the knowledge document included in the PR. A “task” is then constructed that references the newly generated questions and choices. These tasks are then used to judge the model’s grasp of the new knowledge in the same way MMLU does. Generation of these tasks is done as part of the [InstructLab SDG](https://github.com/instructlab/sdg) library.
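
To make the idea of a generated question concrete, here is a minimal sketch of how one multiple-choice item might be represented and checked. The `MultipleChoiceQuestion` class, its fields, and the prompt layout are hypothetical and are not the format produced by the SDG library.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class MultipleChoiceQuestion:
    """One generated question (hypothetical structure, for illustration only)."""

    question: str
    choices: list[str]  # exactly four options
    answer: int  # index of the correct option in `choices`

    def prompt(self) -> str:
        """Render the question and lettered options as plain text."""
        lines = [self.question]
        lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(self.choices)]
        lines.append("Answer:")
        return "\n".join(lines)

    def is_correct(self, predicted_letter: str) -> bool:
        """Check a model's letter answer against the known correct option."""
        letter = predicted_letter.strip().upper()
        if len(letter) != 1 or letter not in LETTERS:
            return False
        return LETTERS.index(letter) == self.answer


# Made-up example item
item = MultipleChoiceQuestion(
    question="Which planet is closest to the Sun?",
    choices=["Venus", "Mercury", "Earth", "Mars"],
    answer=1,
)
print(item.prompt())
```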

## MT-Bench / MT-Bench Branch Testing Steps

> **⚠️ Note:** Must use Python version 3.10 or later.