Commit

add instructions about getting mmlu score for instruct models (pytorch#6175)

Summary:
Pull Request resolved: pytorch#6175

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: mergennachin

Differential Revision: D64256005

Pulled By: helunwencser

fbshipit-source-id: b799d311cde065bbbf94f389c1c407c3b59b1da2
helunwencser authored and facebook-github-bot committed Oct 12, 2024
1 parent 5512fe0 commit 1f2b9aa
Showing 1 changed file with 24 additions and 4 deletions: examples/models/llama2/README.md
@@ -49,7 +49,7 @@ We employed 4-bit groupwise per token dynamic quantization of all the linear lay

We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Please note that LM Eval reports perplexity normalized by word count instead of token count, so you may see a different WikiText perplexity reported by other sources if they implement the normalization differently. More details can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).
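
As a sketch of the distinction (an illustration, not the exact bookkeeping LM Eval performs): if $\mathrm{NLL}$ is the total negative log-likelihood accumulated over the evaluation set, then

$$\mathrm{PPL}_{\mathrm{token}} = \exp\left(\frac{\mathrm{NLL}}{N_{\mathrm{tokens}}}\right), \qquad \mathrm{PPL}_{\mathrm{word}} = \exp\left(\frac{\mathrm{NLL}}{N_{\mathrm{words}}}\right),$$

and since a tokenizer typically emits more tokens than there are whitespace-delimited words, the word-normalized perplexity comes out larger for the same model.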

-Below are the results for two different groupsizes, with max_seq_len 2048, and 1000 samples.
+Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.

|Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
|--------|-----------------| ---------------------- | ---------------
@@ -280,12 +280,32 @@ tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model

> Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
-Using the same arguments from above
+We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.

+For base models, use the following example command to calculate perplexity on WikiText.
```
-python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model/bin> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
+python -m examples.models.llama2.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model/bin> \
+  -kv \
+  -d <checkpoint dtype> \
+  --max_seq_len <max sequence length> \
+  --limit <number of samples>
```
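
For example, the WikiText numbers quoted earlier were collected with `max_seq_len` 2048 and `limit` 1000. A filled-in invocation might look like the following sketch, where the checkpoint, params, and tokenizer paths are placeholders and `fp32` matches the FP32 baseline:
```
python -m examples.models.llama2.eval_llama \
  -c /path/to/checkpoint.pth \
  -p /path/to/params.json \
  -t /path/to/tokenizer.model \
  -kv \
  -d fp32 \
  --max_seq_len 2048 \
  --limit 1000
```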

-The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
+For instruct models, use the following example command to calculate the MMLU score.
```
python -m examples.models.llama2.eval_llama \
-c <checkpoint.pth> \
-p <params.json> \
-t <tokenizer.model/bin> \
-kv \
-d <checkpoint dtype> \
--tasks mmlu \
--num_fewshot 5 \
--max_seq_len <max sequence length>
```
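
As a concrete sketch (placeholder paths; `--max_seq_len 2048` is just an example value), a 5-shot MMLU run, which is the few-shot setting commonly reported for Llama instruct models, could look like:
```
python -m examples.models.llama2.eval_llama \
  -c /path/to/instruct/checkpoint.pth \
  -p /path/to/instruct/params.json \
  -t /path/to/instruct/tokenizer.model \
  -kv \
  -d fp32 \
  --tasks mmlu \
  --num_fewshot 5 \
  --max_seq_len 2048
```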

## Step 4: Run on your computer to validate

