Update llama examples #90

Merged 29 commits on Nov 30, 2023.

Commits:
- `5668117` added new configs (IlyasMoutawwakil, Nov 24, 2023)
- `5c67276` update (IlyasMoutawwakil, Nov 24, 2023)
- `3f37606` fix (IlyasMoutawwakil, Nov 24, 2023)
- `038abd3` added decode latency and throughput (IlyasMoutawwakil, Nov 26, 2023)
- `387d012` fix dp config (IlyasMoutawwakil, Nov 26, 2023)
- `f412d81` fix decode metrics (IlyasMoutawwakil, Nov 26, 2023)
- `4dc5b95` added a100 results (IlyasMoutawwakil, Nov 26, 2023)
- `e194f6b` fix distributed throughput measurements (IlyasMoutawwakil, Nov 27, 2023)
- `ce60f97` fix report (IlyasMoutawwakil, Nov 27, 2023)
- `23801a1` added fp16+tp=1 for deepspeed baseline (IlyasMoutawwakil, Nov 27, 2023)
- `bf5626d` make isolation process failure more verbose (IlyasMoutawwakil, Nov 27, 2023)
- `adbc1e9` update A100 results (IlyasMoutawwakil, Nov 27, 2023)
- `8aec23b` added ddp config (IlyasMoutawwakil, Nov 27, 2023)
- `225d165` update custom reporting script (IlyasMoutawwakil, Nov 27, 2023)
- `301c608` remove tp=1 (IlyasMoutawwakil, Nov 27, 2023)
- `3ab2533` style (IlyasMoutawwakil, Nov 27, 2023)
- `cccced8` added bt and fa2 configs (IlyasMoutawwakil, Nov 27, 2023)
- `6b6513c` fix isolation process (IlyasMoutawwakil, Nov 27, 2023)
- `14a32a6` added gptq config (IlyasMoutawwakil, Nov 27, 2023)
- `e61ca6b` update configs (IlyasMoutawwakil, Nov 28, 2023)
- `00543de` more isolation process changes (IlyasMoutawwakil, Nov 28, 2023)
- `c27c708` added an inline launcher (as default) (IlyasMoutawwakil, Nov 28, 2023)
- `f00d042` update distributed configs (IlyasMoutawwakil, Nov 28, 2023)
- `08bc26b` fix (IlyasMoutawwakil, Nov 28, 2023)
- `2584fd2` added fa2+tp=2 config and added llama 65B gptq (IlyasMoutawwakil, Nov 28, 2023)
- `6ddfdf7` disable no_weights because of 65B model error (IlyasMoutawwakil, Nov 28, 2023)
- `873f799` added all A100 GPTQ results (IlyasMoutawwakil, Nov 28, 2023)
- `c8bbe4e` added A100 reports for llama 7b, 13b and 65b (IlyasMoutawwakil, Nov 29, 2023)
- `370d7ae` remove experiments (only leave artifacts) (IlyasMoutawwakil, Nov 29, 2023)
4 changes: 1 addition & 3 deletions .gitignore
```diff
@@ -168,6 +168,4 @@ data/
 version.txt
 
 actions-runner/
-experiments/
-examples/
-results/
+experiments/
```
49 changes: 27 additions & 22 deletions examples/running-llamas/README.md
````diff
@@ -1,61 +1,66 @@
-# Optimum-Benchmark x LLaMAs x GPTQ
+# Optimum-Benchmark x LLaMA
 
 A set of benchmarks on Meta's LLaMA2's inference.
 
 ## Setup
 
-You will need to install these quantization packages:
+You will need to install any necessary third-party libraries like `deepspeed` or `auto-gptq` depending on the hardware and benchmarks you want to run.
 
-```bash
-pip install auto-gptq # or install it from source
-```
+For example running FlashAttentionV2 on two devices with Tensor Parallelism (i.e. `fp16+fa2+tp=2`) will require: `deepspeed` and `flash-attn`
 
 ## Running
 
-Then run these commands from this directory:
+Then run the benchmarks from this directory with:
 
 ```bash
-optimum-benchmark --config-dir configs/ --config-name _base_ --multirun
-optimum-benchmark --config-dir configs/ --config-name gptq --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16 --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16+fa2+tp=2 --multirun
+[...]
```
 
-This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 16 and an input `sequence_length` (prompt size) of 256.
+This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 128 and an input `sequence_length` (prompt size) of 256.
 
 ## Reporting
 
-To create a report run:
+To create a report for 7B models on A100-80GB, run:
 
 ```bash
-python report.py -e experiments -m allocated
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-7b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-7B-GPTQ/ -r artifacts/Llama-7b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-13b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-13B-GPTQ/ -r artifacts/Llama-13b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-65b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-65B-GPTQ/ -r artifacts/Llama-65b/
```
 
-Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, some plots and a `rich_table.svg`.
-
-`-e` is the experiments folder from which to read the results.
-`-r` is the report folder to which to write the resulting artifacts.
-`-m` is the memory type to use for the reporting. It can be `used`, `allocated` or `reserved`.
+Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, and some interesting analysis plots.
 
 
 ## Results
 
-### On A100-80GB
+### LLaMA-7B on A100-80GB
 
 <p align="center">
-<img src="artifacts/A100-80GB/forward_latency_plot.png" alt="latency_plot" width="60%"/>
+<img src="artifacts/Llama-7b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_throughput_plot.png" alt="throughput_plot" width="60%"/>
+<img src="artifacts/Llama-7b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-13B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/forward_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-65B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/rich_table.svg" alt="rich_table" width="90%"/>
+<img src="artifacts/Llama-65b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
+
+<p align="center">
+<img src="artifacts/Llama-65b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
+</p>
````
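For readers reproducing the `fp16+fa2+tp=2` and `gptq` runs described in the new README, a minimal dependency install might look like the sketch below. Only `deepspeed`, `flash-attn`, and `auto-gptq` are actually named in the README; the source install of `optimum-benchmark` and the `--no-build-isolation` flag are assumptions about a typical setup at the time of this PR, not part of the diff.

```bash
# Sketch of an environment for the fp16+fa2+tp=2 and gptq benchmarks.
# Package names come from the README; everything else is an assumption.
pip install git+https://github.com/huggingface/optimum-benchmark.git  # the benchmark harness
pip install deepspeed                         # tensor parallelism (tp=2) backend
pip install flash-attn --no-build-isolation   # FlashAttentionV2 kernels (needs a CUDA toolchain)
pip install auto-gptq                         # only needed for the GPTQ configs
```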
11 changes: 0 additions & 11 deletions examples/running-llamas/artifacts/A100-80GB/full_report.csv

This file was deleted.
