Update llama examples (#90)
IlyasMoutawwakil authored Nov 30, 2023
1 parent 612fcbe commit f75d9f9
Showing 119 changed files with 744 additions and 4,298 deletions.
4 changes: 1 addition & 3 deletions .gitignore

````diff
@@ -168,6 +168,4 @@ data/
 version.txt
 
 actions-runner/
-experiments/
-examples/
-results/
+experiments/
````
49 changes: 27 additions & 22 deletions examples/running-llamas/README.md

````diff
@@ -1,61 +1,66 @@
-# Optimum-Benchmark x LLaMAs x GPTQ
+# Optimum-Benchmark x LLaMA
 
 A set of benchmarks on Meta's LLaMA2's inference.
 
 ## Setup
 
-You will need to install these quantization packages:
+You will need to install any necessary third-party libraries like `deepspeed` or `auto-gptq` depending on the hardware and benchmarks you want to run.
 
-```bash
-pip install auto-gptq # or install it from source
-```
+For example running FlashAttentionV2 on two devices with Tensor Parallelism (i.e. `fp16+fa2+tp=2`) will require: `deepspeed` and `flash-attn`
 
 ## Running
 
-Then run these commands from this directory:
+Then run the benchmarks from this directory with:
 
 ```bash
-optimum-benchmark --config-dir configs/ --config-name _base_ --multirun
-optimum-benchmark --config-dir configs/ --config-name gptq --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16 --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16+fa2+tp=2 --multirun
+[...]
 ```
 
-This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 16 and an input `sequence_length` (prompt size) of 256.
+This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 128 and an input `sequence_length` (prompt size) of 256.
 
 ## Reporting
 
-To create a report run:
+To create a report for 7B models on A100-80GB, run:
 
 ```bash
-python report.py -e experiments -m allocated
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-7b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-7B-GPTQ/ -r artifacts/Llama-7b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-13b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-13B-GPTQ/ -r artifacts/Llama-13b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-65b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-65B-GPTQ/ -r artifacts/Llama-65b/
 ```
 
-Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, some plots and a `rich_table.svg`.
-
-`-e` is the experiments folder from which to read the results.
-`-r` is the report folder to which to write the resulting artifacts.
-`-m` is the memory type to use for the reporting. It can be `used`, `allocated` or `reserved`.
+Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, and some interesting analysis plots.
 
 
 ## Results
 
-### On A100-80GB
+### LLaMA-7B on A100-80GB
 
 <p align="center">
-<img src="artifacts/A100-80GB/forward_latency_plot.png" alt="latency_plot" width="60%"/>
+<img src="artifacts/Llama-7b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_throughput_plot.png" alt="throughput_plot" width="60%"/>
+<img src="artifacts/Llama-7b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-13B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/forward_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-65B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/rich_table.svg" alt="rich_table" width="90%"/>
+<img src="artifacts/Llama-65b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
+
+<p align="center">
+<img src="artifacts/Llama-65b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
+</p>
````
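The run-then-report workflow in the updated README above can be sketched as a small wrapper script. Only the config names (`fp16`, `fp16+fa2+tp=2`) and the `optimum-benchmark`/`report.py` invocations come from the diff; the dry-run mechanism, the model-size loop, and the simplified single `-e` path per report are hypothetical illustration, not part of the commit (the README passes the GPTQ experiment folders as additional `-e` arguments).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the workflow in the updated README.
# RUN="echo" keeps this a dry run that only prints each command;
# set RUN="" once optimum-benchmark and its extras (deepspeed,
# flash-attn, auto-gptq, ...) are installed.
set -euo pipefail

RUN="echo"
CONFIGS=("fp16" "fp16+fa2+tp=2")   # config names from the diff above
MODELS=("7b" "13b" "65b")

# Build one --multirun sweep per config (each sweeps batch_size 1..128).
CMDS=()
for cfg in "${CONFIGS[@]}"; do
  CMDS+=("optimum-benchmark --config-dir configs/ --config-name ${cfg} --multirun")
done

for cmd in "${CMDS[@]}"; do
  ${RUN} ${cmd}
done

# One report per model size (GPTQ experiment paths omitted for brevity).
for size in "${MODELS[@]}"; do
  ${RUN} python report.py \
    -e "experiments/hf-dgx-01/NousResearch/Llama-2-${size}-hf/" \
    -r "artifacts/Llama-${size}/"
done
```

With `RUN="echo"` the script prints the five commands it would run, which makes it easy to inspect the sweep before committing a GPU node to it.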
11 changes: 0 additions & 11 deletions examples/running-llamas/artifacts/A100-80GB/full_report.csv

This file was deleted.
