Commit f75d9f9 (1 parent: 612fcbe)

Showing 119 changed files with 744 additions and 4,298 deletions.
```diff
@@ -168,6 +168,4 @@ data/
 version.txt
 
 actions-runner/
-experiments/
-examples/
-results/
+experiments/
```
````diff
@@ -1,61 +1,66 @@
-# Optimum-Benchmark x LLaMAs x GPTQ
+# Optimum-Benchmark x LLaMA
 
 A set of benchmarks on Meta's LLaMA2's inference.
 
 ## Setup
 
-You will need to install these quantization packages:
+You will need to install any necessary third-party libraries like `deepspeed` or `auto-gptq` depending on the hardware and benchmarks you want to run.
 
-```bash
-pip install auto-gptq # or install it from source
-```
+For example running FlashAttentionV2 on two devices with Tensor Parallelism (i.e. `fp16+fa2+tp=2`) will require: `deepspeed` and `flash-attn`
 
 ## Running
 
-Then run these commands from this directory:
+Then run the benchmarks from this directory with:
 
 ```bash
-optimum-benchmark --config-dir configs/ --config-name _base_ --multirun
-optimum-benchmark --config-dir configs/ --config-name gptq --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16 --multirun
+optimum-benchmark --config-dir configs/ --config-name fp16+fa2+tp=2 --multirun
+[...]
 ```
 
-This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 16 and an input `sequence_length` (prompt size) of 256.
+This will create a folder called `experiments` with the results of the benchmarks with an inference `batch_size` ranging from 1 to 128 and an input `sequence_length` (prompt size) of 256.
 
 ## Reporting
 
-To create a report run:
+To create a report for 7B models on A100-80GB, run:
 
 ```bash
-python report.py -e experiments -m allocated
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-7b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-7B-GPTQ/ -r artifacts/Llama-7b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-13b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-13B-GPTQ/ -r artifacts/Llama-13b/
+python report.py -e experiments/hf-dgx-01/NousResearch/Llama-2-65b-hf/ experiments/hf-dgx-01/TheBloke/LLaMa-65B-GPTQ/ -r artifacts/Llama-65b/
 ```
 
-Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, some plots and a `rich_table.svg`.
-
 `-e` is the experiments folder from which to read the results.
-`-m` is the memory type to use for the reporting. It can be `used`, `allocated` or `reserved`.
+`-r` is the report folder to which to write the resulting artifacts.
+Which will create some quick reporting artifacts like a `full_report.csv`, `short_report.csv`, and some interesting analysis plots.
 
 
 ## Results
 
-### On A100-80GB
+### LLaMA-7B on A100-80GB
 
 <p align="center">
-<img src="artifacts/A100-80GB/forward_latency_plot.png" alt="latency_plot" width="60%"/>
+<img src="artifacts/Llama-7b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_throughput_plot.png" alt="throughput_plot" width="60%"/>
+<img src="artifacts/Llama-7b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-13B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/forward_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
 
 <p align="center">
-<img src="artifacts/A100-80GB/generate_memory_plot.png" alt="memory_plot" width="60%"/>
+<img src="artifacts/Llama-13b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
 </p>
 
+### LLaMA-65B on A100-80GB
+
 <p align="center">
-<img src="artifacts/A100-80GB/rich_table.svg" alt="rich_table" width="90%"/>
+<img src="artifacts/Llama-65b/decode_throughput_bar_plot.png" alt="throughput_plot" width="60%"/>
 </p>
+
+<p align="center">
+<img src="artifacts/Llama-65b/prefill_latency_bar_plot.png" alt="latency_plot" width="60%"/>
+</p>
````
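The updated Setup section names the optional backends but does not spell out install commands. Below is a minimal install sketch, assuming the standard PyPI packages; version pins and CUDA compatibility are left to the reader and are not part of this commit:

```bash
# Sketch only: optional backends mentioned in the updated README.
# Install whichever ones the benchmarks you run actually need.
pip install deepspeed                          # tensor parallelism (tp=2) runs
pip install flash-attn --no-build-isolation    # FlashAttentionV2 (fa2) runs
pip install auto-gptq                          # GPTQ quantization benchmarks
```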
- Binary file removed (-38.7 KB, not shown): `examples/running-llamas/artifacts/A100-80GB/forward_latency_plot.png`
- Binary file removed (-36.2 KB, not shown): `examples/running-llamas/artifacts/A100-80GB/forward_memory_plot.png`
- File deleted (11 changes: 0 additions & 11 deletions): `examples/running-llamas/artifacts/A100-80GB/full_report.csv`
- Binary file removed (-40.7 KB, not shown): `examples/running-llamas/artifacts/A100-80GB/generate_memory_plot.png`
- Binary file removed (-36.7 KB, not shown): `examples/running-llamas/artifacts/A100-80GB/generate_throughput_plot.png`
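The deleted A100-80GB artifacts listed above are superseded by the per-model reports under `artifacts/Llama-*` that the updated README references. A quick shell sketch for previewing those regenerated reports once `report.py` has been run; the paths follow the new `-r` convention, and the preview commands are illustrative rather than taken from the repository:

```bash
# Sketch: list the report artifacts and preview the short report as a table.
ls artifacts/Llama-7b/
column -s, -t < artifacts/Llama-7b/short_report.csv | head -n 10
```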