```shell
./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype
```
Note: The input sequence length, output sequence length, and tensor parallelism (TP) are already configured in the script, so you don't need to specify them.
Note: If you encounter the following error, pass your access-authorized Hugging Face token so the script can download the gated models.

```
OSError: You are trying to access a gated repo.
```

```shell
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
```
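Before launching a long benchmark run, you can optionally confirm that the token is recognized. This is a minimal sanity check, assuming the huggingface_hub CLI is installed in your environment:

```shell
# Prints the account associated with the exported token
# (assumes the huggingface_hub CLI is available).
huggingface-cli whoami
```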
Name | Options | Description |
---|---|---|
$test_option | latency | Measure decoding token latency |
 | throughput | Measure token generation throughput |
 | all | Measure both throughput and latency |
$model_repo (float16) | meta-llama/Meta-Llama-3.1-8B-Instruct | Llama 3.1 8B |
 | meta-llama/Meta-Llama-3.1-70B-Instruct | Llama 3.1 70B |
 | meta-llama/Meta-Llama-3.1-405B-Instruct | Llama 3.1 405B |
 | meta-llama/Llama-2-7b-chat-hf | Llama 2 7B |
 | meta-llama/Llama-2-70b-chat-hf | Llama 2 70B |
 | mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B |
 | mistralai/Mixtral-8x22B-Instruct-v0.1 | Mixtral 8x22B |
 | mistralai/Mistral-7B-Instruct-v0.3 | Mistral 7B |
 | Qwen/Qwen2-7B-Instruct | Qwen2 7B |
 | Qwen/Qwen2-72B-Instruct | Qwen2 72B |
 | core42/jais-13b-chat | JAIS 13B |
 | core42/jais-30b-chat-v3 | JAIS 30B |
$model_repo (float8) | amd/Meta-Llama-3.1-8B-Instruct-FP8-KV | Llama 3.1 8B |
 | amd/Meta-Llama-3.1-70B-Instruct-FP8-KV | Llama 3.1 70B |
 | amd/Meta-Llama-3.1-405B-Instruct-FP8-KV | Llama 3.1 405B |
 | amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV | Mixtral 8x7B |
 | amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV | Mixtral 8x22B |
$num_gpu | 1 or 8 | Number of GPUs |
$datatype | float16, float8 | Data type |
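To sweep several configurations in one run, you can wrap the script in a small shell loop. This is an illustrative sketch, not part of the script itself; the model list and GPU count below are example choices:

```shell
# Illustrative sweep over two float16 models on a single GPU;
# adjust the repo list, -g, and -d to match your setup.
for repo in meta-llama/Meta-Llama-3.1-8B-Instruct meta-llama/Llama-2-7b-chat-hf; do
    ./vllm_benchmark_report.sh -s all -m "$repo" -g 1 -d float16
done
```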
Here are some examples and where to find the resulting reports:
- Benchmark example - latency

  Use these commands to benchmark the latency of the Llama 3.1 8B model on one GPU with the float16 and float8 data types.

  ```shell
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
  ./vllm_benchmark_report.sh -s latency -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
  ```

  You can find the latency reports at ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv (float16) and ./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv (float8).
- Benchmark example - throughput

  Use these commands to benchmark the throughput of the Llama 3.1 8B model on one GPU with the float16 and float8 data types.

  ```shell
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
  ./vllm_benchmark_report.sh -s throughput -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
  ```

  You can find the throughput reports at ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv (float16) and ./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv (float8).
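The summary reports are plain CSV files, so standard shell tools suffice for a quick look. A minimal sketch, using the float16 report path from the example above:

```shell
# Pretty-print the CSV summary in aligned columns for quick inspection.
column -s, -t < ./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv
```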
Throughput is calculated as:

- throughput_tot = requests * (input lengths + output lengths) / elapsed_time
- throughput_gen = requests * output lengths / elapsed_time
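For example, with hypothetical numbers (32 requests, 128-token inputs, 128-token outputs, 10 seconds elapsed), the two metrics work out as follows; these figures are illustrative, not measured results:

```shell
# Worked example of the two throughput formulas with made-up numbers.
awk 'BEGIN {
    requests = 32; input_len = 128; output_len = 128; elapsed = 10
    printf "throughput_tot: %.1f tokens/s\n", requests * (input_len + output_len) / elapsed
    printf "throughput_gen: %.1f tokens/s\n", requests * output_len / elapsed
}'
# throughput_tot: 819.2 tokens/s
# throughput_gen: 409.6 tokens/s
```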