[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB #10062
Comments
This is a very detailed and excellent description.
The latest version; see the container image in the YAML. Do you recommend a specific version to test with?
There are some similar issues; see #9496 and #9452. The main cause there was enable_eager=true, but I can't find this argument in your script. BTW, if I remember correctly, some versions had a CUDA graph bug, so I suggest you run profiling and confirm that eager mode is actually disabled; you can refer to https://docs.vllm.ai/en/latest/dev/profiling/profiling_index.html#openai-server. PS: I will try to reproduce your results tomorrow in my local timezone, and #5036 provides some test results.
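For reference, a minimal way to capture such a trace against a running server (a sketch, assuming a recent vLLM build where launching the server with VLLM_TORCH_PROFILER_DIR set exposes /start_profile and /stop_profile endpoints; the localhost URL and port are assumptions):

```python
# Sketch: trigger the torch profiler on a running vLLM OpenAI-compatible server.
# Assumes the server was started with VLLM_TORCH_PROFILER_DIR set, which on
# recent vLLM versions exposes /start_profile and /stop_profile.
import requests

BASE_URL = "http://localhost:8000"  # assumption: local server on the default port

requests.post(f"{BASE_URL}/start_profile")
# ... send benchmark traffic here, e.g. run benchmark_serving.py ...
requests.post(f"{BASE_URL}/stop_profile")
# Traces are written to the directory given by VLLM_TORCH_PROFILER_DIR.
```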
Thanks @jeejeelee, it will be great if you can reproduce!
We conducted testing on a local A800 (A800-SXM4-80GB).
vllm serve meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.90 --served-model-name base --enable-lora --max-loras 3 --max-cpu-loras 15 --max-lora-rank 64 --lora-modules moss=xtuner/Llama-2-7b-qlora-moss-003-sft
python benchmark_serving.py --model base --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset-name random --random-input-len 512 --random-output-len 128 --ignore-eos --num-prompts 24 --metric-percentiles 90 --request-rate 20 (request rate swept from 1 to 24)
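To sweep the request rate as described above, a small driver loop can rerun benchmark_serving.py (a sketch; the script path and the 1-24 range are assumptions taken from the command above):

```python
# Sketch: repeat the benchmark above for request rates 1..24.
import subprocess

for rate in range(1, 25):
    subprocess.run(
        [
            "python", "benchmark_serving.py",
            "--model", "base",
            "--tokenizer", "meta-llama/Llama-2-7b-chat-hf",
            "--dataset-name", "random",
            "--random-input-len", "512",
            "--random-output-len", "128",
            "--ignore-eos",
            "--num-prompts", "24",
            "--metric-percentiles", "90",
            "--request-rate", str(rate),
        ],
        check=True,
    )
```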
Thanks @jeejeelee, this is very insightful. Did you try the smaller adapter?
Not yet, I will tomorrow.
Same question, any solution yet?
I will also help reproduce this issue from my end this week. What I observe is around 20%-25% overhead, which is expected. It seems we need to standardize the LoRA workloads and benchmarks to better help users reproduce the results.
Throughput/latency vs KV cache utilization:
I did some new benchmarks and noticed that max LoRA rank has a significant impact on performance; it's best to set it equal to the rank of the LoRA (or the rank of the largest-ranked LoRA if using multiple LoRAs). This is consistent with what is documented here. With rank = 16, the throughput hit is about 27% at 80% KV cache utilization. (tp-2 indicates tensor parallelism = 2, i.e. 2 GPUs were used.)
I also enabled the vLLM profiler to get a more granular understanding of where the performance hit is coming from.
Performance analysis:
vLLM's profiler provides slice flamegraphs. Tweet Summary (max rank 64) running online with 96 prompts revealed cudaMemcpyAsync as a major latency contributor, accounting for 47% of the total 35 seconds. The base model's slice flamegraph showed cudaMemcpyAsync using 40% of the 27.96 seconds (96 prompts). The 8-second difference between base and LoRA models (same number of prompts) was therefore largely attributable to this additional cudaMemcpyAsync time.
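To illustrate the rank setting, here is a minimal offline sketch assuming vLLM's Python LLM/LoRARequest API; the local adapter path and prompt are placeholders, and rank 16 corresponds to the rank used in the benchmark above:

```python
# Sketch: size max_lora_rank to the actual adapter rank instead of a larger default.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=1,
    max_lora_rank=16,  # match the adapter's rank (16 here) rather than a larger value like 64
)

outputs = llm.generate(
    ["Summarize the following conversation: ..."],  # placeholder prompt
    SamplingParams(max_tokens=128),
    # placeholder local path to a downloaded copy of
    # vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
    lora_request=LoRARequest("tweetsumm", 1, "/path/to/qlora-adapter-Llama-2-7b-hf-TweetSumm"),
)
print(outputs[0].outputs[0].text)
```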
Do we use pinned host memory? I remember in the past that allocating pinned memory significantly improves copy performance.
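For context, a quick generic PyTorch micro-benchmark (not vLLM internals, just an illustration of the pinned vs. pageable copy difference) could look like:

```python
# Sketch: compare host-to-device copy time from pageable vs. pinned host memory.
import time
import torch

n = 64 * 1024 * 1024  # ~256 MB of float32
pageable = torch.empty(n)
pinned = torch.empty(n, pin_memory=True)
device_buf = torch.empty(n, device="cuda")

def timed_copy(src: torch.Tensor) -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    device_buf.copy_(src, non_blocking=True)  # issues cudaMemcpyAsync under the hood
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"pageable: {timed_copy(pageable):.4f}s")
print(f"pinned:   {timed_copy(pinned):.4f}s")
```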
Hey guys, thanks for the detailed profiling. I want to provide some context on how to profile GPU + CPU code for your edification. It is important to understand that the GPU and CPU can run at the same time. Calling a Torch module is asynchronous relative to the CPU: the CPU continues executing instructions until it reaches a synchronization point, such as when we copy data from the GPU to the CPU. The function cudaMemcpyAsync therefore shows up as large in the CPU trace mostly because the CPU is blocked there waiting for the GPU to finish its queued work, not because the copy itself is slow. What this tells us is that the source of the bottleneck is GPU execution time (i.e. how fast the model is running). When we profile the GPU code (see the graph below), we see that the LoRA adapter execution is adding significant overhead relative to the number of FLOPs. As a result, we are focusing efforts on optimizing the execution of LoRA adapters on the GPU.
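A tiny standalone PyTorch example of this behavior (generic, not tied to vLLM): the matmul launch returns almost immediately, and the elapsed time only becomes visible at the device-to-host copy, which is the synchronization point.

```python
# Sketch: CPU-side timers only capture GPU work at a synchronization point.
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

t0 = time.perf_counter()
y = x @ x           # kernel is launched asynchronously; this returns right away
t1 = time.perf_counter()
y_host = y.cpu()    # device-to-host copy blocks until the matmul has finished
t2 = time.perf_counter()

print(f"launch took {t1 - t0:.6f}s, sync + copy took {t2 - t1:.6f}s")
```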
@varun-sundar-rabindranath just landed #11579, which is the script for tuning the kernels. We are using this to optimize for A100 and H100.
Thanks @robertgshaw2-redhat @varun-sundar-rabindranath, as you pointed out, the slice flamegraphs above are for CPU. I've reloaded the traces and focused on the GPU kernel slices, which I've attached below. I assume you're already aware of this, but the main difference between the two traces is due to these two Punica steps, which together account for roughly 7 seconds — almost the total time delta between the two experiments.
Experiment 1: base model, 96 prompts, Llama 7B (script: benchmark_serving.py). Total time taken = 28 seconds.
Experiment 2: Tweet Summary adapter, 96 prompts, max lora rank = 64, max loras = 1, Llama 7B (script: benchmark_serving.py). Total time taken = 35 seconds.
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
Setup Summary for vLLM Benchmarking with Llama-2 Model:
Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)
Model:
meta-llama/Llama-2-7b-hf
GPU Count: 1
Experiments:
1. Base model only: meta-llama/Llama-2-7b-hf.
2. Base model with LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
3. Base model with LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).
For all three experiments, we used the same input prompts (ShareGPT) and observed similar output lengths.
Settings:
Benchmark Metrics:
We measured:
You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.
Observations and Questions:
Deployment Command:
Your current environment (if you think it is necessary)
Sample Query:
Deployment YAML Configuration:
This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
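For reference, a request that targets a LoRA adapter by the name registered via --lora-modules (a hypothetical sketch assuming vLLM's OpenAI-compatible /v1/completions endpoint; the adapter name, URL, and prompt are placeholders, not the original sample query):

```python
# Sketch: query the OpenAI-compatible server, selecting a LoRA adapter by its served name.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumption: local server, default port
    json={
        "model": "tweetsumm",  # hypothetical name registered with --lora-modules
        "prompt": "Summarize the following conversation: ...",
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["text"])
```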