[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB #10062
Comments
This is a very detailed and excellent description.
The latest version; see the container image in the YAML. Do you recommend a specific version to test with?
There are some similar issues; see #9496 and #9452. The main cause there was enable_eager=true, but I can't find this argument in your script. BTW, if I remember correctly, some versions had a CUDA graph bug, so I suggest you run profiling and confirm that eager mode is actually disabled; you can refer to https://docs.vllm.ai/en/latest/dev/profiling/profiling_index.html#openai-server. PS: I will try to reproduce your results tomorrow in my local timezone, and #5036 provides some test results.
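For reference, a minimal way to capture such a trace against a running server (a sketch, assuming a recent vLLM build where launching the server with VLLM_TORCH_PROFILER_DIR set exposes /start_profile and /stop_profile endpoints; the localhost URL and port are assumptions):

```python
# Sketch: trigger the torch profiler on a running vLLM OpenAI-compatible server.
# Assumes the server was started with VLLM_TORCH_PROFILER_DIR set, which on
# recent vLLM versions exposes /start_profile and /stop_profile.
import requests

BASE_URL = "http://localhost:8000"  # assumption: local server on the default port

requests.post(f"{BASE_URL}/start_profile")
# ... send benchmark traffic here, e.g. run benchmark_serving.py ...
requests.post(f"{BASE_URL}/stop_profile")
# Traces are written to the directory given by VLLM_TORCH_PROFILER_DIR.
```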
Thanks @jeejeelee, it will be great if you can reproduce!
We conducted testing on a local A800 (A800-SXM4-80GB).
vllm serve meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.90 --served-model-name base --enable-lora --max-loras 3 --max-cpu-loras 15 --max-lora-rank 64 --lora-modules moss=xtuner/Llama-2-7b-qlora-moss-003-sft
python benchmark_serving.py --model base --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset-name random --random-input-len 512 --random-output-len 128 --ignore-eos --num-prompts 24 --metric-percentiles 90 --request-rate 20 (request rate swept from 1 to 24)
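To sweep the request rate as described above, a small driver loop can rerun benchmark_serving.py (a sketch; the script path and the 1-24 range are assumptions taken from the command above):

```python
# Sketch: repeat the benchmark above for request rates 1..24.
import subprocess

for rate in range(1, 25):
    subprocess.run(
        [
            "python", "benchmark_serving.py",
            "--model", "base",
            "--tokenizer", "meta-llama/Llama-2-7b-chat-hf",
            "--dataset-name", "random",
            "--random-input-len", "512",
            "--random-output-len", "128",
            "--ignore-eos",
            "--num-prompts", "24",
            "--metric-percentiles", "90",
            "--request-rate", str(rate),
        ],
        check=True,
    )
```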
Thanks @jeejeelee, this is very insightful. Did you try the smaller adapter?
Not yet, I will tomorrow.
Same question, any solution yet?
I will also help reproduce this issue from my end this week. What I observe is around 20%-25% overhead, which is expected. It seems we need to standardize the LoRA workloads and benchmarks to better help users reproduce the results.
Throughput/latency vs KV cache utilization:
I did some new benchmarks and noticed that max LoRA rank has a significant impact on performance; it's best to set it equal to the rank of the LoRA (or the rank of the largest-ranked LoRA if using multiple LoRAs). This is consistent with what is documented here. With rank = 16, the throughput hit is about 27% at 80% KV cache utilization. (tp-2 indicates tensor parallelism = 2, i.e. 2 GPUs were used.)
I also enabled the vLLM profiler to get a more granular understanding of where the performance hit is coming from.
Performance analysis:
vLLM's profiler provides slice flamegraphs. Tweet Summary (max rank 64) running online with 96 prompts revealed cudaMemcpyAsync as a major latency contributor, accounting for 47% of the total 35 seconds. The base model's slice flamegraph showed cudaMemcpyAsync using 40% of the 27.96 seconds (96 prompts). The 8-second difference between base and LoRA models (same number of prompts) was therefore largely attributable to this additional cudaMemcpyAsync time.
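To illustrate the rank setting, here is a minimal offline sketch assuming vLLM's Python LLM/LoRARequest API; the local adapter path and prompt are placeholders, and rank 16 corresponds to the rank used in the benchmark above:

```python
# Sketch: size max_lora_rank to the actual adapter rank instead of a larger default.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=1,
    max_lora_rank=16,  # match the adapter's rank (16 here) rather than a larger value like 64
)

outputs = llm.generate(
    ["Summarize the following conversation: ..."],  # placeholder prompt
    SamplingParams(max_tokens=128),
    # placeholder local path to a downloaded copy of
    # vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
    lora_request=LoRARequest("tweetsumm", 1, "/path/to/qlora-adapter-Llama-2-7b-hf-TweetSumm"),
)
print(outputs[0].outputs[0].text)
```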
Do we use pinned host memory? I remember in the past that allocating pinned memory significantly improves copy performance.
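For context, a quick generic PyTorch micro-benchmark (not vLLM internals, just an illustration of the pinned vs. pageable copy difference) could look like:

```python
# Sketch: compare host-to-device copy time from pageable vs. pinned host memory.
import time
import torch

n = 64 * 1024 * 1024  # ~256 MB of float32
pageable = torch.empty(n)
pinned = torch.empty(n, pin_memory=True)
device_buf = torch.empty(n, device="cuda")

def timed_copy(src: torch.Tensor) -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    device_buf.copy_(src, non_blocking=True)  # issues cudaMemcpyAsync under the hood
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"pageable: {timed_copy(pageable):.4f}s")
print(f"pinned:   {timed_copy(pinned):.4f}s")
```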
Hey guys, thanks for the detailed profiling. I want to provide some context on how to profile GPU + CPU code for your edification. It is important to understand that the GPU and CPU can run at the same time. Calling a Torch module is asynchronous relative to the CPU: the CPU continues executing instructions until it reaches a synchronization point, such as when we copy data from the GPU to the CPU. The function cudaMemcpyAsync therefore shows up as large in the CPU trace mostly because the CPU is blocked there waiting for the GPU to finish its queued work, not because the copy itself is slow. What this tells us is that the source of the bottleneck is GPU execution time (i.e. how fast the model is running). When we profile the GPU code (see the graph below), we see that the LoRA adapter execution is adding significant overhead relative to the number of FLOPs. As a result, we are focusing efforts on optimizing the execution of LoRA adapters on the GPU.
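A tiny standalone PyTorch example of this behavior (generic, not tied to vLLM): the matmul launch returns almost immediately, and the elapsed time only becomes visible at the device-to-host copy, which is the synchronization point.

```python
# Sketch: CPU-side timers only capture GPU work at a synchronization point.
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

t0 = time.perf_counter()
y = x @ x           # kernel is launched asynchronously; this returns right away
t1 = time.perf_counter()
y_host = y.cpu()    # device-to-host copy blocks until the matmul has finished
t2 = time.perf_counter()

print(f"launch took {t1 - t0:.6f}s, sync + copy took {t2 - t1:.6f}s")
```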
@varun-sundar-rabindranath just landed #11579, which is the script for tuning the kernels. We are using this to optimize for A100 and H100.
Thanks @robertgshaw2-redhat @varun-sundar-rabindranath, as you pointed out, the slice flamegraphs above are for CPU. I've reloaded the traces and focused on the GPU kernel slices, which I've attached below. I assume you're already aware of this, but the main difference between the two traces is due to these two Punica steps, which together account for roughly 7 seconds — almost the total time delta between the two experiments.
Experiment 1: base model, 96 prompts, Llama 7B (script: benchmark_serving.py). Total time taken = 28 seconds.
Experiment 2: Tweet Summary adapter, 96 prompts, max lora rank = 64, max loras = 1, Llama 7B (script: benchmark_serving.py). Total time taken = 35 seconds.
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
Setup Summary for vLLM Benchmarking with Llama-2 Model:
Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)
Model:
meta-llama/Llama-2-7b-hf
GPU Count: 1
Experiments:
1. Base model only: meta-llama/Llama-2-7b-hf.
2. Base model with LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
3. Base model with LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).
For all three experiments, we used the same input prompts (ShareGPT) and observed similar output lengths.
Settings:
Benchmark Metrics:
We measured:
You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.
Observations and Questions:
Deployment Command:
Your current environment (if you think it is necessary)
Sample Query:
Deployment YAML Configuration:
This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
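For reference, a request that targets a LoRA adapter by the name registered via --lora-modules (a hypothetical sketch assuming vLLM's OpenAI-compatible /v1/completions endpoint; the adapter name, URL, and prompt are placeholders, not the original sample query):

```python
# Sketch: query the OpenAI-compatible server, selecting a LoRA adapter by its served name.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumption: local server, default port
    json={
        "model": "tweetsumm",  # hypothetical name registered with --lora-modules
        "prompt": "Summarize the following conversation: ...",
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["text"])
```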