Benchmarking prefix cache routing #390
Conversation
Force-pushed from 4cb28f1 to 015baf7.
These results look pretty promising. A few things:
```yaml
# Plain Service that selects the model Pods directly, bypassing the KubeAI proxy.
apiVersion: v1
kind: Service
metadata:
  name: bypass-kubeai
spec:
  selector:
    model: llama-3.1-8b-instruct-fp8-l4
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```
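Presumably the point of this Service is to give the benchmark a baseline: pointing the client at `http://bypass-kubeai` routes requests through ordinary kube-proxy load balancing instead of KubeAI's model proxy, so the prefix-aware routing results can be compared against it.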
Great asset to have this benchmark!
@@ -0,0 +1,1373 @@
# SPDX-License-Identifier: Apache-2.0
Can you add a comment with a link to where this file came from and what was changed?
done
@@ -19,6 +20,24 @@ Quotes from the community:

> reusable, well abstracted solution to run LLMs - [Mike Ensor](https://www.linkedin.com/posts/mikeensor_gcp-solutions-public-retail-edge-available-cluster-traits-activity-7237515920259104769-vBs9?utm_source=share&utm_medium=member_desktop)

## Why KubeAI?

### Better performance at scale
Intro: When running multiple replicas of a serving engine such as vLLM, performance under production traffic is heavily influenced by the load balancing strategy.
Done.
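To make the suggested intro concrete, here is a minimal illustrative sketch of the idea behind prefix-aware load balancing (this is not KubeAI's actual implementation; the replica list and hashing scheme are stand-ins): requests whose prompts share a sufficiently long prefix are hashed to the same replica, so follow-up turns of a chat conversation keep hitting that replica's KV cache.

```python
import hashlib

REPLICAS = ["pod-0:8000", "pod-1:8000", "pod-2:8000"]  # stand-in endpoints

def pick_replica(prompt: str, replicas: list[str], prefix_len: int = 256) -> str:
    """Hash the first `prefix_len` characters of the prompt to pick a replica.

    Chat turns extend a shared conversation prefix, so as long as that
    shared prefix covers `prefix_len` characters, every turn of the same
    conversation lands on the same replica and can reuse its KV cache.
    """
    digest = hashlib.sha256(prompt[:prefix_len].encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

system = "You are a helpful assistant. " * 10  # long shared system prompt
turn_1 = system + "User: Hi!"
turn_2 = turn_1 + " Assistant: Hello! User: Tell me more."
assert pick_replica(turn_1, REPLICAS) == pick_replica(turn_2, REPLICAS)
```

Production routers typically use consistent hashing (often with bounded loads) rather than a plain modulo, so that adding or removing replicas does not remap most conversations.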
- Load the whole conversation as prompts.
- Limit the maximum number of conversations and re-use the same conversation if needed.

This allows us to verify whether prefix aware load balancing provides a performance benefit at large scale.
s/at large scale/under heavy production traffic with ongoing chat conversations/
done
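As a sketch of the conversation-reuse approach described in the diff above (hypothetical names; the PR's actual benchmark script may differ): the benchmark keeps a bounded pool of conversations, sends each conversation's entire history as the prompt, and appends new turns so that successive requests share a growing prefix.

```python
import random

class ConversationPool:
    """Cap the number of distinct conversations and reuse them across requests."""

    def __init__(self, max_conversations: int, seed_prompts: list[str]):
        # One growing message list per conversation, bounded by max_conversations.
        self.conversations = [[p] for p in seed_prompts[:max_conversations]]

    def next_prompt(self) -> tuple[int, str]:
        """Pick a conversation at random and return its whole history as the prompt."""
        i = random.randrange(len(self.conversations))
        return i, "\n".join(self.conversations[i])

    def record_turn(self, i: int, completion: str, follow_up: str) -> None:
        """Append the model's reply and the next user turn, growing the shared prefix."""
        self.conversations[i].extend([completion, follow_up])
```

A worker would call `next_prompt()`, send the returned text as the request, then call `record_turn()` with the completion and the next user message; because every turn resends the full history, a prefix-aware router can keep each conversation pinned to one replica's cache.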
@@ -0,0 +1,16 @@
# Benchmarking Text Generation
We use `kebab-case` in all of the other dirs in this project. Prefer to keep with that convention and rename this to `benchmarks/chat-py/` and change the other to `benchmarks/chat-k6/`.
done