
Benchmarking prefix cache routing #390

Merged: 22 commits into main on Feb 12, 2025
Conversation

samos123 (Contributor)

No description provided.

@samos123 force-pushed the benchmarking-prefix-cache-routing branch from 4cb28f1 to 015baf7 on February 10, 2025 02:28
nstogner (Contributor) commented Feb 10, 2025

These results look pretty promising. A few things:

  1. Since there is such a large gap in TTFT between the PrefixHash and LeastLoad load balancing modes, I think it makes sense to add a third scenario as a sanity check: a raw k8s Service (i.e. round-robin). This will give us an indication of whether a bug in the LeastLoad codepath might be causing the ramp in TTFT. This can be done by setting minReplicas: 8 in the Model and applying a new Service that bypasses KubeAI (set --base-url=http://bypass-kubeai in the benchmark script; a sketch of the matching Model change follows after this list):
apiVersion: v1
kind: Service
metadata:
  name: bypass-kubeai
spec:
  selector:
    model: llama-3.1-8b-instruct-fp8-l4
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  2. I am wondering why TTFT actually gets shorter at medium load for PrefixHash routing - any thoughts? Perhaps the dip is just within the variance you would expect across different runs.

  3. The graph is nice; I would love to see one for the throughput numbers as well.

  4. I am assuming you are keeping the Model object around across the different test runs and just changing the load balancing strategy field? Or are you deleting and recreating it?
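
For illustration, here is a minimal sketch of the Model change being described. The kubeai.org/v1 API version and the minReplicas field name are assumptions about the Model CRD and should be checked against the actual schema; only the fields relevant to this scenario are shown.

# Hypothetical fragment of the existing Model manifest (not a complete spec).
apiVersion: kubeai.org/v1
kind: Model
metadata:
  # Name taken from the Service selector above.
  name: llama-3.1-8b-instruct-fp8-l4
spec:
  # Pin the replica count so the bypass Service always has a fixed pool
  # of 8 endpoints to round-robin across.
  minReplicas: 8
  # The rest of the existing spec stays unchanged; the bypass Service routes
  # straight to the pods and sidesteps KubeAI's load balancer entirely.

With both manifests applied, running the benchmark with --base-url=http://bypass-kubeai gives a plain kube-proxy round-robin baseline to compare against the PrefixHash and LeastLoad runs.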

samos123 (Author) replied:

  1. Will do another test.
  2. I think that's just pure variance between runs.
  3. Will do.
  4. Yes, I only modified the Model by patching the load balancing strategy, so the existing pods stayed alive.
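
For reference, that per-run change might look like the fragment below. The loadBalancing.strategy field name is an assumption about the Model API; the point is only that nothing else in the spec changes, so the existing vLLM pods keep running between runs.

# Hypothetical fragment of the Model spec patched between benchmark runs.
spec:
  loadBalancing:
    strategy: PrefixHash   # switched to LeastLoad (or back) for the other run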

@samos123 changed the title from "WIP Benchmarking prefix cache routing" to "Benchmarking prefix cache routing" on Feb 11, 2025
@samos123 requested a review from nstogner on February 12, 2025 00:11
nstogner (Contributor) left a review:

Great asset to have this benchmark!

@@ -0,0 +1,1373 @@
# SPDX-License-Identifier: Apache-2.0

nstogner (Contributor): Can you add a comment with a link to where this file came from and what was changed?

samos123 (Author): done

@@ -19,6 +20,24 @@ Quotes from the community:

> reusable, well abstracted solution to run LLMs - [Mike Ensor](https://www.linkedin.com/posts/mikeensor_gcp-solutions-public-retail-edge-available-cluster-traits-activity-7237515920259104769-vBs9?utm_source=share&utm_medium=member_desktop)

## Why KubeAI?

### Better performance at scale
nstogner (Contributor) commented:

+Intro: When running multiple replicas of a serving engine such as vLLM, performance under production traffic is heavily influenced by the load balancing strategy.

samos123 (Author) replied on Feb 12, 2025:

Done.

- Load the whole conversation as prompts.
- Limit the maximum number of conversations and re-use the same conversation if needed.

This allows us to verify whether prefix aware load balancing provides a performance
nstogner (Contributor) commented:

s/at large scale/under heavy production traffic with ongoing chat conversations/

samos123 (Author) replied:

done

@@ -0,0 +1,16 @@
# Benchmarking Text Generation
nstogner (Contributor) commented:

We use kebab-case in all of the other dirs in this project. I'd prefer to stick with that convention: rename this to benchmarks/chat-py/ and change the other to benchmarks/chat-k6/.

samos123 (Author) replied:

done

@samos123 requested a review from nstogner on February 12, 2025 22:40
@samos123 merged commit a2349bf into main on Feb 12, 2025
16 checks passed
@samos123 deleted the benchmarking-prefix-cache-routing branch on February 12, 2025 22:54