Benchmarking prefix cache routing #390
Conversation
Force-pushed from 4cb28f1 to 015baf7.
These results look pretty promising. A few things:
```yaml
# Plain Service that selects the model Pods directly, bypassing the KubeAI proxy.
apiVersion: v1
kind: Service
metadata:
  name: bypass-kubeai
spec:
  selector:
    model: llama-3.1-8b-instruct-fp8-l4
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```
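Presumably the point of this Service is to give the benchmark a baseline: pointing the client at `http://bypass-kubeai` routes requests through ordinary kube-proxy load balancing instead of KubeAI's model proxy, so the prefix-aware routing results can be compared against it.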
Great asset to have this benchmark!
@@ -0,0 +1,1373 @@
# SPDX-License-Identifier: Apache-2.0
Can you add a comment with a link to where this file came from and what was changed?
done
@@ -19,6 +20,24 @@ Quotes from the community:

> reusable, well abstracted solution to run LLMs - [Mike Ensor](https://www.linkedin.com/posts/mikeensor_gcp-solutions-public-retail-edge-available-cluster-traits-activity-7237515920259104769-vBs9?utm_source=share&utm_medium=member_desktop)

## Why KubeAI?

### Better performance at scale
Intro: When running multiple replicas of a serving engine such as vLLM, performance under production traffic is heavily influenced by the load balancing strategy.
Done.
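To make the suggested intro concrete, here is a minimal illustrative sketch of the idea behind prefix-aware load balancing (this is not KubeAI's actual implementation; the replica list and hashing scheme are stand-ins): requests whose prompts share a sufficiently long prefix are hashed to the same replica, so follow-up turns of a chat conversation keep hitting that replica's KV cache.

```python
import hashlib

REPLICAS = ["pod-0:8000", "pod-1:8000", "pod-2:8000"]  # stand-in endpoints

def pick_replica(prompt: str, replicas: list[str], prefix_len: int = 256) -> str:
    """Hash the first `prefix_len` characters of the prompt to pick a replica.

    Chat turns extend a shared conversation prefix, so as long as that
    shared prefix covers `prefix_len` characters, every turn of the same
    conversation lands on the same replica and can reuse its KV cache.
    """
    digest = hashlib.sha256(prompt[:prefix_len].encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

system = "You are a helpful assistant. " * 10  # long shared system prompt
turn_1 = system + "User: Hi!"
turn_2 = turn_1 + " Assistant: Hello! User: Tell me more."
assert pick_replica(turn_1, REPLICAS) == pick_replica(turn_2, REPLICAS)
```

Production routers typically use consistent hashing (often with bounded loads) rather than a plain modulo, so that adding or removing replicas does not remap most conversations.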
- Load the whole conversation as prompts.
- Limit the maximum number of conversations and re-use the same conversation if needed.

This allows us to verify whether prefix aware load balancing provides a performance benefit at large scale.
s/at large scale/under heavy production traffic with ongoing chat conversations/
done
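As a sketch of the conversation-reuse approach described in the diff above (hypothetical names; the PR's actual benchmark script may differ): the benchmark keeps a bounded pool of conversations, sends each conversation's entire history as the prompt, and appends new turns so that successive requests share a growing prefix.

```python
import random

class ConversationPool:
    """Cap the number of distinct conversations and reuse them across requests."""

    def __init__(self, max_conversations: int, seed_prompts: list[str]):
        # One growing message list per conversation, bounded by max_conversations.
        self.conversations = [[p] for p in seed_prompts[:max_conversations]]

    def next_prompt(self) -> tuple[int, str]:
        """Pick a conversation at random and return its whole history as the prompt."""
        i = random.randrange(len(self.conversations))
        return i, "\n".join(self.conversations[i])

    def record_turn(self, i: int, completion: str, follow_up: str) -> None:
        """Append the model's reply and the next user turn, growing the shared prefix."""
        self.conversations[i].extend([completion, follow_up])
```

A worker would call `next_prompt()`, send the returned text as the request, then call `record_turn()` with the completion and the next user message; because every turn resends the full history, a prefix-aware router can keep each conversation pinned to one replica's cache.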
@@ -0,0 +1,16 @@
# Benchmarking Text Generation
We use `kebab-case` in all of the other dirs in this project. Prefer to keep with that convention and rename this to `benchmarks/chat-py/` and change the other to `benchmarks/chat-k6/`.
done