Benchmarking prefix cache routing (#390)

substratusai · Feb 12, 2025 · a2349bf · a2349bf
1 parent 6d719e5
commit a2349bf
Show file tree

Hide file tree

Showing 12 changed files with 2,478 additions and 2 deletions.
diff --git a/benchmarks/chat-py/.gitignore b/benchmarks/chat-py/.gitignore
@@ -0,0 +1 @@
+sharegpt_16_messages_or_more.json
diff --git a/benchmarks/chat-py/Dockerfile b/benchmarks/chat-py/Dockerfile
@@ -0,0 +1,24 @@
+# Use a lightweight Python base image
+FROM python:3.10
+
+# Set the working directory
+WORKDIR /app
+
+# Copy requirements first to leverage Docker cache
+COPY requirements.txt .
+
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the benchmark serving script
+COPY backend_request_func.py .
+COPY benchmark_serving.py .
+RUN curl -O -L https://huggingface.co/datasets/samos123/share-gpt-long-convos/resolve/main/sharegpt_16_messages_or_more.json
+
+# Set environment variables
+ENV PYTHONPATH=/app
+
+# Define the entrypoint command
+ENTRYPOINT ["python", "benchmark_serving.py"]
+
+CMD ["--dataset-name=sharegpt", "--dataset-path=sharegpt_16_messages_or_more.json"]
diff --git a/benchmarks/chat-py/README.md b/benchmarks/chat-py/README.md
@@ -0,0 +1,16 @@
+# Benchmarking Text Generation
+
+This script was adopted from the vLLM code base. The main differences are:
+- Load the whole conversation as prompts.
+- Limit the amount of max conversations and re-use the same conversation if needed.
+
+This allows us to verify whether prefix aware load balancing provides a performance
+boost under heavy production traffic with ongoing chat conversations.
+
+## Running
+
+Adjust the parameters in the `job.yaml` file and run the job using the following command:
+```
+kubectl apply -f job.yaml
+```
+