
Ollama VS vLLM: Google Cloud Run

I've deployed Llama 3.2 3B using both Ollama and vLLM on Google Cloud Run.

Our end objectives are:

  • Fast as fuck cold start for autoscaling
  • Low-latency inference: very low TTFT & delay between tokens for realtime apps

I've run the following experiments (a rough sketch of the kind of request driver used is included after the list):

  • Sequential, one cold start: 10 sequential requests starting from zero container instances, so only the first request hits the cold start
  • Concurrent, cold start: 10 concurrent requests starting from zero container instances
  • Concurrent, warm: 10 concurrent requests with one container instance already running
  • Batched concurrent, warm: batches of concurrent requests, ramping the batch size from 1 to 10, with one container instance already running
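
For reference, this is a minimal sketch of a concurrent TTFT probe. It is not the actual src/latency-tracker.js; it assumes an OpenAI-compatible streaming endpoint (both Ollama and vLLM expose /v1/chat/completions), and BASE_URL / MODEL are placeholders.

```js
// Minimal sketch of a concurrent TTFT probe -- NOT the actual src/latency-tracker.js.
// Run with Node 18+ as an ES module (e.g. node driver.mjs).
const BASE_URL = process.env.BASE_URL ?? "http://localhost:8000"; // placeholder
const MODEL = process.env.MODEL ?? "llama3.2:3b";                 // placeholder

async function timeOneRequest(prompt) {
  const start = performance.now();
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });

  let ttft = null; // ms until the first streamed chunk arrives
  let chunks = 0;  // rough count of streamed chunks (a crude proxy for tokens)
  const reader = res.body.getReader();
  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    if (ttft === null) ttft = performance.now() - start;
    chunks += 1;
  }
  return { ttft, total: performance.now() - start, chunks };
}

// "Concurrent" experiments: fire N requests at once and wait for all of them.
const N = 10;
const results = await Promise.all(
  Array.from({ length: N }, (_, i) => timeOneRequest(`Request ${i}: write a haiku`))
);
console.table(results);
```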

Data

Data was collected using the code in the src/ directory: src/latency-tracker.js was used by the sequential and concurrent drivers to record per-request timings, and the raw data was then processed and summarized by src/stats.js.
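
Roughly, the summary that gets computed looks like this (a sketch, not the actual src/stats.js; field names follow the driver sketch above):

```js
// Sketch of summary stats over the collected samples -- NOT the actual src/stats.js.
function percentile(sortedValues, p) {
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.min(sortedValues.length - 1, Math.max(0, idx))];
}

function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  return {
    mean,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// e.g. summarize(results.map((r) => r.ttft)) for TTFT across the 10 requests
```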

The data is available in the data directory.

Analysis

Analysis is available at data/analysis.ipynb.

Observations

  • Cold-start latency is higher with vLLM than with Ollama, but once the instance is up and running, vLLM's TTFT is lower.

  • As we increased concurrency, Ollama started to perform worse on TTFT (time to first token), degrading noticeably from 5 concurrent requests onward; vLLM's TTFT for the same experiment stayed under 0.6 seconds.

  • In the best case (i.e. no concurrency), Ollama served at 63 tokens per second, while vLLM served at only 31-35 tokens per second across all levels of concurrency. (A sketch of how per-request decode speed can be derived from the samples follows this list.)

  • EDIT: I increased OLLAMA_NUM_PARALLEL to 10 and noticed Ollama was faster than vLLM for 10 concurrent requests.
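
One way to get a per-request tokens-per-second figure out of the raw samples (a sketch; the field names come from the driver sketch above, not necessarily the actual data schema):

```js
// Decode speed = tokens generated after the first one, divided by the time spent
// after the first token, so prefill/TTFT is excluded. Times are in ms; `chunks`
// is the rough token proxy from the driver sketch.
const decodeTokensPerSec = ({ chunks, total, ttft }) =>
  (chunks - 1) / ((total - ttft) / 1000);
```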

Unknowns & Questions to explore

  • We need to understand how well these inference engines are utilizing the GPU. If they are under-utilizing it, we can expect higher latency than the hardware is capable of.

  • I got an error deploying vLLM on both Google Cloud Run and EC2. The error was about the max sequence length exceeding what the KV cache can hold in GPU memory:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (117328). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
  • I decreased max_model_len to 4096, which fixed it. Also note that vLLM by default uses 90% of GPU memory (gpu_memory_utilization=0.9); the equivalent behavior for Ollama is unknown to me. A rough back-of-the-envelope check of the KV-cache numbers is sketched below.
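
As a sanity check on those numbers, here's a back-of-the-envelope KV-cache calculation. It assumes Llama 3.2 3B's published config (28 layers, 8 KV heads, head_dim 128) and a 2-byte (fp16/bf16) cache dtype; treat it as a rough sketch, not vLLM's exact accounting (vLLM also keeps weights, activations, and profiling overhead inside that 0.9 fraction).

```js
// Rough KV-cache sizing for Llama 3.2 3B (assumed config: 28 layers, 8 KV heads,
// head_dim 128, 2-byte cache elements). Not vLLM's exact accounting.
const layers = 28, kvHeads = 8, headDim = 128, bytesPerElem = 2;

// Each token stores a K and a V vector per layer per KV head.
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem; // 114,688 B ≈ 112 KiB

const gib = (tokens) => (tokens * bytesPerToken / 1024 ** 3).toFixed(1);
console.log(`max_model_len 131072 -> ~${gib(131072)} GiB of KV cache`); // ~14.0 GiB
console.log(`117328 tokens (what fit) -> ~${gib(117328)} GiB`);         // ~12.5 GiB
console.log(`max_model_len 4096 -> ~${gib(4096)} GiB`);                 // ~0.4 GiB
```

So the default 128k context alone wants roughly 14 GiB of cache on top of the weights, which is why capping max_model_len (or raising gpu_memory_utilization, as the error suggests) resolves it.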

Next steps

Optimize the shit out of vLLM or Ollama, whichever we end up using.

And I read about run.ai's Model Streamer. Their benchmarks are impressive (here they are). It is supported by vLLM.

There's also tensorizer.
