
Ollama VS vLLM: Google Cloud Run

I've deployed Llama 3.2 3B using both Ollama and vLLM on Google Cloud Run.

Our end objectives are:

  • Fast as fuck cold start for autoscaling
  • Low-latency inference: very low TTFT & delay between tokens for realtime apps

I've run the following experiments (a rough sketch of the kind of request driver used is included after the list):

  • Sequential, one cold start: 10 sequential requests starting from zero container instances, so only the first request hits the cold start
  • Concurrent, cold start: 10 concurrent requests starting from zero container instances
  • Concurrent, warm: 10 concurrent requests with one container instance already running
  • Batched concurrent, warm: batches of concurrent requests, ramping the batch size from 1 to 10, with one container instance already running
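
For reference, this is a minimal sketch of a concurrent TTFT probe. It is not the actual src/latency-tracker.js; it assumes an OpenAI-compatible streaming endpoint (both Ollama and vLLM expose /v1/chat/completions), and BASE_URL / MODEL are placeholders.

```js
// Minimal sketch of a concurrent TTFT probe -- NOT the actual src/latency-tracker.js.
// Run with Node 18+ as an ES module (e.g. node driver.mjs).
const BASE_URL = process.env.BASE_URL ?? "http://localhost:8000"; // placeholder
const MODEL = process.env.MODEL ?? "llama3.2:3b";                 // placeholder

async function timeOneRequest(prompt) {
  const start = performance.now();
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });

  let ttft = null; // ms until the first streamed chunk arrives
  let chunks = 0;  // rough count of streamed chunks (a crude proxy for tokens)
  const reader = res.body.getReader();
  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    if (ttft === null) ttft = performance.now() - start;
    chunks += 1;
  }
  return { ttft, total: performance.now() - start, chunks };
}

// "Concurrent" experiments: fire N requests at once and wait for all of them.
const N = 10;
const results = await Promise.all(
  Array.from({ length: N }, (_, i) => timeOneRequest(`Request ${i}: write a haiku`))
);
console.table(results);
```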

Data

Data was collected using the code in the src/ directory: src/latency-tracker.js was used by the sequential and concurrent drivers to record per-request timings, and the raw data was then processed and summarized by src/stats.js.
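
Roughly, the summary that gets computed looks like this (a sketch, not the actual src/stats.js; field names follow the driver sketch above):

```js
// Sketch of summary stats over the collected samples -- NOT the actual src/stats.js.
function percentile(sortedValues, p) {
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.min(sortedValues.length - 1, Math.max(0, idx))];
}

function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  return {
    mean,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// e.g. summarize(results.map((r) => r.ttft)) for TTFT across the 10 requests
```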

The data is available in the data directory.

Analysis

Analysis is available at data/analysis.ipynb.

Observations

  • Cold-start latency is higher with vLLM than with Ollama, but once the instance is up and running, vLLM's TTFT is lower.

  • As we increased concurrency, Ollama started to perform worse on TTFT (time to first token), degrading noticeably from 5 concurrent requests onward; vLLM's TTFT for the same experiment stayed under 0.6 seconds.

  • In the best case (i.e. no concurrency), Ollama served at 63 tokens per second, while vLLM served at only 31-35 tokens per second across all levels of concurrency. (A sketch of how per-request decode speed can be derived from the samples follows this list.)

  • EDIT: I increased OLLAMA_NUM_PARALLEL to 10 and noticed Ollama was faster than vLLM for 10 concurrent requests.
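
One way to get a per-request tokens-per-second figure out of the raw samples (a sketch; the field names come from the driver sketch above, not necessarily the actual data schema):

```js
// Decode speed = tokens generated after the first one, divided by the time spent
// after the first token, so prefill/TTFT is excluded. Times are in ms; `chunks`
// is the rough token proxy from the driver sketch.
const decodeTokensPerSec = ({ chunks, total, ttft }) =>
  (chunks - 1) / ((total - ttft) / 1000);
```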

Unknowns & Questions to explore

  • We need to understand how well these inference engines are utilizing the GPU. If they are under-utilizing it, we can expect higher latency than the hardware is capable of.

  • I got an error deploying vLLM on both Google Cloud Run and EC2. The error was about the max sequence length exceeding what the KV cache can hold in GPU memory:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (117328). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
  • I decreased max_model_len to 4096, which fixed it. Also note that vLLM by default uses 90% of GPU memory (gpu_memory_utilization=0.9); the equivalent behavior for Ollama is unknown to me. A rough back-of-the-envelope check of the KV-cache numbers is sketched below.
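
As a sanity check on those numbers, here's a back-of-the-envelope KV-cache calculation. It assumes Llama 3.2 3B's published config (28 layers, 8 KV heads, head_dim 128) and a 2-byte (fp16/bf16) cache dtype; treat it as a rough sketch, not vLLM's exact accounting (vLLM also keeps weights, activations, and profiling overhead inside that 0.9 fraction).

```js
// Rough KV-cache sizing for Llama 3.2 3B (assumed config: 28 layers, 8 KV heads,
// head_dim 128, 2-byte cache elements). Not vLLM's exact accounting.
const layers = 28, kvHeads = 8, headDim = 128, bytesPerElem = 2;

// Each token stores a K and a V vector per layer per KV head.
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem; // 114,688 B ≈ 112 KiB

const gib = (tokens) => (tokens * bytesPerToken / 1024 ** 3).toFixed(1);
console.log(`max_model_len 131072 -> ~${gib(131072)} GiB of KV cache`); // ~14.0 GiB
console.log(`117328 tokens (what fit) -> ~${gib(117328)} GiB`);         // ~12.5 GiB
console.log(`max_model_len 4096 -> ~${gib(4096)} GiB`);                 // ~0.4 GiB
```

So the default 128k context alone wants roughly 14 GiB of cache on top of the weights, which is why capping max_model_len (or raising gpu_memory_utilization, as the error suggests) resolves it.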

Next steps

Optimize the shit out of vLLM or Ollama, whichever we end up using.

And I read about run.ai's Model Streamer. Their benchmarks are impressive (here they are). It is supported by vLLM.

There's also tensorizer.
