Using TabbyAPI/exllamav2 with Llama3.1-8B
Threadripper Pro/A6000 GPU
Inference at ~70t/s unconstrained, single request. ~35t/s with lm-format-enforcer (JSON schema)
When running 30 simultaneous requests, performance drops to ~1-2t/s, with CUDA utilization at ~10%. This does not happen when lm-format-enforcer is not used (90-100% CUDA utilization with 10-20t/s on each request).
@turboderp has been able to replicate this and suggests it is due to the large Llama3.1 vocabulary combined with the GIL forcing single-threaded behaviour.
Is this likely to be fixable or is it too complex? Thanks!
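For anyone trying to picture where the time goes: a JSON-schema enforcer has to work out which of the ~128k tokens in the Llama3.1 vocabulary are legal at every decode step, and that bookkeeping runs in Python. The toy sketch below is not lm-format-enforcer code, just a stand-in mask computation, but it shows why simply threading the requests doesn't help: the GIL serializes the pure-Python work, so 30 "concurrent" requests take roughly as long as running them one after another.

```python
import threading
import time

VOCAB_SIZE = 128_256   # Llama 3.1 vocabulary size
NUM_REQUESTS = 30      # concurrent generation requests
STEPS = 10             # decode steps simulated per request

def allowed_token_mask(step: int) -> list[bool]:
    """Stand-in for the per-step work a schema enforcer does in Python:
    scan the whole vocabulary and decide which tokens are currently legal."""
    return [(tok + step) % 7 != 0 for tok in range(VOCAB_SIZE)]

def run_request(req_id: int) -> None:
    for step in range(STEPS):
        allowed_token_mask(step)

# Sequential baseline: one request at a time.
t0 = time.perf_counter()
for r in range(NUM_REQUESTS):
    run_request(r)
seq = time.perf_counter() - t0

# Threaded version: the mask computation is pure Python, so the GIL
# serializes it and wall time barely improves.
t0 = time.perf_counter()
threads = [threading.Thread(target=run_request, args=(r,)) for r in range(NUM_REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
thr = time.perf_counter() - t0

print(f"sequential: {seq:.2f}s  threaded: {thr:.2f}s")
```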
I think the correct way to approach this would probably be to use some multiprocessing / queue setup, but it would have to be deeply integrated with exllamav2.
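As a rough illustration of that idea (the names and structure here are hypothetical, and a real version would have to hook into exllamav2's sampling loop), worker processes could compute the allowed-token masks and hand them back to the generation process, so the per-token Python work escapes the GIL:

```python
import multiprocessing as mp

VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary size
NUM_REQUESTS = 30

def compute_mask(args):
    """Hypothetical stand-in for the enforcer's per-step work: given a request
    id and decode step, return the allowed-token mask for that request."""
    req_id, step = args
    return req_id, [(tok + step) % 7 != 0 for tok in range(VOCAB_SIZE)]

if __name__ == "__main__":
    # Each worker process has its own interpreter and GIL, so the mask
    # computations for the 30 requests can actually run in parallel.
    with mp.Pool(processes=8) as pool:
        pending = [(req_id, 0) for req_id in range(NUM_REQUESTS)]  # step 0 for every request
        for req_id, mask in pool.imap_unordered(compute_mask, pending):
            # The main (generation) process would apply `mask` to the logits
            # of request `req_id` before sampling its next token.
            assert len(mask) == VOCAB_SIZE
```

Even then, shipping a ~128k-entry mask between processes on every token isn't free, which is presumably part of why this would need to be integrated deeply rather than bolted on from the outside.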