Using TabbyAPI/exllamav2 with Llama3.1-8B
Threadripper Pro/A6000 GPU
Inference at ~70t/s unconstrained, single request. ~35t/s with lm-format-enforcer (JSON schema)
When running 30 simultaneous requests, performance drops to ~1-2t/s, with CUDA utilization at ~10%. This does not happen when lm-format-enforcer is not used (90-100% CUDA utilization with 10-20t/s on each request).
@turboderp has been able to replicate this and suggests it is due to the large Llama3.1 vocabulary combined with the GIL forcing single-threaded behaviour.
Is this likely to be fixable or is it too complex? Thanks!
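For anyone trying to picture where the time goes: a JSON-schema enforcer has to work out which of the ~128k tokens in the Llama3.1 vocabulary are legal at every decode step, and that bookkeeping runs in Python. The toy sketch below is not lm-format-enforcer code, just a stand-in mask computation, but it shows why simply threading the requests doesn't help: the GIL serializes the pure-Python work, so 30 "concurrent" requests take roughly as long as running them one after another.

```python
import threading
import time

VOCAB_SIZE = 128_256   # Llama 3.1 vocabulary size
NUM_REQUESTS = 30      # concurrent generation requests
STEPS = 10             # decode steps simulated per request

def allowed_token_mask(step: int) -> list[bool]:
    """Stand-in for the per-step work a schema enforcer does in Python:
    scan the whole vocabulary and decide which tokens are currently legal."""
    return [(tok + step) % 7 != 0 for tok in range(VOCAB_SIZE)]

def run_request(req_id: int) -> None:
    for step in range(STEPS):
        allowed_token_mask(step)

# Sequential baseline: one request at a time.
t0 = time.perf_counter()
for r in range(NUM_REQUESTS):
    run_request(r)
seq = time.perf_counter() - t0

# Threaded version: the mask computation is pure Python, so the GIL
# serializes it and wall time barely improves.
t0 = time.perf_counter()
threads = [threading.Thread(target=run_request, args=(r,)) for r in range(NUM_REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
thr = time.perf_counter() - t0

print(f"sequential: {seq:.2f}s  threaded: {thr:.2f}s")
```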
I think the correct way to approach this would probably be to use some multiprocessing / queue setup, but it would have to be deeply integrated with exllamav2.
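As a rough illustration of that idea (the names and structure here are hypothetical, and a real version would have to hook into exllamav2's sampling loop), worker processes could compute the allowed-token masks and hand them back to the generation process, so the per-token Python work escapes the GIL:

```python
import multiprocessing as mp

VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary size
NUM_REQUESTS = 30

def compute_mask(args):
    """Hypothetical stand-in for the enforcer's per-step work: given a request
    id and decode step, return the allowed-token mask for that request."""
    req_id, step = args
    return req_id, [(tok + step) % 7 != 0 for tok in range(VOCAB_SIZE)]

if __name__ == "__main__":
    # Each worker process has its own interpreter and GIL, so the mask
    # computations for the 30 requests can actually run in parallel.
    with mp.Pool(processes=8) as pool:
        pending = [(req_id, 0) for req_id in range(NUM_REQUESTS)]  # step 0 for every request
        for req_id, mask in pool.imap_unordered(compute_mask, pending):
            # The main (generation) process would apply `mask` to the logits
            # of request `req_id` before sampling its next token.
            assert len(mask) == VOCAB_SIZE
```

Even then, shipping a ~128k-entry mask between processes on every token isn't free, which is presumably part of why this would need to be integrated deeply rather than bolted on from the outside.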