[REQUEST] High throughput with large batch size #686
Comments
You can run with a large batch size if you have the VRAM to store that number of sequences in the cache at once. But throughput and latency are ultimately still connected. If you want to run at bsz 1000 and a context length of 32k or whatever, that means a 32M-token cache. However you manage that, it's going to far outweigh the storage requirement and bandwidth usage for the weights, and at that point why would you even be considering quantization?
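To put rough numbers on the 32M-token example, here is a minimal sketch of the cache-size arithmetic. The model shape (32 layers, 32 KV heads, head dimension 128) is an assumption modelled on a Llama-2-7B-style architecture, not something stated in the thread, and `kv_cache_bytes` is a hypothetical helper, not part of exllamav2.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # Factor of 2: one key vector and one value vector per layer, head and token.
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return num_tokens * elements_per_token * bits_per_element // 8


# Assumed Llama-2-7B-like shape (illustrative, not taken from the thread).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

tokens = 1000 * 32_000  # bsz 1000 at a 32k context = 32M cached tokens
cache_fp16 = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, 16)
weights_fp16 = 7_000_000_000 * 2  # 7B parameters at 2 bytes each

print(f"FP16 cache:      {cache_fp16 / 1e12:.1f} TB")
print(f"FP16 weights:    {weights_fp16 / 1e9:.1f} GB")
print(f"cache / weights: {cache_fp16 / weights_fp16:.0f}x")
```

Under these assumptions the cache is on the order of 17 TB against about 14 GB of FP16 weights, which is the sense in which the cache dwarfs the weights at that batch size.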
Thanks for the reply! I am mainly trying bs=256 and context around 1-2k, and find vllm/lmdeploy quite fast.
You can try the
Thank you!
@turboderp By the way, I wonder whether it is OK to use https://github.com/theroyallab/tabbyAPI to test the speed (i.e. will it have nearly the same performance as the direct batch call in bulk_inference.py)? Currently my code tests vllm / lmdeploy through their OpenAI-compatible servers by sending HTTP requests to them.
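Whichever server is used, it helps to drive each one with the same client-side load so the numbers are comparable. Below is a minimal sketch of such a harness. `measure_throughput` and `fake_send_request` are hypothetical names; the fake sender stands in for a real HTTP call to an OpenAI-compatible endpoint and simply returns a completion-token count. This does not say anything about how tabbyAPI compares to a direct batch call.

```python
import asyncio
import time


async def measure_throughput(send_request, prompts, concurrency):
    """Send all prompts with at most `concurrency` in flight.

    `send_request` is any coroutine function that takes a prompt and returns
    the number of generated tokens. Returns (total_tokens, seconds, tokens_per_s).
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(prompt):
        async with semaphore:
            return await send_request(prompt)

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(run_one(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(token_counts)
    return total_tokens, elapsed, total_tokens / elapsed


async def fake_send_request(prompt):
    # Placeholder for a real HTTP request; pretends each reply has 16 tokens.
    await asyncio.sleep(0.01)
    return 16


total, seconds, tokens_per_s = asyncio.run(
    measure_throughput(fake_send_request, ["hello"] * 64, concurrency=32)
)
print(f"{total} tokens in {seconds:.3f}s = {tokens_per_s:.0f} tok/s")
```

Swapping `fake_send_request` for a real client call, while keeping the prompt set and `concurrency` fixed, gives one throughput figure per server measured the same way.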
Problem
Hi, thanks for the library! I hope to run a 7B model on a 24GB 4090 with as much throughput as possible (latency is not a problem, since it is a batch task). vLLM works well, but its 8-bit KV cache seems to degrade the results a lot (or maybe I am not using it correctly yet). exllamav2 seems to have a very good low-bit KV cache, so I would appreciate it if it could reach high throughput with a large batch size (e.g. batch size = 256).
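As a back-of-the-envelope check on the batch size requested above, the sketch below estimates how many sequences fit in the VRAM left over after the weights. Every figure beyond "7B" and "24GB" is an assumption: roughly 4 bits per weight, a 4-bit cache with no overhead for quantization scales, a 2k context (the upper end of the 1-2k mentioned in the comments), and no allowance for activations or fragmentation. `max_batch_size` is a hypothetical helper, not an exllamav2 API.

```python
def max_batch_size(vram_bytes, weight_bytes, context_len,
                   num_layers, num_kv_heads, head_dim, cache_bits):
    """Largest number of full-length sequences whose KV cache fits beside the weights."""
    # Keys + values for one token across all layers and KV heads.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * cache_bits // 8
    free_bytes = vram_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (bytes_per_token * context_len)


VRAM = 24 * 1024**3              # 24 GB card
WEIGHTS = 7_000_000_000 // 2     # assumed ~4 bits per weight for a 7B model

# Assumed shapes: 32 layers, head_dim 128, with either 32 KV heads
# (full multi-head attention) or 8 KV heads (grouped-query attention).
for kv_heads in (32, 8):
    fits = max_batch_size(VRAM, WEIGHTS, 2048, 32, kv_heads, 128, cache_bits=4)
    print(f"{kv_heads} KV heads: about {fits} sequences at 2k context")
```

Under these assumptions the result depends strongly on the number of KV heads, so whether 256 sequences fit at once is as much a property of the model architecture as of the cache bit width.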
Solution
(see above)
Alternatives
No response
Explanation
(see above)
Examples
No response
Additional context
No response