[BUG] Sending two requests asking for streamed response kills the server #26
Comments
Yes, this is a limit I haven't resolved yet. The only solution I have in the short term is to block and process only one request at a time, which is probably better done client side so you don't get request timeouts. Is this something that you need? Dynamic batching is a much more complex solution with vision models: they don't have any consistent way to batch. Sometimes image contexts can be batched and sometimes they can't; most of the time only the chat can be batched, not the image context. This is inconsistent with the expectations of the API, so I have not implemented it at all. The only practical solution I can suggest is to run multiple copies of the server on different ports, perhaps with a load balancer in front. This is not a good general solution because vision models are typically huge and this would require enormous VRAM. So again, this is not implemented.
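One way to apply the client-side workaround mentioned above is to serialize streamed requests behind a lock, so the server only ever sees a single in-flight request. This is a minimal sketch, not part of openedai-vision itself; the base URL, port, and model name are placeholder assumptions you would adjust for your own deployment.

```python
# Minimal sketch: serialize streamed requests client-side so the server
# only ever handles one request at a time.
# Assumptions: an OpenAI-compatible endpoint and the `openai` Python
# package (>=1.0); the URL and model name below are placeholders.
import threading
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")  # placeholder URL
_one_at_a_time = threading.Lock()  # only one streamed request in flight at once

def stream_chat(messages, model="vikhyatk/moondream2"):  # placeholder model name
    """Hold the lock for the full duration of the stream, yielding text chunks."""
    with _one_at_a_time:
        stream = client.chat.completions.create(model=model, messages=messages, stream=True)
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

# Callers in different threads now queue behind the lock instead of
# hitting the server concurrently:
# for text in stream_chat([{"role": "user", "content": "Describe the image."}]):
#     print(text, end="", flush=True)
```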
Actually, for our hackathon we needed to be able to process multiple requests in parallel, and we were lucky that TGI by Huggingface recently added support for MLlama, so we don't use openedai-vision for now. Just wanted to make sure you are aware of the bug.
No problem, you may also be interested in knowing that vllm supports a few of the good vision models and is great for multiple concurrent requests.
Batching not supported yet, but simultaneous requests should no longer hang the server as of 0.42.0 |
See title.
If you try to have it handle two requests in parallel with streamed responses, the responses start, but the server dies halfway through the answer.
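For reference, a minimal reproduction sketch of the failure mode described above: two threads issue streamed requests at the same time. The base URL, port, and model name are placeholder assumptions, not values from this report.

```python
# Minimal reproduction sketch: fire two streamed requests at the server
# concurrently. URL and model name below are placeholders.
import threading
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")  # placeholder URL

def ask(tag):
    stream = client.chat.completions.create(
        model="vikhyatk/moondream2",  # placeholder model name
        messages=[{"role": "user", "content": f"({tag}) Tell me a short story."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(f"[{tag}] {chunk.choices[0].delta.content}", end="", flush=True)

# Two concurrent streamed requests; on affected versions the streams begin
# but stop partway through instead of completing.
threads = [threading.Thread(target=ask, args=(name,)) for name in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```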