
[BUG] Sending two requests asking for streamed response kills the server #26

Closed
Cyb4Black opened this issue Nov 18, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Cyb4Black

Cyb4Black commented Nov 18, 2024

See title.
If you have it handle two requests in parallel with streamed responses, it starts answering both but dies halfway through.
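
A minimal repro sketch of the reported scenario, using the OpenAI Python client against a local openedai-vision endpoint. The base URL, port, model name, and prompt are assumptions for illustration, not taken from the issue:

```python
# Hypothetical repro: fire two streamed chat completions at the same time.
# base_url/port ("http://localhost:5006/v1") and model ("default") are assumed.
import threading
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-none")

def stream_one(tag: str) -> None:
    # Request a streamed response and print chunks as they arrive.
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": f"Describe a sunset ({tag})."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices:
            print(f"[{tag}] {chunk.choices[0].delta.content or ''}", end="", flush=True)

# Start both requests in parallel; the reported failure happens mid-stream.
threads = [threading.Thread(target=stream_one, args=(t,)) for t in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```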

Cyb4Black changed the title from "Sending two requests asking for streamed response kills the server" to "[BUG] Sending two requests asking for streamed response kills the server" Nov 18, 2024
@matatonic
Owner

Yes, this is a limit I haven't resolved yet. The only solution I have in the short term is to just block and only process one request at a time, which is probably better done client side so you don't get request timeouts. Is this something that you need?
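
A minimal sketch of that client-side workaround, assuming the OpenAI Python client and the same hypothetical local endpoint as above: a lock ensures only one streamed request is in flight at a time, so the server never sees two concurrent streams.

```python
# Client-side serialization sketch; endpoint and model name are assumptions.
import threading
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-none")
_one_at_a_time = threading.Lock()

def stream_serialized(prompt: str) -> str:
    # Block until any other in-flight request has finished streaming.
    with _one_at_a_time:
        stream = client.chat.completions.create(
            model="default",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        return "".join(
            chunk.choices[0].delta.content or ""
            for chunk in stream
            if chunk.choices
        )
```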

Dynamic batching is a much more complex solution for vision models: they don't have a consistent way to batch. Sometimes image contexts can be batched, sometimes they can't; most of the time only the chat can be batched, not the image context. This is inconsistent with the expectations of the API, so I have not implemented it at all.

The only practical suggestion I have is to run multiple copies of the server on different ports, perhaps with a load balancer in front. This is not a good general solution because vision models are typically huge, and multiple copies would require an enormous amount of VRAM, so again this is not implemented.
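
For illustration only, a naive client-side version of that idea, assuming two copies of the server were started on ports 5006 and 5007 (both assumed values): requests are handed out round-robin across the instances.

```python
# Round-robin across multiple server instances; ports and model are assumptions.
import itertools
from openai import OpenAI

clients = [
    OpenAI(base_url=f"http://localhost:{port}/v1", api_key="sk-none")
    for port in (5006, 5007)
]
_next_client = itertools.cycle(clients)

def stream_balanced(prompt: str) -> str:
    client = next(_next_client)  # pick the next instance in rotation
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return "".join(
        chunk.choices[0].delta.content or ""
        for chunk in stream
        if chunk.choices
    )
```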

@Cyb4Black
Author

Actually, for our hackathon we needed to be able to process multiple requests in parallel, and we were lucky that TGI by Hugging Face just recently added support for MLlama, so we aren't using openedai-vision for now.

Just wanted to make sure you are aware of the bug.

@matatonic
Owner

No problem. You may also be interested to know that vLLM supports a few of the good vision models and is great for multiple concurrent requests.
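
Since vLLM exposes an OpenAI-compatible API, the same client code can be pointed at it by changing the base URL. A brief sketch; the port (vLLM's common default of 8000) and the placeholder model name are assumptions:

```python
# Pointing the OpenAI client at a vLLM OpenAI-compatible server (sketch).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-none")
stream = client.chat.completions.create(
    model="your-vision-model",  # hypothetical placeholder for the served model
    messages=[{"role": "user", "content": "Describe this image."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
```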

matatonic added the bug (Something isn't working) label Nov 18, 2024
@matatonic
Owner

Batching is still not supported, but simultaneous requests should no longer hang the server as of 0.42.0.
