Support stream response #43
The stream generation hang has been addressed; refer to #42. Candle-vllm can now generate 71 tokens/s per request on an A100 for LLaMa2 7B (BF16) in release mode, which is very close to vLLM (PyTorch backend) at 72 tokens/s.
TODO: batched streaming
I think the only thing holding up batched streaming is the scheduler, right? Is there anything else that you can think of which would need to change?
I think the client (ChatUI) may also need to be revised to support batched streaming. Currently, the frontend and backend (candle-vllm) are both designed to process a single query (batch size = 1) at a time. For example, the chat request and completion messages derived from the OpenAI API were designed around a single-user response (the history messages of a given user). These need to be revised to handle messages from multiple users, so a unique user ID is needed to indicate which messages belong to which user. The backend service would then accept chat completion requests that contain history messages from multiple users, stack and pad them into tensorized input tokens (batch size > 1), and respond with decoding results according to each sequence's finished status, e.g., within a single stacked request, some user messages are finished (removed from the streaming pipeline) while others are still decoding (streaming).

There is, however, another strategy that does not require changes on the client side: the backend service can accept chat completion requests from different users, and a message queue plus a standalone backend thread can be used to grab multiple completion requests from the queue and process them in batches. The current strategy of candle-vllm for processing chat completion requests is to process the current request while blocking incoming requests (model.lock, pipeline.lock) until the current request is finished.
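A minimal sketch of the second (queue-based) strategy described above, assuming a channel-backed request queue and a standalone worker thread that drains up to a fixed number of pending requests and runs one batched decoding step, routing tokens back per user. All names here (`ChatRequest`, `run_batched_decode`, `MAX_BATCH`) and the channel-based design are illustrative assumptions, not candle-vllm's actual implementation.

```rust
// Hedged sketch of the "message queue + standalone batching thread" strategy.
// Hypothetical types and names; not candle-vllm's actual API.

use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// A single user's chat completion request, tagged with a unique user/session id
/// so that decoded tokens can be routed back to the right stream.
struct ChatRequest {
    user_id: u64,
    prompt_tokens: Vec<u32>,
    /// Channel on which this user's streamed tokens are sent back.
    reply_tx: Sender<String>,
}

const MAX_BATCH: usize = 8;

/// Standalone backend thread: drain up to MAX_BATCH pending requests from the
/// queue, then run one batched decoding step over all of them.
fn batching_worker(rx: Receiver<ChatRequest>) {
    loop {
        let mut batch = Vec::new();
        // Block for the first request, then opportunistically grab more.
        match rx.recv() {
            Ok(req) => batch.push(req),
            Err(_) => return, // all senders dropped, shut down
        }
        while batch.len() < MAX_BATCH {
            match rx.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break,
            }
        }
        run_batched_decode(&mut batch);
    }
}

/// Placeholder for the batched decoding step: a real implementation would
/// stack/pad `prompt_tokens` into a (batch_size, max_len) input, decode, and
/// drop finished sequences from the batch while unfinished ones keep streaming.
/// Here we just route one dummy token back per user to show the flow.
fn run_batched_decode(batch: &mut Vec<ChatRequest>) {
    for req in batch.iter() {
        let _ = req.reply_tx.send(format!("<token for user {}>", req.user_id));
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || batching_worker(rx));

    // Two users submit requests; an HTTP handler would just push onto the queue
    // and return a streaming response fed by `reply_rx`.
    let mut replies = Vec::new();
    for user_id in 0..2u64 {
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(ChatRequest { user_id, prompt_tokens: vec![1, 2, 3], reply_tx }).unwrap();
        replies.push(reply_rx);
    }
    drop(tx); // no more requests; lets the worker exit after draining

    for rx in replies {
        while let Ok(tok) = rx.recv() {
            println!("{tok}");
        }
    }
    let _ = worker.join();
}
```

With this design, the model/pipeline lock is taken once per batch step rather than once per user request, which is what removes the "block incoming requests until the current one finishes" behavior described above.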
This has been supported in #69
Streaming responses and batched streaming responses are now supported.
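For reference, a minimal client-side sketch of consuming the streamed response, assuming candle-vllm exposes an OpenAI-compatible `/v1/chat/completions` endpoint that returns server-sent events when `"stream": true` is set. The URL, port, model name, and payload shape below are assumptions for illustration (using the reqwest, tokio, futures-util, and serde_json crates).

```rust
// Hedged client sketch: stream a chat completion and print each SSE data chunk.
// Requires reqwest with the "json" and "stream" features, tokio, futures-util,
// and serde_json. Endpoint address and payload are illustrative assumptions.

use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama2-7b",          // assumed model name
        "stream": true,
        "messages": [{ "role": "user", "content": "Hello!" }]
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:2000/v1/chat/completions") // assumed address/port
        .json(&body)
        .send()
        .await?;

    // Read the SSE stream chunk by chunk and print each `data:` payload.
    // A robust client would buffer partial lines across chunk boundaries.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for line in String::from_utf8_lossy(&chunk).lines() {
            if let Some(payload) = line.strip_prefix("data: ") {
                if payload == "[DONE]" {
                    return Ok(());
                }
                println!("{payload}"); // each payload is a JSON delta chunk
            }
        }
    }
    Ok(())
}
```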
Opening this issue to track the progress of the stream response feature in candle-vllm.