Support stream response #43
The stream generation hang has been addressed; refer to #42. Candle-vllm can now generate 71 tokens/s per request on an A100 for LLaMa2 7B (BF16) in release mode, which is very close to vLLM (PyTorch backend) at 72 tokens/s.
TODO: batched streaming
I think the only thing holding up batched streaming is the scheduler, right? Is there anything else that you can think of which would need to change?
I think the client (ChatUI) may also need to be revised to support batched streaming. Currently, the frontend and backend (candle-vllm) are both designed to process a single query (batch size = 1) at a time. For example, the chat request and completion messages derived from the OpenAI API were designed around a single-user response (the history messages of a given user). These need to be revised to handle messages from multiple users, so a unique user ID is needed to indicate which messages belong to which user. The backend service would then accept chat completion requests that contain history messages from multiple users, stack and pad them into tensorized input tokens (batch size > 1), and respond with decoding results according to each sequence's finished status, e.g., within a single stacked request, some user messages are finished (removed from the streaming pipeline) while others are still decoding (streaming).

There is, however, another strategy that does not require changes on the client side: the backend service can accept chat completion requests from different users, and a message queue plus a standalone backend thread can be used to grab multiple completion requests from the queue and process them in batches. The current strategy of candle-vllm for processing chat completion requests is to process the current request while blocking incoming requests (model.lock, pipeline.lock) until the current request is finished.
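A minimal sketch of the second (queue-based) strategy described above, assuming a channel-backed request queue and a standalone worker thread that drains up to a fixed number of pending requests and runs one batched decoding step, routing tokens back per user. All names here (`ChatRequest`, `run_batched_decode`, `MAX_BATCH`) and the channel-based design are illustrative assumptions, not candle-vllm's actual implementation.

```rust
// Hedged sketch of the "message queue + standalone batching thread" strategy.
// Hypothetical types and names; not candle-vllm's actual API.

use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// A single user's chat completion request, tagged with a unique user/session id
/// so that decoded tokens can be routed back to the right stream.
struct ChatRequest {
    user_id: u64,
    prompt_tokens: Vec<u32>,
    /// Channel on which this user's streamed tokens are sent back.
    reply_tx: Sender<String>,
}

const MAX_BATCH: usize = 8;

/// Standalone backend thread: drain up to MAX_BATCH pending requests from the
/// queue, then run one batched decoding step over all of them.
fn batching_worker(rx: Receiver<ChatRequest>) {
    loop {
        let mut batch = Vec::new();
        // Block for the first request, then opportunistically grab more.
        match rx.recv() {
            Ok(req) => batch.push(req),
            Err(_) => return, // all senders dropped, shut down
        }
        while batch.len() < MAX_BATCH {
            match rx.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break,
            }
        }
        run_batched_decode(&mut batch);
    }
}

/// Placeholder for the batched decoding step: a real implementation would
/// stack/pad `prompt_tokens` into a (batch_size, max_len) input, decode, and
/// drop finished sequences from the batch while unfinished ones keep streaming.
/// Here we just route one dummy token back per user to show the flow.
fn run_batched_decode(batch: &mut Vec<ChatRequest>) {
    for req in batch.iter() {
        let _ = req.reply_tx.send(format!("<token for user {}>", req.user_id));
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || batching_worker(rx));

    // Two users submit requests; an HTTP handler would just push onto the queue
    // and return a streaming response fed by `reply_rx`.
    let mut replies = Vec::new();
    for user_id in 0..2u64 {
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(ChatRequest { user_id, prompt_tokens: vec![1, 2, 3], reply_tx }).unwrap();
        replies.push(reply_rx);
    }
    drop(tx); // no more requests; lets the worker exit after draining

    for rx in replies {
        while let Ok(tok) = rx.recv() {
            println!("{tok}");
        }
    }
    let _ = worker.join();
}
```

With this design, the model/pipeline lock is taken once per batch step rather than once per user request, which is what removes the "block incoming requests until the current one finishes" behavior described above.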
This has been supported in #69
Streaming responses and batched streaming responses are now supported.
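For reference, a minimal client-side sketch of consuming the streamed response, assuming candle-vllm exposes an OpenAI-compatible `/v1/chat/completions` endpoint that returns server-sent events when `"stream": true` is set. The URL, port, model name, and payload shape below are assumptions for illustration (using the reqwest, tokio, futures-util, and serde_json crates).

```rust
// Hedged client sketch: stream a chat completion and print each SSE data chunk.
// Requires reqwest with the "json" and "stream" features, tokio, futures-util,
// and serde_json. Endpoint address and payload are illustrative assumptions.

use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama2-7b",          // assumed model name
        "stream": true,
        "messages": [{ "role": "user", "content": "Hello!" }]
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:2000/v1/chat/completions") // assumed address/port
        .json(&body)
        .send()
        .await?;

    // Read the SSE stream chunk by chunk and print each `data:` payload.
    // A robust client would buffer partial lines across chunk boundaries.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for line in String::from_utf8_lossy(&chunk).lines() {
            if let Some(payload) = line.strip_prefix("data: ") {
                if payload == "[DONE]" {
                    return Ok(());
                }
                println!("{payload}"); // each payload is a JSON delta chunk
            }
        }
    }
    Ok(())
}
```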
Opening this issue to track the progress of the stream response feature in candle-vllm.