
Support stream response #43

Closed
guoqingbao opened this issue Jun 26, 2024 · 7 comments
Labels: enhancement New feature or request

Comments

@guoqingbao (Collaborator)

Open this issue to track the progress of stream response feature in candle-vllm.

@guoqingbao (Collaborator, Author)

Current progress:

66 tokens/s on A100 for LLaMa2 7B (BF16)

Note: there is a problem with the candle-vllm release build in certain environments (abnormal CPU usage in the Rust tokio runtime); a fix is in progress. Use the debug build for the moment.

candle-vllm-demo

@guoqingbao (Collaborator, Author) commented Jun 28, 2024

The stream generation hang has been addressed; refer to #42. Candle-vllm can now generate 71 tokens/s per request on A100 for LLaMa2 7B (BF16) in release mode, which is very close to vLLM with the PyTorch backend (72 tokens/s).

@guoqingbao (Collaborator, Author) commented Jul 4, 2024

TODO: batched streaming

@EricLBuehler (Owner)

TODO: batched streaming

I think the only thing holding up batched streaming is the scheduler, right? Is there anything else that you can think of which would need to change?

@guoqingbao (Collaborator, Author)

TODO: batched streaming

I think the only thing holding up batched streaming is the scheduler, right? Is there anything else that you can think of which would need to change?

I think the client (ChatUI) may also need to be revised to support batched streaming. Currently, the frontend and backend (candle-vllm) are both designed to process a single query at a time (batch size = 1). For example, the chat request and completion messages derived from the OpenAI API were designed around a single-user response (the history messages of a given user). These need to be revised to handle messages from multiple users, so a unique user ID is needed to indicate which messages belong to which user. The backend service would then accept chat completion requests containing history messages from multiple users, and stack and pad them into tensorized input tokens (batch size > 1). The backend also needs to return decoding results according to each sequence's finished status: in a single stacked request, some users' messages may be finished (and removed from the streaming pipeline) while others are still decoding (streaming). A rough sketch of the data shapes involved is given below.
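
To make this concrete, here is a minimal, hypothetical Rust sketch of the data shapes. The names (`BatchedChatRequest`, `user_id`, `SeqState`, `BatchState`) are illustrative assumptions, not candle-vllm's actual types: each request carries a unique user ID alongside its message history, and the backend tracks a per-user finished/decoding state while streaming a stacked batch.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct ChatMessage {
    role: String,    // "user" | "assistant" | "system"
    content: String,
}

#[derive(Debug, Clone)]
struct BatchedChatRequest {
    // Unique id telling the backend which history belongs to which user.
    user_id: String,
    messages: Vec<ChatMessage>,
}

#[derive(Debug)]
enum SeqState {
    Decoding,         // still generating; tokens are streamed back to this user
    Finished(String), // finish reason, e.g. "stop" or "length"
}

#[derive(Debug)]
struct BatchState {
    // One entry per user in the stacked batch; finished sequences are removed
    // from the streaming pipeline while the rest keep decoding.
    sequences: HashMap<String, SeqState>,
}

impl BatchState {
    fn mark_finished(&mut self, user_id: &str, reason: &str) {
        self.sequences
            .insert(user_id.to_string(), SeqState::Finished(reason.to_string()));
    }

    fn active_users(&self) -> Vec<&String> {
        self.sequences
            .iter()
            .filter(|(_, s)| matches!(s, SeqState::Decoding))
            .map(|(u, _)| u)
            .collect()
    }
}

fn main() {
    // Two users' histories arriving as one stacked request (batch size = 2).
    let requests = vec![
        BatchedChatRequest {
            user_id: "user-a".into(),
            messages: vec![ChatMessage { role: "user".into(), content: "Hi".into() }],
        },
        BatchedChatRequest {
            user_id: "user-b".into(),
            messages: vec![ChatMessage { role: "user".into(), content: "Hello".into() }],
        },
    ];

    let mut state = BatchState {
        sequences: requests
            .iter()
            .map(|r| (r.user_id.clone(), SeqState::Decoding))
            .collect(),
    };

    // user-a hits a stop token first; user-b keeps streaming.
    state.mark_finished("user-a", "stop");
    println!("still decoding: {:?}", state.active_users());
}
```

The key point is that the finished/decoding status is tracked per user ID, so one sequence can be dropped from the streaming pipeline without stopping the rest of the batch.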

There is, however, another strategy that does not require changes to the client side. The backend service can accept chat completion requests from different users, and a message queue plus a standalone backend thread can be used to pull multiple completion requests off the queue and process them in batches. Candle-vllm's current strategy is to process one chat completion request at a time while blocking incoming requests (model.lock, pipeline.lock) until the current request is finished. A sketch of the queue-based approach follows.
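
As an illustration of this second, server-side-only strategy, here is a minimal, hypothetical sketch using a tokio mpsc channel (it assumes the `tokio` crate with the `sync`, `macros`, and `rt-multi-thread` features; `QueuedRequest`, `batching_worker`, `MAX_BATCH`, and the echo step are placeholders, not candle-vllm's actual implementation). Handlers only enqueue a request and await a reply, while one standalone worker task drains the queue and processes requests in batches, so no handler holds a model/pipeline lock for the duration of generation.

```rust
use tokio::sync::{mpsc, oneshot};

const MAX_BATCH: usize = 8;

struct QueuedRequest {
    prompt: String,
    // Channel used to send the (placeholder) result back to the waiting handler.
    reply: oneshot::Sender<String>,
}

async fn batching_worker(mut rx: mpsc::Receiver<QueuedRequest>) {
    // Wait for at least one request, then opportunistically grab more without blocking.
    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        while batch.len() < MAX_BATCH {
            match rx.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break, // queue momentarily empty; run what we already have
            }
        }
        // Placeholder for the real batched prefill/decode step over `batch`.
        for req in batch {
            let out = format!("echo: {}", req.prompt);
            let _ = req.reply.send(out);
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<QueuedRequest>(128);
    tokio::spawn(batching_worker(rx));

    // Two "handlers" submitting concurrently; neither blocks on a model lock.
    let mut replies = Vec::new();
    for prompt in ["Hi", "Hello"] {
        let (reply_tx, reply_rx) = oneshot::channel();
        tx.send(QueuedRequest { prompt: prompt.into(), reply: reply_tx })
            .await
            .unwrap();
        replies.push(reply_rx);
    }
    for r in replies {
        println!("{}", r.await.unwrap());
    }
}
```

With this design the existing single-user client protocol stays unchanged; batching happens entirely behind the queue.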

@guoqingbao guoqingbao self-assigned this Jul 19, 2024
@guoqingbao guoqingbao added the enhancement New feature or request label Jul 19, 2024
@guoqingbao (Collaborator, Author)

TODO: batched streaming

This is now supported; see #69.

@guoqingbao (Collaborator, Author)

Streaming responses and batched streaming responses are now both supported.
