
Support stream chat completion & optimization for decoding stage #42

Merged (19 commits, Jun 29, 2024)

Conversation

@guoqingbao (Collaborator) commented Jun 24, 2024

Key changes:

  1. Support stream chat completion: decoding results are streamed to the client in real time, for clients using the OpenAI API with stream=True (Support stream response #43). A hypothetical client-side sketch follows the launch command below.
  2. Remove a redundant transpose operation in the decoding stage (when the sequence length equals 1); a sketch of the idea appears right after this list.
  3. Attempt to fix the block table issue when retrieving the block number.
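To illustrate item 2, here is a minimal hypothetical sketch written against candle's Tensor API; the function name and tensor layout are assumptions for illustration, not the PR's actual code. The idea is that when the sequence length is 1, the transpose-plus-copy can be replaced by a metadata-only reshape (assuming the input is contiguous), since `[b, 1, h, d]` and `[b, h, 1, d]` share the same contiguous layout.

```rust
// Hypothetical sketch, not the PR's code: skip the transpose + contiguous
// copy in the decoding stage, where seq_len == 1 and a reshape is enough.
use candle_core::{Device, Result, Tensor};

fn to_attention_layout(x: &Tensor) -> Result<Tensor> {
    let (b, seq_len, h, d) = x.dims4()?;
    if seq_len == 1 {
        // Decoding step: metadata-only reshape, no data movement.
        x.reshape((b, h, 1usize, d))
    } else {
        // Prefill step: the real transpose is still needed.
        x.transpose(1, 2)?.contiguous()
    }
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // A single decode token: [batch=1, seq=1, heads=32, head_dim=128].
    let x = Tensor::rand(0f32, 1f32, (1, 1, 32, 128), &dev)?;
    println!("{:?}", to_attention_layout(&x)?.dims()); // [1, 32, 1, 128]
    Ok(())
}
```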

So far, it can achieve 71 tokens/s (bf16) for the LLaMa2 7B model on an A100.

cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64
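For reference, here is a minimal, hypothetical client sketch that consumes the streaming endpoint started by the command above. It is not part of this PR; the endpoint path, model name, and JSON field names are assumptions based on the standard OpenAI chat-completions SSE format, and the client-side crates (reqwest, tokio, futures-util, serde_json) are my choice for the sketch.

```rust
// Hypothetical client sketch (not part of this PR): read the OpenAI-style
// SSE stream token by token. Endpoint path, model name, and field names are
// assumptions based on the usual chat-completions format.
// Cargo deps: tokio (full), reqwest (json, stream), futures-util, serde_json.
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama7b",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello!"}],
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()
        .await?;

    // SSE chunks arrive as `data: {json}` lines; buffer across chunk
    // boundaries and print each delta as soon as it is decoded.
    let mut stream = resp.bytes_stream();
    let mut buf = String::new();
    while let Some(chunk) = stream.next().await {
        buf.push_str(&String::from_utf8_lossy(&chunk?));
        while let Some(pos) = buf.find('\n') {
            let line: String = buf.drain(..=pos).collect();
            let line = line.trim();
            if let Some(data) = line.strip_prefix("data: ") {
                if data == "[DONE]" {
                    println!();
                    return Ok(());
                }
                if let Ok(v) = serde_json::from_str::<serde_json::Value>(data) {
                    if let Some(tok) = v["choices"][0]["delta"]["content"].as_str() {
                        print!("{tok}");
                    }
                }
            }
        }
    }
    Ok(())
}
```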

Tested UI:

https://github.com/anse-app/chatgpt-demo

The demo video and ReadMe have been updated accordingly.

@guoqingbao changed the title from "Optimization for decoding stage & try to fix blocktable issue" to "Support stream chat completion & optimization for decoding stage" on Jun 26, 2024
@EricLBuehler (Owner)

Hi @guoqingbao , do you have a number for the performance enhancement by this PR?

@guoqingbao (Collaborator, Author)

Hi @guoqingbao , do you have a number for the performance enhancement by this PR?

Hi Eric, this PR makes the chat respond token by token, and the performance improvement mainly comes from removing the unnecessary transpose ops. I think the improvement in generation speed is marginal, say <5%. The fairly good speed achieved here mostly relies on the paged attention, in my view. 😀

@EricLBuehler (Owner)

That sounds great! Now the big question: How does performance compare to vllm in the same conditions?

@guoqingbao (Collaborator, Author)

That sounds great! Now the big question: How does performance compare to vllm in the same conditions?

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

@guoqingbao (Collaborator, Author) commented Jun 26, 2024

Do you encounter problems in release mode? I found the stream response got stuck in release mode on one A100 server but not on another; debug mode has no such issue. The tokio runtime mysteriously hangs at the second request (CPU usage 100%). I believe release mode can achieve a faster generation speed.

@EricLBuehler (Owner) commented Jun 26, 2024

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

@guoqingbao (Collaborator, Author)

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

That's very close to vLLM. By multi-query I mean batch size > 1; in the single-query setting (one user, batch size = 1) that I have tested, candle-vllm can generate around 66 tokens/s in the decoding stage thanks to the paged attention (the KV cache is no longer the bottleneck), even after tens of requests. Do you mean the 2x longer time is for batch size = 2, or for the second request in the single-query setting (batch size = 1)? Debug mode shouldn't have such an issue. I encountered the problem with the second request in release mode on one server but not on another; perhaps the tokio dependency versions differ.

@guoqingbao (Collaborator, Author)

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

Hi Eric,

I have managed to fix the stream generation hang in release mode by making the generation function and sender async. In release mode, candle-vllm can now generate 71 tokens/s on an A100 (BF16) for the LLaMa2 7B model. Please feel free to test and merge! :)

#43
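To illustrate the kind of fix described above, here is a small hypothetical sketch (not the PR's actual code) of streaming generated tokens through an async tokio channel, so the producer awaits instead of blocking the runtime that serves the HTTP response.

```rust
// Hypothetical sketch, not the PR's code: send tokens through an async
// mpsc channel so the generation side yields to the tokio runtime instead
// of blocking it.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(32);

    // Producer: in the real server this would be the model's decode loop;
    // here a few placeholder tokens stand in for generated text.
    tokio::spawn(async move {
        for tok in ["Hello", ", ", "world", "!"] {
            // `send(...).await` suspends when the channel is full rather
            // than busy-waiting on the runtime.
            if tx.send(tok.to_string()).await.is_err() {
                break; // receiver dropped, i.e. client disconnected
            }
        }
    });

    // Consumer: forward each token as it arrives (an SSE writer in the
    // real server, stdout here).
    while let Some(tok) = rx.recv().await {
        print!("{tok}");
    }
    println!();
}
```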

@EricLBuehler (Owner)

That is great. Thanks for the hard work!

I'll merge this soon, but could you please test the scenario where multiple requests are being run at once vs vllm?

@guoqingbao (Collaborator, Author)

That is great. Thanks for the hard work!

I'll merge this soon, but could you please test the scenario where multiple requests are being run at once vs vllm?

I haven't implemented the logic for processing multiple requests (batched inputs) at the same time. It requires some heavy revision for accepting chat completion requests (stacking), padding for tokenization, etc. I will try to add batched processing in a future PR.
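As a rough illustration of the padding step such batched processing would need, here is a small hypothetical sketch (not existing candle-vllm code): right-pad each tokenized prompt to the longest sequence in the batch and keep a mask so padded positions can be ignored.

```rust
// Hypothetical sketch of batch padding (not existing candle-vllm code):
// right-pad every tokenized prompt to the longest length in the batch and
// build a mask marking real (1) vs padded (0) positions.
fn pad_batch(seqs: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = seqs.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(seqs.len());
    let mut mask = Vec::with_capacity(seqs.len());
    for s in seqs {
        let mut row = s.clone();
        let mut m = vec![1u8; s.len()];
        row.resize(max_len, pad_id);
        m.resize(max_len, 0);
        padded.push(row);
        mask.push(m);
    }
    (padded, mask)
}

fn main() {
    let batch = vec![vec![1u32, 2, 3], vec![4, 5]];
    let (padded, mask) = pad_batch(&batch, 0);
    assert_eq!(padded, vec![vec![1, 2, 3], vec![4, 5, 0]]);
    assert_eq!(mask, vec![vec![1, 1, 1], vec![1, 1, 0]]);
}
```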

@EricLBuehler (Owner) left a comment

Looks good, thank you!

@EricLBuehler merged commit ae35a3a into EricLBuehler:master on Jun 29, 2024
5 checks passed