Support stream chat completion & optimization for decoding stage #42
Conversation
Hi @guoqingbao, do you have a number for the performance enhancement from this PR?
Hi Eric, this PR makes the chat response stream token by token, and the performance improvements mainly came from removing unnecessary transpose ops. I think the improvement in generation speed is marginal, say <5%. The fairly good speed achieved here depends on paged attention, in my view. 😀
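For illustration, a minimal sketch of the paged-attention KV-cache addressing mentioned here, assuming a fixed block size and hypothetical names (`BlockTable`, `slot_of`); candle-vllm's actual cache layout may differ:

```rust
/// Hypothetical illustration of paged KV-cache addressing: each sequence owns a
/// list of fixed-size physical blocks, and a token's logical position is mapped
/// to a (block, offset) slot instead of one large contiguous buffer.
const BLOCK_SIZE: usize = 16;

struct BlockTable {
    /// Physical block indices allocated to this sequence, in logical order.
    blocks: Vec<usize>,
}

impl BlockTable {
    /// Map a logical token position to its physical slot in the KV cache.
    fn slot_of(&self, token_pos: usize) -> (usize, usize) {
        let block = self.blocks[token_pos / BLOCK_SIZE];
        let offset = token_pos % BLOCK_SIZE;
        (block, offset)
    }
}

fn main() {
    let table = BlockTable { blocks: vec![7, 2, 9] };
    // Token 20 falls in the second logical block (physical block 2), offset 4.
    assert_eq!(table.slot_of(20), (2, 4));
}
```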
That sounds great! Now the big question: how does performance compare to vLLM under the same conditions?
Perhaps we can evaluate vLLM on an A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.
Do you encounter problems in release mode? I found the stream response got stuck in release mode on one A100 server but not on another; debug mode has no such issue. The tokio runtime mysteriously hangs at the second request (CPU usage 100%). I believe release mode can achieve faster generation speed.
Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multi-query, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.
That's very close to vLLM. By multi-query I mean batch size > 1. In the single-query setting (one user, batch size = 1) that I have tested, candle-vllm can generate around 66 t/s in the decoding stage thanks to paged attention (the KV cache is no longer the bottleneck), even after tens of requests. Do you mean the 2x slowdown is for batch size = 2, or for the second request in the single-query setting (batch size = 1)? Debug mode shouldn't have such an issue; I encountered the problem for the second request in release mode on one server but not on another. Perhaps the tokio library dependencies differ between them.
Hi Eric, I have managed to fix the stream generation hang in release mode by making the generation function and the sender async. In release mode, candle-vllm can now generate 71 tokens/s on an A100 (BF16) for the LLaMA2 7B model. Please feel free to test and merge! :)
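A minimal sketch of the async generation/sender pattern described here, using a tokio mpsc channel; function names are hypothetical and this is not the actual candle-vllm code:

```rust
// Sketch: stream tokens from an async generation task to a response writer
// through a tokio mpsc channel, so the sender yields to the runtime instead
// of blocking it the way a synchronous send can.
use tokio::sync::mpsc;

async fn generate(tx: mpsc::Sender<String>) {
    // Hypothetical decode loop: in a real server this would run the model's
    // forward pass for each decoding step.
    for step in 0..5 {
        let token = format!("tok{step} ");
        // `send` is awaited, so backpressure suspends the task rather than
        // spinning a thread at 100% CPU.
        if tx.send(token).await.is_err() {
            break; // client disconnected
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(32);
    tokio::spawn(generate(tx));
    // Stand-in for the HTTP response stream: print each chunk as it arrives.
    while let Some(chunk) = rx.recv().await {
        print!("{chunk}");
    }
    println!();
}
```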
That is great, thanks for the hard work! I'll merge this soon, but could you please test the scenario where multiple requests are run at once vs vLLM?
I haven't implemented the logic for processing multiple requests (batched inputs) at the same time. It requires some heavy revision for accepting and stacking chat completion requests, padding during tokenization, etc. I will try to add batched processing in a future PR.
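A minimal sketch of the padding step such a batched path would need, assuming left-padding with a pad-token id and a 0/1 attention mask (`pad_batch` and its conventions are hypothetical):

```rust
/// Pad a batch of token sequences to the same length so they can be stacked
/// into one tensor. Returns (padded ids, attention mask) where the mask is 1
/// for real tokens and 0 for padding. `pad_id` is an assumed pad-token id.
fn pad_batch(batch: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = batch.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut ids = Vec::with_capacity(batch.len());
    let mut mask = Vec::with_capacity(batch.len());
    for seq in batch {
        let pad = max_len - seq.len();
        // Left-pad so the most recent tokens stay aligned for decoding.
        let mut row = vec![pad_id; pad];
        row.extend_from_slice(seq);
        let mut m = vec![0u8; pad];
        m.extend(std::iter::repeat(1u8).take(seq.len()));
        ids.push(row);
        mask.push(m);
    }
    (ids, mask)
}

fn main() {
    let batch = vec![vec![1, 2, 3], vec![4, 5]];
    let (ids, mask) = pad_batch(&batch, 0);
    assert_eq!(ids[1], vec![0, 4, 5]);
    assert_eq!(mask[1], vec![0, 1, 1]);
}
```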
Looks good, thank you!
Key changes:
- Stream chat completion (token-by-token responses).
- Removed unnecessary transpose ops to optimize the decoding stage.

So far, it can achieve 71 tokens/s (BF16) for the LLaMA2 7B model on an A100.
Tested UI:
https://github.com/anse-app/chatgpt-demo
The demo video and README have been updated accordingly.
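For reference, a minimal client sketch for consuming the streaming chat completion endpoint; the address, port, and model name are assumptions based on an OpenAI-style API and may need adjusting to the actual server configuration:

```rust
// Sketch of a client reading a streamed chat completion response.
// Requires reqwest with the "json" and "stream" features, plus tokio,
// futures-util, and serde_json.
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama2-7b",          // assumed model name
        "stream": true,
        "messages": [{ "role": "user", "content": "Hello!" }]
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:2000/v1/chat/completions") // assumed address
        .json(&body)
        .send()
        .await?;

    // Print each streamed chunk (SSE-style `data: ...` lines) as it arrives.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```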