
Support stream chat completion & optimization for decoding stage #42

Merged (19 commits, Jun 29, 2024)

Conversation

@guoqingbao (Collaborator) commented Jun 24, 2024

Key changes:

  1. Support stream chat completion: decoding results are streamed to the client in real time, for clients using the OpenAI API with stream=True (Support stream response #43). A hypothetical client-side sketch follows the launch command below.
  2. Remove a redundant transpose operation in the decoding stage (when the sequence length equals 1); a sketch of the idea appears right after this list.
  3. Attempt to fix the block table issue when retrieving the block number.
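To illustrate item 2, here is a minimal hypothetical sketch written against candle's Tensor API; the function name and tensor layout are assumptions for illustration, not the PR's actual code. The idea is that when the sequence length is 1, the transpose-plus-copy can be replaced by a metadata-only reshape (assuming the input is contiguous), since `[b, 1, h, d]` and `[b, h, 1, d]` share the same contiguous layout.

```rust
// Hypothetical sketch, not the PR's code: skip the transpose + contiguous
// copy in the decoding stage, where seq_len == 1 and a reshape is enough.
use candle_core::{Device, Result, Tensor};

fn to_attention_layout(x: &Tensor) -> Result<Tensor> {
    let (b, seq_len, h, d) = x.dims4()?;
    if seq_len == 1 {
        // Decoding step: metadata-only reshape, no data movement.
        x.reshape((b, h, 1usize, d))
    } else {
        // Prefill step: the real transpose is still needed.
        x.transpose(1, 2)?.contiguous()
    }
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // A single decode token: [batch=1, seq=1, heads=32, head_dim=128].
    let x = Tensor::rand(0f32, 1f32, (1, 1, 32, 128), &dev)?;
    println!("{:?}", to_attention_layout(&x)?.dims()); // [1, 32, 1, 128]
    Ok(())
}
```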

So far, it can achieve 71 tokens/s (bf16) for the LLaMa2 7B model on an A100.

cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64
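For reference, here is a minimal, hypothetical client sketch that consumes the streaming endpoint started by the command above. It is not part of this PR; the endpoint path, model name, and JSON field names are assumptions based on the standard OpenAI chat-completions SSE format, and the client-side crates (reqwest, tokio, futures-util, serde_json) are my choice for the sketch.

```rust
// Hypothetical client sketch (not part of this PR): read the OpenAI-style
// SSE stream token by token. Endpoint path, model name, and field names are
// assumptions based on the usual chat-completions format.
// Cargo deps: tokio (full), reqwest (json, stream), futures-util, serde_json.
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama7b",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello!"}],
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:2000/v1/chat/completions")
        .json(&body)
        .send()
        .await?;

    // SSE chunks arrive as `data: {json}` lines; buffer across chunk
    // boundaries and print each delta as soon as it is decoded.
    let mut stream = resp.bytes_stream();
    let mut buf = String::new();
    while let Some(chunk) = stream.next().await {
        buf.push_str(&String::from_utf8_lossy(&chunk?));
        while let Some(pos) = buf.find('\n') {
            let line: String = buf.drain(..=pos).collect();
            let line = line.trim();
            if let Some(data) = line.strip_prefix("data: ") {
                if data == "[DONE]" {
                    println!();
                    return Ok(());
                }
                if let Ok(v) = serde_json::from_str::<serde_json::Value>(data) {
                    if let Some(tok) = v["choices"][0]["delta"]["content"].as_str() {
                        print!("{tok}");
                    }
                }
            }
        }
    }
    Ok(())
}
```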

Tested UI:

https://github.com/anse-app/chatgpt-demo

The demo video and ReadMe have been updated accordingly.

@guoqingbao changed the title from "Optimization for decoding stage & try to fix blocktable issue" to "Support stream chat completion & optimization for decoding stage" on Jun 26, 2024
@EricLBuehler (Owner)

Hi @guoqingbao , do you have a number for the performance enhancement by this PR?

@guoqingbao (Collaborator, Author)

Hi @guoqingbao , do you have a number for the performance enhancement by this PR?

Hi Eric, this PR makes the chat respond token by token, and the performance improvement mainly comes from removing the unnecessary transpose ops. I think the improvement in generation speed is marginal, say <5%. The fairly good speed achieved here mostly relies on the paged attention, in my view. 😀

@EricLBuehler (Owner)

That sounds great! Now the big question: How does performance compare to vllm in the same conditions?

@guoqingbao (Collaborator, Author)

That sounds great! Now the big question: How does performance compare to vllm in the same conditions?

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

@guoqingbao (Collaborator, Author) commented Jun 26, 2024

Do you encounter problems in release mode? I found the stream response got stuck in release mode on one A100 server but not on another; debug mode has no such issue. The tokio runtime mysteriously hangs at the second request (CPU usage 100%). I believe release mode can achieve a faster generation speed.

@EricLBuehler (Owner) commented Jun 26, 2024

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

@guoqingbao (Collaborator, Author)

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

That's very close to vLLM. By multi-query I mean batch size > 1; in the single-query setting (one user, batch size = 1) that I have tested, candle-vllm can generate around 66 tokens/s in the decoding stage thanks to the paged attention (the KV cache is no longer the bottleneck), even after tens of requests. Do you mean the 2x longer time is for batch size = 2, or for the second request in the single-query setting (batch size = 1)? Debug mode shouldn't have such an issue. I encountered the problem with the second request in release mode on one server but not on another; perhaps the tokio dependency versions differ.

@guoqingbao (Collaborator, Author)

Perhaps we can evaluate vLLM on the A100 for single-query response performance. I haven't tested the multi-query performance of candle-vllm.

Looks like vLLM gets 72 T/s while we are at 67 T/s: ~7% behind, and that is in debug mode. For multiquery, the second request takes 2x as long for some reason? Maybe that is tied to the strange tokio behavior.

Hi Eric,

I have managed to fix the stream generation hang in release mode by making the generation function and sender async. In release mode, candle-vllm can now generate 71 tokens/s on an A100 (BF16) for the LLaMa2 7B model. Please feel free to test and merge! :)

#43
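To illustrate the kind of fix described above, here is a small hypothetical sketch (not the PR's actual code) of streaming generated tokens through an async tokio channel, so the producer awaits instead of blocking the runtime that serves the HTTP response.

```rust
// Hypothetical sketch, not the PR's code: send tokens through an async
// mpsc channel so the generation side yields to the tokio runtime instead
// of blocking it.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(32);

    // Producer: in the real server this would be the model's decode loop;
    // here a few placeholder tokens stand in for generated text.
    tokio::spawn(async move {
        for tok in ["Hello", ", ", "world", "!"] {
            // `send(...).await` suspends when the channel is full rather
            // than busy-waiting on the runtime.
            if tx.send(tok.to_string()).await.is_err() {
                break; // receiver dropped, i.e. client disconnected
            }
        }
    });

    // Consumer: forward each token as it arrives (an SSE writer in the
    // real server, stdout here).
    while let Some(tok) = rx.recv().await {
        print!("{tok}");
    }
    println!();
}
```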

@EricLBuehler (Owner)

That is great. Thanks for the hard work!

I'll merge this soon, but could you please test the scenario where multiple requests are being run at once vs vllm?

@guoqingbao (Collaborator, Author)

That is great. Thanks for the hard work!

I'll merge this soon, but could you please test the scenario where multiple requests are being run at once vs vllm?

I haven't implemented the logic for processing multiple requests (batched inputs) at the same time. It requires some heavy revision for accepting chat completion requests (stacking), padding for tokenization, etc. I will try to add batched processing in a future PR.
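As a rough illustration of the padding step such batched processing would need, here is a small hypothetical sketch (not existing candle-vllm code): right-pad each tokenized prompt to the longest sequence in the batch and keep a mask so padded positions can be ignored.

```rust
// Hypothetical sketch of batch padding (not existing candle-vllm code):
// right-pad every tokenized prompt to the longest length in the batch and
// build a mask marking real (1) vs padded (0) positions.
fn pad_batch(seqs: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = seqs.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut padded = Vec::with_capacity(seqs.len());
    let mut mask = Vec::with_capacity(seqs.len());
    for s in seqs {
        let mut row = s.clone();
        let mut m = vec![1u8; s.len()];
        row.resize(max_len, pad_id);
        m.resize(max_len, 0);
        padded.push(row);
        mask.push(m);
    }
    (padded, mask)
}

fn main() {
    let batch = vec![vec![1u32, 2, 3], vec![4, 5]];
    let (padded, mask) = pad_batch(&batch, 0);
    assert_eq!(padded, vec![vec![1, 2, 3], vec![4, 5, 0]]);
    assert_eq!(mask, vec![vec![1, 1, 1], vec![1, 1, 0]]);
}
```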

@EricLBuehler (Owner) left a comment

Looks good, thank you!

@EricLBuehler merged commit ae35a3a into EricLBuehler:master on Jun 29, 2024
5 checks passed