Configurable kvcache & fix repeat chat history #41
Conversation
Thanks for this PR, it looks great! I just made a few comments regarding some magic values and the calculation of the number of CPU blocks.
I have created a demo chat for candle-vllm, which includes instructions for running a candle-vllm-backed chat conversation and a demo video. If you think it is suitable, I can include it in the README of candle-vllm.
That would be great, please feel free to do so!
The instructions for ChatUI and the demo video have been added.
Great, looks good!
In this PR, several issues related to chat history have been fixed, making candle-vllm compatible with ChatGPT-like frontends (chat UIs).
The use of the kvcache can now be configured with the parameter `kvcache_mem`, which specifies the amount of GPU memory reserved for the kvcache, in MB (default: 4096 MB). Candle-vllm calculates the number of GPU blocks used for the kvcache from this value.
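To make that calculation concrete, here is a minimal sketch of how a memory budget in MB can translate into a number of cache blocks. This is not the actual candle-vllm code; the function names, block size, and model shape below are illustrative assumptions.

```rust
/// Bytes required for one cache block: key + value tensors across all layers.
fn block_bytes(block_size: usize, num_layers: usize, num_kv_heads: usize,
               head_dim: usize, dtype_size: usize) -> usize {
    // Factor of 2: one tensor for keys, one for values.
    2 * block_size * num_layers * num_kv_heads * head_dim * dtype_size
}

/// Number of whole blocks that fit in the configured kvcache budget.
fn num_gpu_blocks(kvcache_mem_mb: usize, bytes_per_block: usize) -> usize {
    kvcache_mem_mb * 1024 * 1024 / bytes_per_block
}

fn main() {
    // Hypothetical shape: 16-token blocks, 32 layers, 32 kv heads,
    // head dim 128, f16 (2 bytes per element) -> 8 MB per block.
    let per_block = block_bytes(16, 32, 32, 128, 2);
    // With the 4096 MB default from this PR, that yields 512 blocks.
    println!("{} GPU blocks fit in 4096 MB", num_gpu_blocks(4096, per_block));
}
```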
Chat history management can be configured with the parameter `record_conversation`. By default, candle-vllm does not record chat history; instead, the client sends both the new messages and the contextual history to candle-vllm with each request. This resolves the issue of chat history being recorded twice, once by the client and once by candle-vllm. If `record_conversation` is set to `true`, the client sends only new chat messages, and candle-vllm is responsible for recording the previous ones. However, this approach requires per-session chat recording, which is not yet implemented, so the default approach is recommended.

I also discovered that while the kvcache is cleared after each request, the corresponding chat history in candle-vllm was not, causing previous chats to affect new ones. This PR addresses that issue as well.
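For illustration, under the default `record_conversation = false` mode, each request carries the full conversation so far. Assuming an OpenAI-style chat completions payload (the format ChatGPT-like UIs typically send; the model name and messages here are hypothetical), a third-turn request might look like:

```json
{
  "model": "llama-7b",
  "messages": [
    { "role": "user", "content": "What is candle-vllm?" },
    { "role": "assistant", "content": "A Rust-based inference server built on candle." },
    { "role": "user", "content": "How do I configure the kvcache size?" }
  ]
}
```

Because the client resends the history every turn, the server can stay stateless across requests, which is why no per-session recording is needed in this mode.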
I will create a demo video showcasing chat conversations with popular chat UIs using candle-vllm as the backend service.