Fix pipeline generation (kernel launch, kernel compilation, rwlock, paged attention, etc.) #37
Hi Eric,
As mentioned in the email, this PR aims to make candle-vllm workable with chat completion requests. Key changes include:
- **Kernel launch:** adjusted the launches for rotary embedding, `reshape_and_cache`, and paged attention.
- **Kernel compilation:** switched to `bindgen_cuda` for building both the PTX and the CUDA libraries; paged attention is now launched through FFI instead of PTX (a `build.rs` sketch follows this list).
- **Weight loading:** added support for local weights, and switched to candle's default weight loader (`VarBuilder`) for faster loading (loader sketch below).
- **Lock usage:** replaced `Mutex` with `RwLock` for sequence operations to prevent potential deadlocks (locking sketch below).
- **Pipeline generation:** several fixes to make the pipeline work end to end.
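
To make the compilation change concrete, here is a minimal `build.rs` sketch in the style of candle's own kernel crates, using `bindgen_cuda`'s `build_ptx` and `build_lib` entry points. The kernel file names, output layout, and compiler flags are illustrative assumptions; the PR's actual build script differs in its details.

```rust
// build.rs — a minimal sketch, not the PR's exact build script.
use std::env;
use std::path::PathBuf;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=kernels/");

    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());

    // Simple kernels (rotary embedding, reshape_and_cache, ...) are compiled
    // to PTX that candle loads at runtime.
    let bindings = bindgen_cuda::Builder::default().build_ptx().unwrap();
    bindings.write("src/lib.rs").unwrap();

    // Paged attention is compiled into a static library and called via FFI
    // instead of being launched from PTX.
    bindgen_cuda::Builder::default()
        .kernel_paths(vec![PathBuf::from("kernels/pagedattention.cu")]) // assumed path
        .out_dir(out_dir.clone())
        .arg("-std=c++17")
        .arg("-O3")
        .build_lib(out_dir.join("libpagedattention.a"));

    println!("cargo:rustc-link-search={}", out_dir.display());
    println!("cargo:rustc-link-lib=pagedattention");
    println!("cargo:rustc-link-lib=dylib=cudart");
}
```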
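
For the weight-loading side, this is roughly what loading local safetensors weights through candle's `VarBuilder` looks like; the `load_weights` helper, the directory scan, and the F16 dtype are illustrative assumptions rather than the PR's exact code.

```rust
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

// Hypothetical loader: memory-map local *.safetensors files through candle's
// default VarBuilder instead of reading tensors one by one.
fn load_weights(weight_path: &str, device: &Device) -> Result<VarBuilder<'static>> {
    // Collect the *.safetensors files under the local weight directory.
    let mut files = Vec::new();
    for entry in std::fs::read_dir(weight_path)? {
        let path = entry?.path();
        if path.extension().is_some_and(|e| e == "safetensors") {
            files.push(path);
        }
    }
    // Memory-mapping is `unsafe` because the files must not be modified
    // while the mapping is alive.
    unsafe { VarBuilder::from_mmaped_safetensors(&files, DType::F16, device) }
}
```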
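
And on the locking change, a small sketch of the `RwLock` pattern, with a hypothetical `SequenceState` type standing in for the real sequence data: reads take a shared lock so concurrent readers don't serialize against each other, and the exclusive lock is held only for mutation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Illustrative sequence table; the field layout is hypothetical.
struct SequenceState {
    tokens: Vec<u32>,
    finished: bool,
}

type SequenceTable = Arc<RwLock<HashMap<u64, SequenceState>>>;

fn is_finished(table: &SequenceTable, seq_id: u64) -> bool {
    // Shared read access: does not block other readers.
    table.read().unwrap().get(&seq_id).is_some_and(|s| s.finished)
}

fn append_token(table: &SequenceTable, seq_id: u64, token: u32) {
    // Exclusive write access, held as briefly as possible.
    if let Some(seq) = table.write().unwrap().get_mut(&seq_id) {
        seq.tokens.push(token);
    }
}

fn main() {
    let table: SequenceTable = Arc::new(RwLock::new(HashMap::new()));
    table
        .write()
        .unwrap()
        .insert(7, SequenceState { tokens: vec![], finished: false });
    append_token(&table, 7, 42);
    assert!(!is_finished(&table, 7));
}
```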
Additionally, I've made flash attention optional because of the significant time it takes to build on every run. I'm still verifying the correctness of the inference results and hope to address any remaining issues in follow-up PRs.
Test case:

`cargo run -- --port 65320 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64`
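
Once the server is up, a chat completion request can be sent against it. This assumes candle-vllm exposes an OpenAI-style `/v1/chat/completions` route on the chosen port; the exact route and payload fields are inferred from the chat completion goal above rather than confirmed by this PR:

`curl http://localhost:65320/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama7b", "messages": [{"role": "user", "content": "Hello!"}]}'`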