
Fix pipeline generation (kernel launch, kernel compilation, rwlock, paged attention, etc.) #37

Merged
merged 5 commits into EricLBuehler:master on Jun 10, 2024

Conversation

guoqingbao (Collaborator) commented:

Hi Eric,

As mentioned in the email, this PR aims to make candle-vllm work with chat completion requests. Key changes include:

Kernel Launch: adjusted the launches for the rotary embedding, reshape-and-cache, and paged attention kernels.

Kernel Compilation: switched to bindgen_cuda for building both the PTX and the CUDA libraries; paged attention is now launched through FFI instead of PTX (a build.rs sketch follows this list).

Weight Loading: added support for local weights; candle's default weight loader (VarBuilder) is now used for faster loading (a loading sketch follows this list).

Lock Usage: replaced Mutex with RwLock for sequence operations to prevent potential deadlocks (a locking sketch follows this list).

Pipeline Generation: several fixes so that the generation pipeline runs end to end.
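
For context on the kernel-compilation change, here is a minimal build.rs sketch of a bindgen_cuda-based setup. The kernel file name, library name, and output paths are placeholders, not necessarily the ones used in this PR.

// build.rs -- illustrative sketch: compile most kernels to PTX with
// bindgen_cuda, and build the paged-attention kernels as a static library
// that the Rust side launches through FFI.
fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    // Compile the .cu files discovered by the builder into PTX and write the
    // generated Rust constants into src/lib.rs so they can be loaded at runtime.
    let ptx_builder = bindgen_cuda::Builder::default();
    let bindings = ptx_builder.build_ptx().unwrap();
    bindings.write("src/lib.rs").unwrap();

    // Build the paged-attention kernels into a static library and link it;
    // the Rust side then declares the launchers as extern "C" and calls them
    // directly instead of loading PTX. (CUDA library search-path setup is
    // omitted here for brevity.)
    let lib_builder = bindgen_cuda::Builder::default()
        .kernel_paths(vec![std::path::PathBuf::from("kernels/pagedattention.cu")]);
    lib_builder.build_lib(std::path::PathBuf::from("libpagedattention.a"));
    println!("cargo:rustc-link-lib=pagedattention");
    println!("cargo:rustc-link-lib=dylib=cudart");
}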
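
A similar sketch for the weight-loading change: loading local safetensors weights through candle's default VarBuilder with memory-mapped files. The directory scan, function name, and F16 dtype are illustrative assumptions rather than the exact code in this PR.

// Illustrative sketch of loading local safetensors weights with candle's
// default weight loader (VarBuilder).
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

fn load_weights(weight_path: &str, device: &Device) -> Result<VarBuilder<'static>> {
    // Collect every *.safetensors file under the local weight path
    // (e.g. the directory passed via --weight-path).
    let mut filenames = Vec::new();
    for entry in std::fs::read_dir(weight_path)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("safetensors") {
            filenames.push(path);
        }
    }
    // Memory-map the files instead of copying them; unsafe because the
    // files must not be modified while they are mapped.
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, DType::F16, device)? };
    Ok(vb)
}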
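
And a small sketch of the Mutex-to-RwLock change for sequence operations; the Sequence and SequenceGroup types below are simplified stand-ins for candle-vllm's actual structures.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical sequence state, standing in for the real sequence data.
struct Sequence {
    tokens: Vec<u32>,
}

// Sequence operations guarded by an RwLock instead of a Mutex, so read-only
// accesses (e.g. scheduling decisions) can proceed concurrently.
struct SequenceGroup {
    seqs: Arc<RwLock<HashMap<usize, Sequence>>>,
}

impl SequenceGroup {
    fn token_count(&self, id: usize) -> Option<usize> {
        // Shared read lock: many readers can hold this at once.
        self.seqs.read().unwrap().get(&id).map(|s| s.tokens.len())
    }

    fn append_token(&self, id: usize, token: u32) {
        // Exclusive write lock, taken only when mutating.
        if let Some(seq) = self.seqs.write().unwrap().get_mut(&id) {
            seq.tokens.push(token);
        }
    }
}

Read-mostly paths take the shared lock, so a long-lived reader no longer holds an exclusive lock that every other task has to block on.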

Additionally, I've made flash attention optional due to the significant time it takes to build on each run. I'm currently working on ensuring the correctness of the inference results and hope to address that in future PRs.
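
A rough sketch of how the opt-in compilation can look on the Rust side; the feature name flash-attn is an assumption, not necessarily the flag used in this repository.

// Hypothetical feature gate (feature name "flash-attn" assumed): when the
// feature is off, the flash-attention code path is compiled out, so its
// long CUDA build can be skipped entirely.
#[cfg(feature = "flash-attn")]
fn flash_attention_enabled() -> bool {
    true
}

#[cfg(not(feature = "flash-attn"))]
fn flash_attention_enabled() -> bool {
    false
}

fn main() {
    // Built with `cargo run --features flash-attn` this prints true;
    // with the default features it prints false.
    println!("flash attention enabled: {}", flash_attention_enabled());
}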

Test case:

cargo run -- --port 65320 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64

curl -X POST "http://127.0.0.1:65320/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
           "model": "llama7b",
           "messages": [
               {"role": "user", "content": "Hello!"},
               {"role": "assistant", "content": "How can I assist you today?"}
           ],
           "temperature": 0.7
         }'

Response:

{"id":"cmpl-a166148e-7e9c-40b4-94f4-42f29a95e3a4","choices":[{"message":{"content":"给给给给给给给给给给给给给给给给给","role":"[INST]"},"finish_reason":"length","index":0,"logprobs":{"content":[{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]},{"token":31999,"logprob":0.0,"bytes":"给","top_logprobs":[]}]}}],"created":1717748857,"model":"llama7b","object":"chat.completion","usage":{"completion_tokens":17,"prompt_tokens":35,"total_tokens":52}}

EricLBuehler (Owner) left a comment:


@guoqingbao it looks like Typos and Rustfmt are failing in CI; once that is fixed I will be happy to merge!

EricLBuehler merged commit 431c633 into EricLBuehler:master on Jun 10, 2024.
3 of 5 checks passed
EricLBuehler (Owner) commented:

Thank you!

guoqingbao (Collaborator, Author) commented:

Thanks, Eric.
