Fix pipeline generation (kernel launch, kernel compilation, rwlock, paged attention, etc.) #37
Hi Eric,
As mentioned in the email, this PR aims to make candle-vllm workable with chat completion requests. Key changes include:
- **Kernel launch:** adjusted the launches for rotary embedding, `reshape_and_cache`, and paged attention.
- **Kernel compilation:** switched to `bindgen_cuda` for building both the PTX and the CUDA libraries; paged attention is now launched through FFI instead of PTX (a `build.rs` sketch follows this list).
- **Weight loading:** added support for local weights, and switched to candle's default weight loader (`VarBuilder`) for faster loading (loader sketch below).
- **Lock usage:** replaced `Mutex` with `RwLock` for sequence operations to prevent potential deadlocks (locking sketch below).
- **Pipeline generation:** several fixes to make the pipeline work end to end.
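
To make the compilation change concrete, here is a minimal `build.rs` sketch in the style of candle's own kernel crates, using `bindgen_cuda`'s `build_ptx` and `build_lib` entry points. The kernel file names, output layout, and compiler flags are illustrative assumptions; the PR's actual build script differs in its details.

```rust
// build.rs — a minimal sketch, not the PR's exact build script.
use std::env;
use std::path::PathBuf;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=kernels/");

    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());

    // Simple kernels (rotary embedding, reshape_and_cache, ...) are compiled
    // to PTX that candle loads at runtime.
    let bindings = bindgen_cuda::Builder::default().build_ptx().unwrap();
    bindings.write("src/lib.rs").unwrap();

    // Paged attention is compiled into a static library and called via FFI
    // instead of being launched from PTX.
    bindgen_cuda::Builder::default()
        .kernel_paths(vec![PathBuf::from("kernels/pagedattention.cu")]) // assumed path
        .out_dir(out_dir.clone())
        .arg("-std=c++17")
        .arg("-O3")
        .build_lib(out_dir.join("libpagedattention.a"));

    println!("cargo:rustc-link-search={}", out_dir.display());
    println!("cargo:rustc-link-lib=pagedattention");
    println!("cargo:rustc-link-lib=dylib=cudart");
}
```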
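
For the weight-loading side, this is roughly what loading local safetensors weights through candle's `VarBuilder` looks like; the `load_weights` helper, the directory scan, and the F16 dtype are illustrative assumptions rather than the PR's exact code.

```rust
use candle_core::{DType, Device, Result};
use candle_nn::VarBuilder;

// Hypothetical loader: memory-map local *.safetensors files through candle's
// default VarBuilder instead of reading tensors one by one.
fn load_weights(weight_path: &str, device: &Device) -> Result<VarBuilder<'static>> {
    // Collect the *.safetensors files under the local weight directory.
    let mut files = Vec::new();
    for entry in std::fs::read_dir(weight_path)? {
        let path = entry?.path();
        if path.extension().is_some_and(|e| e == "safetensors") {
            files.push(path);
        }
    }
    // Memory-mapping is `unsafe` because the files must not be modified
    // while the mapping is alive.
    unsafe { VarBuilder::from_mmaped_safetensors(&files, DType::F16, device) }
}
```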
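
And on the locking change, a small sketch of the `RwLock` pattern, with a hypothetical `SequenceState` type standing in for the real sequence data: reads take a shared lock so concurrent readers don't serialize against each other, and the exclusive lock is held only for mutation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Illustrative sequence table; the field layout is hypothetical.
struct SequenceState {
    tokens: Vec<u32>,
    finished: bool,
}

type SequenceTable = Arc<RwLock<HashMap<u64, SequenceState>>>;

fn is_finished(table: &SequenceTable, seq_id: u64) -> bool {
    // Shared read access: does not block other readers.
    table.read().unwrap().get(&seq_id).is_some_and(|s| s.finished)
}

fn append_token(table: &SequenceTable, seq_id: u64, token: u32) {
    // Exclusive write access, held as briefly as possible.
    if let Some(seq) = table.write().unwrap().get_mut(&seq_id) {
        seq.tokens.push(token);
    }
}

fn main() {
    let table: SequenceTable = Arc::new(RwLock::new(HashMap::new()));
    table
        .write()
        .unwrap()
        .insert(7, SequenceState { tokens: vec![], finished: false });
    append_token(&table, 7, 42);
    assert!(!is_finished(&table, 7));
}
```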
Additionally, I've made flash attention optional because of the significant time it takes to build on every run. I'm still verifying the correctness of the inference results and hope to address any remaining issues in follow-up PRs.
Test case:

`cargo run -- --port 65320 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64`
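
Once the server is up, a chat completion request can be sent against it. This assumes candle-vllm exposes an OpenAI-style `/v1/chat/completions` route on the chosen port; the exact route and payload fields are inferred from the chat completion goal above rather than confirmed by this PR:

`curl http://localhost:65320/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama7b", "messages": [{"role": "user", "content": "Hello!"}]}'`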