Add `--cpu` flag to mistralrs-server #997
Conversation
Code Metrics Report

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          603          536            2           65
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3324            0         2520          804
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4039          626         2520          893
-------------------------------------------------------------------------------
 Rust                  289        87931        78962         1804         7165
 |- Markdown           139         1532           25         1393          114
 (Total)                          89463        78987         3197         7279
===============================================================================
 Total                 438        98546        82031         6840         9675
===============================================================================
The PR looks great; my only comment was about enhancing the docs for one method you used. Perhaps you should remove its usage here, since the default is what is intended to be used?
I understand GEMM being a CUDA-specific feature here, but I'd like to clarify my intention behind adding this.

On a separate note, I had a question regarding the PagedAttention configuration. When using device mapping, I see a message saying "Device mapping or device topology and PagedAttention are incompatible, disabling PagedAttention." Is PagedAttention inherently incompatible with multi-GPU setups, or is it just not yet implemented in mistral.rs? If it's the latter, what would it take to support PagedAttention with device mapping or multi-GPU setups in the future?
I think it'd be best not to include this; the automatic behavior is probably sufficient for now.
This simply hasn't been implemented yet. If you'd like to take it on, I think all that would be necessary is to modify how the PagedAttention KV cache is allocated by providing the device mapping. I'd be happy to accept such a change!
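The maintainer's suggestion could be sketched roughly as follows. This is an illustrative sketch only, not mistral.rs code: the `DeviceMapping` struct, the `blocks_per_device` function, and the proportional-split policy are all hypothetical. The idea is that, instead of allocating the whole PagedAttention KV cache on one GPU, the allocator receives the device mapping and gives each device a share of the cache blocks proportional to the layers it hosts.

```rust
// Hypothetical sketch: distribute PagedAttention KV-cache blocks across
// the devices in a device map, proportionally to the layers each hosts.
// None of these names exist in mistral.rs under this form.

/// How many transformer layers each device is responsible for
/// (index = device ordinal in the mapping).
struct DeviceMapping {
    layers_per_device: Vec<usize>,
}

/// Number of KV-cache blocks to allocate on each device so that a
/// device hosting k of n layers gets roughly k/n of `total_blocks`.
fn blocks_per_device(map: &DeviceMapping, total_blocks: usize) -> Vec<usize> {
    let total_layers: usize = map.layers_per_device.iter().sum();
    let mut alloc: Vec<usize> = map
        .layers_per_device
        .iter()
        .map(|&k| total_blocks * k / total_layers)
        .collect();
    // Assign any blocks lost to integer division to the first device.
    let assigned: usize = alloc.iter().sum();
    if let Some(first) = alloc.first_mut() {
        *first += total_blocks - assigned;
    }
    alloc
}

fn main() {
    // Two GPUs hosting 20 and 12 layers, 1024 cache blocks in total.
    let map = DeviceMapping { layers_per_device: vec![20, 12] };
    let alloc = blocks_per_device(&map, 1024);
    println!("{alloc:?}"); // prints "[640, 384]"
}
```

The per-device block counts would then drive per-device allocations where today a single allocation is made, which is presumably why the current code simply disables PagedAttention when a device mapping is present.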
I've removed the line.
Thanks for the info on PagedAttention!
Thank you so much! I noticed that the formatting check failed; could you please run `cargo fmt --all`?
The Rustfmt check reported a failure, but the suggested fix was identical to the original, so there may be a bug in the tool itself. I modified the comment to satisfy the check manually.
Thank you!
Convenience flag to allow users to run the model purely on the CPU. If you have any feedback or would like to request changes, please let me know!
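To illustrate what such a convenience flag amounts to, here is a minimal sketch, not the actual mistralrs-server implementation (which has its own CLI and device-selection layers, and whose real `Device` type comes from its backend): the flag simply forces device selection to the CPU instead of using an available accelerator. The `Device` enum and `select_device` helper below are hypothetical.

```rust
// Sketch only: a minimal stand-in for a `--cpu` server flag.
// `Device` here is a hypothetical enum, not the real backend type.

#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Gpu(usize),
}

/// Pick a device: `--cpu` overrides any accelerator that may be present,
/// and we also fall back to the CPU when no accelerator exists.
fn select_device(args: &[String], gpu_available: bool) -> Device {
    if args.iter().any(|a| a == "--cpu") || !gpu_available {
        Device::Cpu
    } else {
        Device::Gpu(0)
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let device = select_device(&args, /* gpu_available = */ true);
    println!("running on {device:?}");
}
```

The value of the flag is that it makes the CPU path an explicit, deterministic choice rather than something users can only reach by hiding their GPUs from the process.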