
Add --cpu flag to mistralrs-server #997

Merged: 6 commits into EricLBuehler:master from cdoko:cpu-cli-arg on Dec 24, 2024

Conversation

@cdoko (Contributor) commented Dec 19, 2024

Convenience flag to allow users to run the model purely on the CPU. If you have any feedback or would like to request changes, please let me know!
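For context, a minimal sketch of how such a flag can be wired up with clap and candle_core's device selection. This is an illustration under those assumptions, not the actual PR diff; the `Args` struct and `choose_device` helper are hypothetical names:

```rust
// Hypothetical sketch, not the PR's actual diff: wiring a --cpu flag
// into device selection with clap and candle_core.
use candle_core::{Device, Result};
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Run the model purely on the CPU, even in a CUDA-enabled build.
    #[arg(long)]
    cpu: bool,
}

// Hypothetical helper: pick the device based on the flag.
fn choose_device(args: &Args) -> Result<Device> {
    if args.cpu {
        Ok(Device::Cpu)
    } else {
        // Falls back to CPU when no CUDA device is available.
        Device::cuda_if_available(0)
    }
}

fn main() -> Result<()> {
    let args = Args::parse();
    let device = choose_device(&args)?;
    println!("selected device: {device:?}");
    Ok(())
}
```

With this shape, passing --cpu forces Device::Cpu regardless of what the build supports.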


github-actions bot commented Dec 19, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          603          536            2           65
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3324            0         2520          804
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4039          626         2520          893
-------------------------------------------------------------------------------
 Rust                  289        87931        78962         1804         7165
 |- Markdown           139         1532           25         1393          114
 (Total)                          89463        78987         3197         7279
===============================================================================
 Total                 438        98546        82031         6840         9675
===============================================================================
  

Review thread on mistralrs-server/src/main.rs (outdated, resolved)
@EricLBuehler (Owner) commented:

The PR looks great; my only comment was about enhancing the docs for one method you used. I think perhaps you should remove its usage here, since the default is what is intended to be used?

@cdoko (Contributor, Author) commented Dec 20, 2024

I understand that GEMM is a CUDA-specific feature here, but I'd like to clarify that my intention behind adding .with_gemm_full_precision_f16(args.cpu) was to prevent set_gemm_reduced_precision_f16() from being called when running on the CPU. This is because the cuda feature flag is still present in the build, and set_gemm_reduced_precision_f16() is triggered based on the presence of that flag, not on the actual device in use. By setting gemm_full_precision_f16 to true when --cpu is used, I aimed to avoid unnecessary GEMM configuration. Please let me know if I'm misunderstanding the code.
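For illustration, a hedged sketch of the distinction being discussed: gating the reduced-precision GEMM setting on the device actually in use, rather than on the cuda feature flag alone. set_gemm_reduced_precision_f16 is candle's CUDA-only setter; the configure_gemm wrapper is a hypothetical name, not mistral.rs code:

```rust
use candle_core::Device;

// Hypothetical wrapper illustrating the point: only enable reduced-precision
// f16 GEMM accumulation when we are actually running on a CUDA device,
// not merely because the binary was built with the `cuda` feature.
fn configure_gemm(device: &Device) {
    #[cfg(feature = "cuda")]
    if device.is_cuda() {
        // candle's CUDA-only setter (assumed available under the cuda feature).
        candle_core::cuda::set_gemm_reduced_precision_f16(true);
    }
    // On CPU (e.g. with --cpu) there is no GEMM precision knob to set.
    #[cfg(not(feature = "cuda"))]
    let _ = device;
}
```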

On a separate note, I had a question regarding the PagedAttention configuration. When using device mapping, I see a message saying "Device mapping or device topology and PagedAttention are incompatible, disabling PagedAttention." Is PagedAttention inherently incompatible with multi-GPU setups, or is it just not yet implemented in Mistral.rs? If it's the latter, what would it take to support PagedAttention with device mapping or multi-GPU setups in the future?

@EricLBuehler (Owner) commented:

@cdoko

> By setting gemm_full_precision_f16 to true when --cpu is used, I aimed to avoid unnecessary GEMM configuration. Please let me know if I'm misunderstanding the code.

I think it'd be best not to include this; the automatic behavior is probably sufficient for now.

> On a separate note, I had a question regarding the PagedAttention configuration. When using device mapping, I see a message saying "Device mapping or device topology and PagedAttention are incompatible, disabling PagedAttention." Is PagedAttention inherently incompatible with multi-GPU setups, or is it just not yet implemented in Mistral.rs? If it's the latter, what would it take to support PagedAttention with device mapping or multi-GPU setups in the future?

This simply hasn't been implemented yet. If you'd like to take it on, I think all that would be necessary is to modify how the PagedAttention KV cache is allocated by providing the device mapping. I'd be happy to accept such a change!
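To make the suggested direction concrete, here is a purely hypothetical sketch of allocating the PagedAttention KV cache per device from a layer-to-device mapping; CacheConfig, the block shape, and allocate_kv_cache are invented stand-ins, not mistral.rs APIs:

```rust
use std::collections::HashMap;
use candle_core::{DType, Device, Result, Tensor};

// Hypothetical config: cache block count and per-block tensor shape.
struct CacheConfig {
    num_blocks: usize,
    block_shape: Vec<usize>, // e.g. [block_size, num_kv_heads, head_dim]
}

// Allocate one (key, value) cache pair per layer, each on the device that
// the device mapping assigns to that layer, instead of a single device.
fn allocate_kv_cache(
    cfg: &CacheConfig,
    layer_devices: &HashMap<usize, Device>,
) -> Result<HashMap<usize, (Tensor, Tensor)>> {
    let mut cache = HashMap::new();
    for (&layer, device) in layer_devices {
        let mut shape = vec![cfg.num_blocks];
        shape.extend_from_slice(&cfg.block_shape);
        let k = Tensor::zeros(shape.as_slice(), DType::F16, device)?;
        let v = Tensor::zeros(shape.as_slice(), DType::F16, device)?;
        cache.insert(layer, (k, v));
    }
    Ok(cache)
}
```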

@cdoko (Contributor, Author) commented Dec 22, 2024

> I think it'd be best not to include this; the automatic behavior is probably sufficient for now.

I've removed the line.

> This simply hasn't been implemented yet. If you'd like to take it on, I think all that would be necessary is to modify how the PagedAttention KV cache is allocated by providing the device mapping. I'd be happy to accept such a change!

Thanks for the info on PagedAttention!

@EricLBuehler (Owner) left a comment:

Thank you so much! I noticed that the formatting check failed; can you please run cargo fmt --all?

@cdoko (Contributor, Author) commented Dec 24, 2024

The rustfmt check reported a failure, but the suggested fix was identical to the original, so there may be a bug in the tool itself. I modified the comment to satisfy the check manually.

@EricLBuehler (Owner) left a comment:

Thank you!

@EricLBuehler merged commit 9a208a9 into EricLBuehler:master on Dec 24, 2024
12 checks passed
@cdoko deleted the cpu-cli-arg branch on December 29, 2024