Add `--cpu` flag to mistralrs-server #997
Conversation
Code Metrics Report

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          603          536            2           65
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3324            0         2520          804
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4039          626         2520          893
-------------------------------------------------------------------------------
 Rust                  289        87931        78962         1804         7165
 |- Markdown           139         1532           25         1393          114
 (Total)                          89463        78987         3197         7279
===============================================================================
 Total                 438        98546        82031         6840         9675
===============================================================================
The PR looks great; my only comment was about enhancing the docs for one method you used. Perhaps you should remove its usage here, since the default is what is intended to be used?
I understand GEMM being a CUDA-specific feature here, but I'd like to clarify my intention behind adding this.

On a separate note, I had a question regarding the PagedAttention configuration. When using device mapping, I see a message saying "Device mapping or device topology and PagedAttention are incompatible, disabling PagedAttention." Is PagedAttention inherently incompatible with multi-GPU setups, or is it just not yet implemented in mistral.rs? If it's the latter, what would it take to support PagedAttention with device mapping or multi-GPU setups in the future?
I think it'd be best not to include this; the automatic behavior is probably sufficient for now.
This simply hasn't been implemented yet. If you'd like to take it on, I think all that would be necessary is to modify how the PagedAttention KV cache is allocated by providing the device mapping. I'd be happy to accept such a change!
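The maintainer's suggestion could be sketched roughly as follows. This is an illustrative sketch only, not mistral.rs code: the `DeviceMapping` struct, the `blocks_per_device` function, and the proportional-split policy are all hypothetical. The idea is that, instead of allocating the whole PagedAttention KV cache on one GPU, the allocator receives the device mapping and gives each device a share of the cache blocks proportional to the layers it hosts.

```rust
// Hypothetical sketch: distribute PagedAttention KV-cache blocks across
// the devices in a device map, proportionally to the layers each hosts.
// None of these names exist in mistral.rs under this form.

/// How many transformer layers each device is responsible for
/// (index = device ordinal in the mapping).
struct DeviceMapping {
    layers_per_device: Vec<usize>,
}

/// Number of KV-cache blocks to allocate on each device so that a
/// device hosting k of n layers gets roughly k/n of `total_blocks`.
fn blocks_per_device(map: &DeviceMapping, total_blocks: usize) -> Vec<usize> {
    let total_layers: usize = map.layers_per_device.iter().sum();
    let mut alloc: Vec<usize> = map
        .layers_per_device
        .iter()
        .map(|&k| total_blocks * k / total_layers)
        .collect();
    // Assign any blocks lost to integer division to the first device.
    let assigned: usize = alloc.iter().sum();
    if let Some(first) = alloc.first_mut() {
        *first += total_blocks - assigned;
    }
    alloc
}

fn main() {
    // Two GPUs hosting 20 and 12 layers, 1024 cache blocks in total.
    let map = DeviceMapping { layers_per_device: vec![20, 12] };
    let alloc = blocks_per_device(&map, 1024);
    println!("{alloc:?}"); // prints "[640, 384]"
}
```

The per-device block counts would then drive per-device allocations where today a single allocation is made, which is presumably why the current code simply disables PagedAttention when a device mapping is present.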
I've removed the line.
Thanks for the info on PagedAttention!
Thank you so much! I noticed that the formatting check failed; could you please run `cargo fmt --all`?
The Rustfmt check reported a failure, but the suggested fix was identical to the original, so there may be a bug in the tool itself. I modified the comment to satisfy the check manually.
Thank you!
Convenience flag to allow users to run the model purely on the CPU. If you have any feedback or would like to request changes, please let me know!
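To illustrate what such a convenience flag amounts to, here is a minimal sketch, not the actual mistralrs-server implementation (which has its own CLI and device-selection layers, and whose real `Device` type comes from its backend): the flag simply forces device selection to the CPU instead of using an available accelerator. The `Device` enum and `select_device` helper below are hypothetical.

```rust
// Sketch only: a minimal stand-in for a `--cpu` server flag.
// `Device` here is a hypothetical enum, not the real backend type.

#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Gpu(usize),
}

/// Pick a device: `--cpu` overrides any accelerator that may be present,
/// and we also fall back to the CPU when no accelerator exists.
fn select_device(args: &[String], gpu_available: bool) -> Device {
    if args.iter().any(|a| a == "--cpu") || !gpu_available {
        Device::Cpu
    } else {
        Device::Gpu(0)
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let device = select_device(&args, /* gpu_available = */ true);
    println!("running on {device:?}");
}
```

The value of the flag is that it makes the CPU path an explicit, deterministic choice rather than something users can only reach by hiding their GPUs from the process.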