Support device mapping for Paged Attention #1011

cdoko · 2024-12-28T09:19:23Z

Added support for device mapping in Paged Attention by passing the device list from the mapper to the cache engine. Manual moving was required for certain unmapped tensors in the paged attention forward pass. I have tested the device mapping support on several text models and it appears to be functional.

Memory allocation is currently calculated as if all memory is available on a single device, resulting in that memory being split across devices; ideally, we should calculate available memory per GPU. If this PR is fine, I will work on this next.

Additionally, this feature currently only supports GPU devices due to an error when attempting to use reshape_and_cache() with non-cuda tensors.

Please let me know if you'd like me to revise anything!

github-actions · 2024-12-28T09:20:31Z

Code Metrics Report

  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2338           71          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          605          539            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3333            0         2526          807
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4050          628         2526          896
-------------------------------------------------------------------------------
 Rust                  296        89600        80403         1861         7336
 |- Markdown           143         1593           25         1448          120
 (Total)                          91193        80428         3309         7456
===============================================================================
 Total                 445       100226        83475         6903         9848
===============================================================================

cdoko · 2024-12-29T10:26:05Z

Currently, device-mapped paged attention is approximately 10% slower compared to single device paged attention. I found the slowdown is at least partially due to the overhead of moving tensors to the device on every layer forward pass.

I implemented a temporary workaround in the ModelWeights::forward method by creating copies of the tensors, one for each device, before iterating over the layers, and passing the correct tensor to each layer. This reduced the slowdown to around 5% compared to single device paged attention.

Ideally, we would avoid making copies of the tensors in the model's forward and instead perform this operation in the inputs_processor. However, I have not yet found a way to pass the layer_devices information there.

EricLBuehler

Hi @cdoko! Thanks for the PR.

Currently, device-mapped paged attention is approximately 10% slower compared to single device paged attention. I found the slowdown is at least partially due to the overhead of moving tensors to the device on every layer forward pass.

With the comments I made, hopefully this should be adressed.

However, I have not yet found a way to pass the layer_devices information there.

Please feel free to make whatever changes you find necessary to get this to work!

I think a nice way to do this would be to add an API to the mapper (device_map.rs) to extract all the devices which will be mapped to (inlcuding normal_loading_metadata.real_device). One of my comments suggests a way to utilize this information, and I think this method would work nicely with that. What do you think?

mistralrs-core/src/pipeline/gguf.rs

EricLBuehler · 2024-12-29T22:18:44Z

mistralrs-core/src/paged_attention/layers/paged_attention.rs

@@ -70,6 +70,34 @@ impl PagedAttention {
            input_metadata.slot_mappings.clone()
        };

+        // When device mapping, these Tensors are fixed on the first device, and must be moved to the same device as q,k,v


I see you mentioned a performance penalty:

Currently, device-mapped paged attention is approximately 10% slower compared to single device paged attention. I found the slowdown is at least partially due to the overhead of moving tensors to the device on every layer forward pass.

To avoid this, can you please update PagedAttentionInputMetadata to store all tensors as hashmaps of device location to the actual tensor? Do you think this is a good solution?

This takes up more memory on each GPU but requires only one copy (in the inputs processor) and enables us to remove this section. I'm thinking something similar to this where we create multiple RoPE instantiations on different devices.

Additionally, I just merged #1014. Can you please merge with master to get these new changes, otherwise a conflict will occur with the addition of the changes I requested above.

EricLBuehler · 2024-12-29T22:20:14Z

mistralrs-core/src/models/gemma.rs

@@ -284,6 +284,7 @@ impl Attention {
            v.reshape((b_sz, self.num_kv_heads, q_len, self.head_dim))?
        };

+        let start_offsets_kernel = start_offsets_kernel.to_device(q.device())?;


Can we merge these into RotaryEmbedding in layers.rs?

cdoko · 2024-12-31T10:12:48Z

I updated PagedAttentionInputMetadata to store tensors as hashmaps of device location to the actual tensor. The performance penalty has decreased, but there's still a remaining small penalty of a few percent; I even tested a similar optimization to the attention mask in the model forward pass, since it's also copied every layer, but it didn't give any further performance improvement. I suspect this is due to the unavoidable mapping of xs in the layer forwards. Tensor parallelism might help mitigate the remaining penalty and potentially provide a performance gain.

To implement the updated PagedAttentionInputMetadata, I made some changes to the pipeline trait. Since the struct now uses hashmaps to store tensors, we need to have the mapper available wherever the struct is instantiated. I added the mapper to the MetadataMixin trait, allowing it to be passed from the pipeline to the input processor, and then used to create the PagedAttentionInputMetadata instance. I also noted that most pipelines don't use paged attention, so I used an Option to represent the mapper and passed None for those pipelines.

As a side note, I noticed that Qwen2 and Quantized Llama's RotaryEmbedding uses candle directly - was this intentional or should it be using mistralrs instead, which now includes the Tensor::to_device addition? Other models use the mistralrs one.

EricLBuehler · 2024-12-31T11:43:35Z

Thanks for the updates. I think this is close to merge.

I updated PagedAttentionInputMetadata to store tensors as hashmaps of device location to the actual tensor. The performance penalty has decreased, but there's still a remaining small penalty of a few percent; I even tested a similar optimization to the attention mask in the model forward pass, since it's also copied every layer, but it didn't give any further performance improvement. I suspect this is due to the unavoidable mapping of xs in the layer forwards. Tensor parallelism might help mitigate the remaining penalty and potentially provide a performance gain.

Sounds great! I agree, TP is something we should look into. I think the hard part is integrating it nicely with the existing codebase.

To implement the updated PagedAttentionInputMetadata, I made some changes to the pipeline trait. Since the struct now uses hashmaps to store tensors, we need to have the mapper available wherever the struct is instantiated. I added the mapper to the MetadataMixin trait, allowing it to be passed from the pipeline to the input processor, and then used to create the PagedAttentionInputMetadata instance. I also noted that most pipelines don't use paged attention, so I used an Option to represent the mapper and passed None for those pipelines.

Sounds good.

As a side note, I noticed that Qwen2 and Quantized Llama's RotaryEmbedding uses candle directly - was this intentional or should it be using mistralrs instead, which now includes the Tensor::to_device addition? Other models use the mistralrs one

Yes, can you please update it to use the ones in mistralrs?

cdoko · 2024-12-31T12:17:45Z

Yes, can you please update it to use the ones in mistralrs?

I already did, just wanted to confirm.

Sounds great! I agree, TP is something we should look into. I think the hard part is integrating it nicely with the existing codebase.

Personally I'm interested in speculative decoding for the much higher T/s. I took a look at the SpeculativePipeline implementation some time ago. From what I recall, the cache implementation is incomplete:
https://github.com/EricLBuehler/mistral.rs/blob/master/mistralrs-core/src/pipeline/speculative.rs#L247
Is resolving these cache issues the main blocker for getting speculative decoding working? And what about with PA?

If the PR is ok, I'll probably be working on the VRAM calculations for mistralrs-server next, because currently the flags like --pa-gpu-mem assume single device and don't account for multi-device setups.

EricLBuehler

@cdoko thanks for the PR!

Is resolving these cache issues the main blocker for getting speculative decoding working?

Yes, it's just that using the Normal cache isn't supported yet.

And what about with PA?

I think the main problem is that we need some extensive management of the KV cache (in particular, rolling back the cache) for PA, which I haven't implemented yet.

If the PR is ok, I'll probably be working on the VRAM calculations for mistralrs-server next, because currently the flags like --pa-gpu-mem assume single device and don't account for multi-device setups.

Sounds great!

cdoko added 15 commits December 28, 2024 04:42

Move start_offsets_kernel to correct device

acea1fd

Move start_offsets_kernel to correct device

6032955

Move start_offsets_kernel to correct device

946dfd9

Move start_offsets_kernel to correct device

8a134b9

Move start_offsets_kernel to correct device

6f02469

Move start_offsets_kernel to correct device

86d026a

Move start_offsets_kernel to correct device

d54e767

Move start_offsets_kernel to correct device

819278c

Move start_offsets_kernel to correct device

007f2db

Update starcoder2.rs

e7d2d80

Support device mapping

db0cdc5

Support device mapping

05ed5fe

Support device mapping

937319b

Support device mapping

047ca07

Support device mapping

882f4e7

cdoko added 4 commits December 28, 2024 05:25

format

a51bca6

Support device mapping

db78205

remove mut

e6324b4

remove mut

8fadbbc

EricLBuehler requested changes Dec 29, 2024

View reviewed changes

cdoko marked this pull request as draft December 31, 2024 08:22

cdoko added 7 commits December 31, 2024 04:29

Merge branch 'master' into device-mapping-paged-attn

9d9918d

Add get_unique_devices method

895d0a9

Move tensor for device mapping

7d46900

Add DeviceMapper

aa90ef2

Fix wrong RotaryEmbedding import

8fc40fc

Fix wrong RotaryEmbedding import

ad66e29

Remove unecessary tensor copies

e0719f9

cdoko added 24 commits December 31, 2024 04:54

Add device mapper

8a0177a

Add device mapper

b614be9

Add device mapper

30618da

Add device mapper

0215e86

Add device mapper

44e0559

Add device mapper

095e28a

Add device mapper

80eb294

Add device mapper

ef7ee66

Add device mapper

f269c55

Add device mapper

587b4f7

add device mapper

36d89c9

Remove unecessary tensor move

17f8065

Remove unecessary tensor move

3ca105a

Remove unecessary tensor move

40706f2

Remove unecessary tensor move

62d2126

Remove unecessary tensor move

78189c9

Remove unecessary tensor move

6724ee1

Remove unecessary tensor move

6ca0625

Remove unecessary tensor move

d3b4dae

format

ae3f53e

format

3bf680d

format

7560df9

clippy

83cf77d

format

45aad07

cdoko marked this pull request as ready for review December 31, 2024 10:12

EricLBuehler approved these changes Jan 1, 2025

View reviewed changes

EricLBuehler merged commit c345954 into EricLBuehler:master Jan 1, 2025
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support device mapping for Paged Attention #1011

Support device mapping for Paged Attention #1011

cdoko commented Dec 28, 2024

github-actions bot commented Dec 28, 2024 •

edited

Loading

cdoko commented Dec 29, 2024

EricLBuehler left a comment

EricLBuehler Dec 29, 2024

EricLBuehler Dec 29, 2024

cdoko commented Dec 31, 2024

EricLBuehler commented Dec 31, 2024

cdoko commented Dec 31, 2024

EricLBuehler left a comment

Support device mapping for Paged Attention #1011

Support device mapping for Paged Attention #1011

Conversation

cdoko commented Dec 28, 2024

github-actions bot commented Dec 28, 2024 • edited Loading

cdoko commented Dec 29, 2024

EricLBuehler left a comment

Choose a reason for hiding this comment

EricLBuehler Dec 29, 2024

Choose a reason for hiding this comment

EricLBuehler Dec 29, 2024

Choose a reason for hiding this comment

cdoko commented Dec 31, 2024

EricLBuehler commented Dec 31, 2024

cdoko commented Dec 31, 2024

EricLBuehler left a comment

Choose a reason for hiding this comment

github-actions bot commented Dec 28, 2024 •

edited

Loading