[WIP] add deepseek-v3 #35926
base: main
Conversation
Hi @bzantium, this looks great so far! We'll need added tests for the model + a green CI, and then feel free to ping me to assign a reviewer, or if you have any problems with the port.
Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance (see the sketch below)!
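A minimal, self-contained sketch of the inheritance pattern (toy classes for illustration only, not the actual transformers test suites):

```python
# Illustrative only: subclasses inherit the shared tests and override just the
# model factory, similar in spirit to how the gemma2 tests reuse a base tester.
import unittest
import torch

class DummyModelTest(unittest.TestCase):
    """Base test: subclasses only override make_model()."""

    def make_model(self):
        return torch.nn.Identity()

    def test_forward_shape(self):
        model = self.make_model()
        x = torch.randn(2, 4)
        self.assertEqual(model(x).shape, x.shape)

class ScaledModelTest(DummyModelTest):
    # Inherits test_forward_shape; only the model under test changes.
    def make_model(self):
        return torch.nn.Linear(4, 4)

if __name__ == "__main__":
    unittest.main()
```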
@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!
One big thing would be …
This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the …
Thanks for the detailed comments; following them, I revised the code quite a lot and fixed some mismatches between the original code and this PR. I checked that the outputs from both are the same. I think I can now add test code. For …
to: @ArthurZucker
I did not try training yet, since this PR currently only supports inference. I plan to add training code afterwards.
Thanks for the attention. I don't plan to implement multi-token prediction this time, since there are no additional parameters for the extra layer that multi-token prediction requires.
The weights are provided. DeepSeek gives a neat description of this. Since …
I missed this! Thanks for the reference; let me check how to deal with it.
Yep, a lot better! For training you can also have a look at Switch Transformers and Granite MoE, where we do have the auxiliary losses etc. (a sketch follows below).
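For reference, a hedged sketch of a Switch-Transformers-style load-balancing auxiliary loss (function name and shapes are illustrative; this is not the loss code in this PR):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Pushes the router toward a uniform token-to-expert distribution.
    router_logits: (num_tokens, num_experts)."""
    probs = torch.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert under top-1 routing.
    dispatch = F.one_hot(probs.argmax(dim=-1), num_experts).float().mean(dim=0)
    # Mean routing probability assigned to each expert.
    prob_mass = probs.mean(dim=0)
    # Minimized when both vectors are uniform at 1/num_experts.
    return num_experts * torch.sum(dispatch * prob_mass)
```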
```
@@ -145,6 +133,12 @@ class DeepseekV3Config(PretrainedConfig):

    model_type = "deepseek_v3"
    keys_to_ignore_at_inference = ["past_key_values"]
    # Default tensor parallel plan for base model `DeepseekV3Model`
```
I think the query, key and value projections also need to appear here, no? 🤗
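Something like the following, perhaps (a hypothetical sketch; the module names and plan styles are assumptions, not the final plan for this PR):

```python
# Hypothetical default TP plan including the attention projections;
# the exact keys/styles for DeepseekV3 are still to be decided.
base_model_tp_plan = {
    "layers.*.self_attn.q_b_proj": "colwise",
    "layers.*.self_attn.kv_b_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
```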
I'm not sure how TP is implemented here, but the multi-head latent attention implementation makes it hard to apply TP, because they use lora_a -> RMSNorm -> lora_b -> split, and so forth.
```python
# Queries: down-project (lora_a), RMSNorm, up-project (lora_b), then split into
# non-rotary and rotary parts.
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(hidden_shape).transpose(1, 2)
q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Keys/values: a single compressed latent, with the rotary key part split off
# before the up-projection.
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(*input_shape, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(hidden_shape).transpose(1, 2)
```
Does colwise_rep do an all_gather after the mm operation?
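For context, a conceptual sketch of what that would mean (an assumption about the behavior, not the actual transformers TP code): each rank computes a slice of the output features, then an all_gather replicates the full output.

```python
import torch
import torch.distributed as dist

def colwise_rep_linear(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """weight_shard: (out_features // world_size, in_features) on this rank.
    The local matmul yields a slice of the output features; all_gather then
    rebuilds the full tensor on every rank (requires an initialized group)."""
    local_out = x @ weight_shard.t()
    gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)
```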
```python
expert_outputs = []
current_pos = 0

for expert_idx, num_tokens in enumerate(tokens_per_expert):
```
Nice. IDK how many experts there actually are, but given how important the model is, we might want to use a more optimized MoE implementation. The best we have in transformers is for SwitchTransformers, where we skip experts that won't receive any tokens (a sketch of that pattern is below).
One thing also: this is not compile-compatible "yet"?
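A hedged sketch of the skip-empty-experts dispatch pattern (names like experts and router_indices are illustrative, not this PR's code):

```python
import torch

def moe_forward(hidden_states: torch.Tensor, router_indices: torch.Tensor, experts):
    """hidden_states: (num_tokens, hidden); router_indices: (num_tokens,) top-1 ids."""
    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_mask = router_indices == expert_idx
        if not token_mask.any():  # skip experts that received no tokens
            continue
        out[token_mask] = expert(hidden_states[token_mask])
    return out
```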
I will check this out! thanks :)
Do you want me to jump on the PR and help you merge this faster @bzantium? 🤗
Of course! As you commented, it looks like there's still a lot of work left (code optimization, training code, test code, and multi-token prediction if possible).
What does this PR do?
This PR adds the code for DeepSeek-V3.
The code relies heavily on the original remote code.
Resolves: #35425
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: DeepSeek V3 Support #35425
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
to: @ArthurZucker