
[WIP] add deepseek-v3 #35926

Open · wants to merge 22 commits into main

Conversation

@bzantium (Contributor) commented Jan 28, 2025

What does this PR do?

This PR adds the code for DeepSeek-V3.
The code relies heavily on the original remote code.

resolved: #35425

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

to: @ArthurZucker

@Rocketknight1 (Member)

Hi @bzantium, this looks great so far! We'll need added tests for the model + a green CI, and then feel free to ping me to assign a reviewer, or if you have any problems with the port.

@bzantium changed the title from "[WIP] add deepseekv3" to "[WIP] add deepseek-v3" on Jan 29, 2025
@ArthurZucker (Collaborator) left a comment


Ultra kudos! It's super nice
Mostly missing tests, here you can use a similar approach to the gemma2 tests, which use inheritance!
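
For reference, a minimal sketch of that inheritance-based approach, in the style of the gemma2 tests. All DeepSeek-V3 class names and tester attributes below are assumptions about the final port, not this PR's actual test code:

    # tests/models/deepseek_v3/test_modeling_deepseek_v3.py (sketch)
    import unittest

    from transformers import DeepseekV3Config, is_torch_available
    from transformers.testing_utils import require_torch

    from ...test_configuration_common import ConfigTester
    from ..gemma.test_modeling_gemma import GemmaModelTest, GemmaModelTester

    if is_torch_available():
        from transformers import DeepseekV3ForCausalLM, DeepseekV3Model


    class DeepseekV3ModelTester(GemmaModelTester):
        # reuse the Gemma tester, swapping in the DeepSeek-V3 classes
        if is_torch_available():
            config_class = DeepseekV3Config
            model_class = DeepseekV3Model
            for_causal_lm_class = DeepseekV3ForCausalLM


    @require_torch
    class DeepseekV3ModelTest(GemmaModelTest, unittest.TestCase):
        all_model_classes = (DeepseekV3Model, DeepseekV3ForCausalLM) if is_torch_available() else ()

        def setUp(self):
            self.model_tester = DeepseekV3ModelTester(self)
            self.config_tester = ConfigTester(self, config_class=DeepseekV3Config, hidden_size=37)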

(outdated, resolved inline review comments on src/transformers/models/deepseek_v3/modular_deepseek_v3.py)
@cuichenx

@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so how many GPUs did you need? Thanks!

@ArthurZucker (Collaborator) commented Jan 29, 2025

One big thing would be TP support: the base_tp_plan would probably need to be updated to make sure each MLP's gate/up/down projections have the correct order, unless the direct usage of dist removes this need.

@casper-hansen

This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the num_nextn_predict_layers attribute in the config?

@bzantium (Contributor, Author) commented Jan 30, 2025

Thanks for the detailed comments; following them, I revised the code quite a lot and fixed some mismatches between the original code and this PR. I checked that the outputs from both are the same, so I think I can now add test code. For TP support, I think it can be applied only to the mlp layers but not to self_attn, because the attention uses operations like split on the hidden_dim. I added the following:

    base_model_tp_plan = {
        "layers.*.gate_proj": "colwise",
        "layers.*.up_proj": "colwise",
        "layers.*.down_proj": "rowwise",
    }

to: @ArthurZucker
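
For context, a plan like this is picked up by the tensor-parallel loading path in recent transformers releases. A rough usage sketch, not part of this PR (the checkpoint id is a placeholder):

    # Launch with e.g.: torchrun --nproc-per-node 8 run_deepseek_tp.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-V3"  # placeholder checkpoint id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        tp_plan="auto",  # shards weights according to the config's tensor parallel plan
    )

    prompt = "DeepSeek-V3 is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))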

@bzantium (Contributor, Author) commented Jan 30, 2025

> @bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so how many GPUs did you need? Thanks!

I have not tried training yet, since this PR currently only supports inference. I plan to add training code afterwards.

@bzantium (Contributor, Author)

> This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the num_nextn_predict_layers attribute in the config?

Thanks for the attention. I don't plan to implement multi-token prediction this time, since there are no additional parameters for the extra layer that multi-token prediction requires.

@casper-hansen

> This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the num_nextn_predict_layers attribute in the config?
>
> Thanks for the attention. I don't plan to implement multi-token prediction this time, since there are no additional parameters for the extra layer that multi-token prediction requires.

The weights are provided. DeepSeek provided a neat description of this. Since num_nextn_predict_layers=1 and num_hidden_layers=61, the MTP module is layer 61.

[image: DeepSeek's description of the MTP module weights]
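
Side note on handling those weights: a rough sketch of separating the MTP block out of a loaded state dict, assuming (not confirmed by this PR) that the original checkpoint stores the MTP module under the next layer index, i.e. model.layers.61.*:

    import re


    def split_mtp_weights(state_dict, num_hidden_layers=61, num_nextn_predict_layers=1):
        # Split the (assumed) MTP layer(s) out of the main decoder weights.
        main_weights, mtp_weights = {}, {}
        mtp_indices = {str(num_hidden_layers + i) for i in range(num_nextn_predict_layers)}
        for key, tensor in state_dict.items():
            match = re.match(r"model\.layers\.(\d+)\.", key)
            if match and match.group(1) in mtp_indices:
                mtp_weights[key] = tensor  # multi-token prediction module (e.g. layer 61)
            else:
                main_weights[key] = tensor  # regular decoder layers, embeddings, lm head
        return main_weights, mtp_weights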

@bzantium (Contributor, Author)

> The weights are provided. DeepSeek provided a neat description of this. Since num_nextn_predict_layers=1 and num_hidden_layers=61, the MTP module is layer 61.
>
> [image: DeepSeek's description of the MTP module weights]

I missed this! Thanks for the reference; let me check how to deal with it.

@ArthurZucker (Collaborator) left a comment


Yep, a lot better! For training you can also have a look at SwitchTransformers and Granite MoE, where we do have the auxiliary losses, etc.
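
For reference, a rough sketch of the SwitchTransformers-style load-balancing auxiliary loss mentioned here; variable names are illustrative, not taken from this PR:

    import torch


    def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
        # router_logits: (num_tokens, num_experts)
        router_probs = torch.softmax(router_logits, dim=-1)
        expert_index = router_probs.argmax(dim=-1)
        # fraction of tokens dispatched to each expert
        tokens_per_expert = torch.nn.functional.one_hot(expert_index, num_experts).float().mean(dim=0)
        # mean routing probability assigned to each expert
        router_prob_per_expert = router_probs.mean(dim=0)
        # encourages a uniform split of tokens and probability mass across experts
        return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)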

    @@ -145,6 +133,12 @@ class DeepseekV3Config(PretrainedConfig):

        model_type = "deepseek_v3"
        keys_to_ignore_at_inference = ["past_key_values"]
        # Default tensor parallel plan for base model `DeepseekV3Model`
Collaborator


I think the query, key and value projections also need to appear here, no? 🤗

@bzantium (Contributor, Author) commented Jan 30, 2025


I'm not sure how TP is implemented here, but the multi-latent attention implementation makes it hard to apply TP because it goes lora_a -> RMSNorm -> lora_b -> split, and so forth:

    # query path: low-rank down-projection (q_a_proj), RMSNorm, up-projection (q_b_proj),
    # then split into the no-positional-encoding part and the rotary part
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(hidden_shape).transpose(1, 2)
    q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)

    # key/value path: one compressed projection is split into the latent kv part and the
    # rotary key part before the latent part is normalized and up-projected per head
    compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
    compressed_kv, k_pe = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
    k_pe = k_pe.view(*input_shape, 1, self.qk_rope_head_dim).transpose(1, 2)
    kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(hidden_shape).transpose(1, 2)

Does colwise_rep do an all_gather after the matmul operation?

    expert_outputs = []
    current_pos = 0

    for expert_idx, num_tokens in enumerate(tokens_per_expert):
Collaborator


Nice. IDK how many experts are actually there, but given how important the model is, we might want to use a more optimized MoE implementation. The best we have in transformers is SwitchTransformers, where we skip experts that won't receive any tokens.

One other thing: this is not compile compatible "yet"?
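
For illustration, a minimal sketch of the SwitchTransformers-style "skip empty experts" dispatch being referred to; the tensor and module names are placeholders, not this PR's actual variables:

    import torch
    import torch.nn as nn


    def moe_forward(hidden_states: torch.Tensor, router_mask: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim); router_mask: (num_tokens, num_experts) one-hot routing
        next_states = torch.zeros_like(hidden_states)
        for expert_idx, expert in enumerate(experts):
            token_mask = router_mask[:, expert_idx].bool()
            if not token_mask.any():
                # skip experts that received no tokens, avoiding useless matmuls
                continue
            next_states[token_mask] = expert(hidden_states[token_mask]).to(next_states.dtype)
        return next_states

The data-dependent skip (`if not token_mask.any(): continue`) is also part of why such loops are not compile friendly out of the box.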

Contributor Author


I will check this out! thanks :)

@ArthurZucker (Collaborator)

Do you want me to jump on the PR and help you merge this faster @bzantium ? 🤗

@bzantium (Contributor, Author) commented Jan 30, 2025

> Do you want me to jump on the PR and help you merge this faster @bzantium ? 🤗

Of course! As you commented, it looks like there's still a lot of work left (code optimization, training code, test code, and multi-token prediction if possible).
to: @ArthurZucker

Successfully merging this pull request may close these issues: DeepSeek V3 Support.