[WIP] add deepseek-v3 #35926
base: main
Conversation
Hi @bzantium, this looks great so far! We'll need added tests for the model + a green CI, and then feel free to ping me to assign a reviewer, or if you have any problems with the port.
Ultra kudos! It's super nice.
Mostly missing tests; here you can use a similar approach to the gemma2 tests, which use inheritance (see the sketch below)!
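A minimal, self-contained sketch of the inheritance pattern (toy classes for illustration only, not the actual transformers test suites):

```python
# Illustrative only: subclasses inherit the shared tests and override just the
# model factory, similar in spirit to how the gemma2 tests reuse a base tester.
import unittest
import torch

class DummyModelTest(unittest.TestCase):
    """Base test: subclasses only override make_model()."""

    def make_model(self):
        return torch.nn.Identity()

    def test_forward_shape(self):
        model = self.make_model()
        x = torch.randn(2, 4)
        self.assertEqual(model(x).shape, x.shape)

class ScaledModelTest(DummyModelTest):
    # Inherits test_forward_shape; only the model under test changes.
    def make_model(self):
        return torch.nn.Linear(4, 4)

if __name__ == "__main__":
    unittest.main()
```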
@bzantium Thanks for the amazing work! I was wondering if you were able to train V3 with FSDP? If so, how many GPUs did you need? Thanks!
One big thing would be …
This is great work and I'm looking forward to trying it out. For multi-token prediction, is this planned to be implemented in this PR via the …
Thanks for the detailed comments; following them, I revised the code quite a lot and fixed some mismatches between the original code and this PR. I checked that the outputs from both are the same. I think I can now add test code. For …
to: @ArthurZucker
I did not try training yet, since this PR currently only supports inference. I plan to add training code afterwards.
Thanks for the attention. I don't plan to implement multi-token prediction this time, since there are no additional parameters for the extra layer that multi-token prediction requires.
The weights are provided. DeepSeek gives a neat description of this. Since …
I missed this! Thanks for the reference; let me check how to deal with it.
Yep, a lot better! For training you can also have a look at Switch Transformers and Granite MoE, where we do have the auxiliary losses etc. (a sketch follows below).
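For reference, a hedged sketch of a Switch-Transformers-style load-balancing auxiliary loss (function name and shapes are illustrative; this is not the loss code in this PR):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Pushes the router toward a uniform token-to-expert distribution.
    router_logits: (num_tokens, num_experts)."""
    probs = torch.softmax(router_logits, dim=-1)
    # Fraction of tokens dispatched to each expert under top-1 routing.
    dispatch = F.one_hot(probs.argmax(dim=-1), num_experts).float().mean(dim=0)
    # Mean routing probability assigned to each expert.
    prob_mass = probs.mean(dim=0)
    # Minimized when both vectors are uniform at 1/num_experts.
    return num_experts * torch.sum(dispatch * prob_mass)
```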
```
@@ -145,6 +133,12 @@ class DeepseekV3Config(PretrainedConfig):

    model_type = "deepseek_v3"
    keys_to_ignore_at_inference = ["past_key_values"]
    # Default tensor parallel plan for base model `DeepseekV3Model`
```
I think the query, key and value projections also need to appear here, no? 🤗
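Something like the following, perhaps (a hypothetical sketch; the module names and plan styles are assumptions, not the final plan for this PR):

```python
# Hypothetical default TP plan including the attention projections;
# the exact keys/styles for DeepseekV3 are still to be decided.
base_model_tp_plan = {
    "layers.*.self_attn.q_b_proj": "colwise",
    "layers.*.self_attn.kv_b_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
```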
I'm not sure how TP is implemented here, but the multi-head latent attention implementation makes it hard to apply TP, because they use lora_a -> RMSNorm -> lora_b -> split, and so forth.
```python
# Queries: down-project (lora_a), RMSNorm, up-project (lora_b), then split into
# non-rotary and rotary parts.
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states))).view(hidden_shape).transpose(1, 2)
q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Keys/values: a single compressed latent, with the rotary key part split off
# before the up-projection.
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(*input_shape, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(hidden_shape).transpose(1, 2)
```
Does colwise_rep do an all_gather after the mm operation?
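For context, a conceptual sketch of what that would mean (an assumption about the behavior, not the actual transformers TP code): each rank computes a slice of the output features, then an all_gather replicates the full output.

```python
import torch
import torch.distributed as dist

def colwise_rep_linear(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """weight_shard: (out_features // world_size, in_features) on this rank.
    The local matmul yields a slice of the output features; all_gather then
    rebuilds the full tensor on every rank (requires an initialized group)."""
    local_out = x @ weight_shard.t()
    gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)
```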
```python
expert_outputs = []
current_pos = 0

for expert_idx, num_tokens in enumerate(tokens_per_expert):
```
Nice. IDK how many experts there actually are, but given how important the model is, we might want to use a more optimized MoE implementation. The best we have in transformers is for SwitchTransformers, where we skip experts that won't receive any tokens (a sketch of that pattern is below).
One thing also: this is not compile-compatible "yet"?
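A hedged sketch of the skip-empty-experts dispatch pattern (names like experts and router_indices are illustrative, not this PR's code):

```python
import torch

def moe_forward(hidden_states: torch.Tensor, router_indices: torch.Tensor, experts):
    """hidden_states: (num_tokens, hidden); router_indices: (num_tokens,) top-1 ids."""
    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_mask = router_indices == expert_idx
        if not token_mask.any():  # skip experts that received no tokens
            continue
        out[token_mask] = expert(hidden_states[token_mask])
    return out
```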
I will check this out! thanks :)
Do you want me to jump on the PR and help you merge this faster @bzantium? 🤗
Of course! As you commented, it looks like there's still a lot of work left (code optimization, training code, test code, and multi-token prediction if possible).
What does this PR do?
This PR adds the code for DeepSeek-V3.
The code relies heavily on the original remote code.
Resolves: #35425
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: DeepSeek V3 Support #35425
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
to: @ArthurZucker