why deepspeed transformer only adjust init range on rank 0? #253
Comments
@LiweiPeng : The adjust init range happens on rank 0, but since the weights are broadcast to all the ranks in deepspeed.initialize, it will be propagated to all the ranks.
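A minimal sketch of the pattern being described (not the actual DeepSpeed source; the function name, arguments, and scaling rule here are illustrative assumptions): the init-range adjustment is gated on rank 0, and the parameter broadcast performed later during engine setup propagates the adjusted weights to every rank.

```python
import torch
import torch.distributed as dist

def init_transformer_weights_sketch(weight: torch.Tensor,
                                     num_layers: int,
                                     adjust_init_range: bool) -> torch.Tensor:
    """Illustrative only: initialize a weight tensor, shrinking the std on rank 0
    when adjust_init_range is set; the other ranks pick up the adjusted values
    once parameters are broadcast from rank 0 (e.g. by deepspeed.initialize)."""
    std = 0.02
    if adjust_init_range and dist.get_rank() == 0:
        # hypothetical depth-based rescaling, applied only on rank 0
        std = std / (2.0 * num_layers) ** 0.5
    weight.data.normal_(mean=0.0, std=std)
    return weight
```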
@samyam Thanks for the quick response. Since the DeepSpeed transformer code is an excellent feature on its own, in case someone wants to use only the transformer code without the rest of DeepSpeed's functionality, is the adjust init range still applied only on rank 0? Thanks.
@LiweiPeng I think so. Regardless of what you use for data parallelism (DeepSpeed or something else), it must do a broadcast at the beginning of training to make sure the parameters across different ranks are in sync. The one issue I can imagine is if the broadcast happens from an arbitrary source rank rather than rank 0, but that would be a pretty weird implementation decision.
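For concreteness, a minimal sketch (assuming a PyTorch model and an already-initialized torch.distributed process group; the helper name is made up) of the rank-0 broadcast that a data-parallel setup typically performs at the start of training, which is what makes the rank-0-only adjustment reach every rank:

```python
import torch
import torch.distributed as dist

def sync_parameters_from_rank0(model: torch.nn.Module) -> None:
    """Overwrite every rank's parameters and buffers with rank 0's copies."""
    for param in model.parameters():
        dist.broadcast(param.data, src=0)
    for buf in model.buffers():
        dist.broadcast(buf.data, src=0)
```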
@LiweiPeng I do understand, though, that it's a bit confusing. Do you have a suggestion for clarification other than just adding a comment?
Thanks for the clarification. I recommend adding a comment for this. I couldn't figure out a better way because it depends on the user's implementation.
Thanks for open-sourcing your DeepSpeed transformer code.
In deepspeed_cuda.py, in the init_transformer_weights function, the adjust init range feature is applied to rank 0 only. Can you explain why it is applied on rank 0 only? Shouldn't it be applied to all ranks? Thanks.