Clarification on Zero Initialization in FinalLayer of DiT Model #82

Open
denemmy opened this issue Apr 13, 2024 · 3 comments

@denemmy

denemmy commented Apr 13, 2024

Hello Facebook Research Team,

I am exploring DiT as implemented in your repository and came across the weight initialization strategy for the FinalLayer, specifically in this section of the code.

The weights for the linear layer in the FinalLayer are initialized to zeros:

nn.init.constant_(self.final_layer.linear.weight, 0)
nn.init.constant_(self.final_layer.linear.bias, 0)

Typically, neural network weights are initialized with non-zero values to break symmetry and ensure diverse feature learning. While I understand the rationale behind zero initialization of modulation weights in other parts of the model, the zero initialization in this linear layer caught my attention.

Is the zero initialization of weights in this non-modulation linear layer intentional, and could you provide any insights into this choice?

Thank you for any information or insights you can provide!

Best regards,
Danil.

@tanghengjian

Zero initialization may help make the model's training stable and reproducible?

@shy19960518

Same confusion here. The most surprising thing is that the model still learns well in my experiments. Can someone explain? ^ ^

@zhaohm14

Hi Danil,

I had the same question. Although I don't understand what benefit zero initialization of final_layer.linear provides, I believe it should not cause a symmetry problem that hinders training.

The symmetry problem occurs most often in multi-layer networks with hidden nodes. If all hidden nodes in a layer share the same weights because of identical initialization, they compute the same activations and receive the same gradients during backpropagation, so the hidden layer effectively functions as a single node.
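
As a rough illustration of that failure mode, here is a minimal pure-Python sketch. The tiny network (`y = W2 . tanh(W1 x)`), the squared-error loss, and the helper name `hidden_grads` are all assumptions made for this example and have nothing to do with the DiT code itself. Every hidden unit starts with the same constant weights, and every row of the gradient for `W1` comes out identical, so the units can never diverge from one another.

```python
import math

def hidden_grads(W1, W2, x, target):
    """Gradient of 0.5 * (y - target)**2 w.r.t. W1, where y = W2 . tanh(W1 x)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # hidden pre-activations
    h = [math.tanh(p) for p in pre]                              # hidden activations
    y = sum(w * hj for w, hj in zip(W2, h))                      # scalar output
    dy = y - target                                              # dL/dy
    # Backpropagate through the output weights and the tanh nonlinearity.
    dpre = [dy * w2 * (1.0 - hj * hj) for w2, hj in zip(W2, h)]
    # One gradient row per hidden unit.
    return [[dp * xi for xi in x] for dp in dpre]

x = [1.0, -2.0, 0.5]
W1 = [[0.5] * 3 for _ in range(4)]  # four hidden units, identical constant init
W2 = [0.5] * 4                      # identical output weights

grad_W1 = hidden_grads(W1, W2, x, target=1.0)
rows_identical = all(row == grad_W1[0] for row in grad_W1)
print(rows_identical)
```

The gradients here are nonzero, so the network does learn something, but all four hidden units receive the same update and stay copies of each other.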

To avoid the symmetry problem in neural networks, at each layer, either the inputs $I$ or the gradients with respect to the outputs $\frac{\partial L}{\partial O}$ must not be symmetric. This is because the gradient with respect to the weights is calculated as $\frac{\partial L}{\partial W} = I^T \cdot \frac{\partial L}{\partial O}$, and asymmetry in either term ensures diverse weight updates.
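
Written entrywise, under the assumption that the layer computes $O = I W$ for a batch of $N$ inputs stored as the rows of $I$ (the batch index $n$ and the indices $i$, $j$ are notation introduced here), the formula reads:

$$
\left(\frac{\partial L}{\partial W}\right)_{ij} = \sum_{n=1}^{N} I_{ni} \left(\frac{\partial L}{\partial O}\right)_{nj}
$$

Rows $i$ and $i'$ of the update generally differ when input features $i$ and $i'$ differ across the batch, and columns $j$ and $j'$ generally differ when the gradients for outputs $j$ and $j'$ differ. The current value of $W$ enters only indirectly, through $\frac{\partial L}{\partial O}$.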

However, there is no hidden layer in final_layer.linear or adaLN_modulation. Although the outputs and weights may be symmetric at the first step, the inputs are not. This asymmetry in the inputs ensures that the weights receive different updates, which breaks the symmetry.
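
A small numeric check of this argument, as a sketch only: it assumes a bias-free linear layer `O = I W`, a squared-error loss (so that the output gradient is `O - T`), and made-up input and target values. The helper names `matmul` and `transpose` are hypothetical and written out in plain Python so the snippet has no dependencies.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Assumed toy data: a batch of 2 inputs with 3 features, and 2 output channels.
inputs = [[1.0, 2.0, 3.0],
          [0.5, -1.0, 2.0]]
targets = [[1.0, -1.0],
           [2.0, 0.0]]
weights = [[0.0, 0.0] for _ in range(3)]  # zero-initialized 3x2 weight matrix

outputs = matmul(inputs, weights)          # all zeros at initialization
# Squared-error loss (an assumption): dL/dO = O - T.
d_outputs = [[o - t for o, t in zip(o_row, t_row)]
             for o_row, t_row in zip(outputs, targets)]

d_weights = matmul(transpose(inputs), d_outputs)   # dL/dW = I^T . dL/dO
d_inputs = matmul(d_outputs, transpose(weights))   # dL/dI = dL/dO . W^T

print(d_weights)
print(d_inputs)
```

In this toy setup every row and every column of `d_weights` is different even though all weights start at zero, so the first update already breaks the symmetry. The gradient passed back to the layer's input (`d_inputs`) is zero on this first step because it is multiplied by the zero weights; it becomes nonzero once the weights have been updated.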
