Hello Facebook Research Team,

I am exploring the DiT model as implemented in your repository and came across the weight initialization strategy for the FinalLayer, specifically in this section of the code.
The weights for the linear layer in the FinalLayer are initialized to zeros:
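```python
# Roughly the lines in question (paraphrased from initialize_weights() in
# models.py; the exact code in the repo may differ slightly):
nn.init.constant_(self.final_layer.linear.weight, 0)
nn.init.constant_(self.final_layer.linear.bias, 0)
```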
Typically, neural network weights are initialized with non-zero values to break symmetry and ensure diverse feature learning. While I understand the rationale behind zero initialization of modulation weights in other parts of the model, the zero initialization in this linear layer caught my attention.
Is the zero initialization of weights in this non-modulation linear layer intentional, and could you provide any insights into this choice?
Thank you for any information or insights you can provide!
Best regards,
Danil.
I had the same confusion. Although I don't understand what the zero initialization of final_layer.linear actually buys, I believe it should not cause a symmetry problem that hinders training.
The symmetry problem arises in multi-layer networks with hidden nodes. During backpropagation, if all hidden nodes in the same layer carry identical weights (because of identical initialization) and therefore compute identical activations, they all receive identical gradient updates, and the hidden layer effectively functions as a single node.
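A toy example (made up for illustration, not code from the repo) shows this: with a constant non-zero initialization, every hidden unit of a small MLP receives exactly the same gradient, so the units remain copies of each other.

```python
import torch
import torch.nn as nn

# Two-layer MLP with every weight set to the same constant.
hidden = nn.Linear(4, 3)   # hidden layer with 3 units
out = nn.Linear(3, 1)
for layer in (hidden, out):
    nn.init.constant_(layer.weight, 0.5)
    nn.init.constant_(layer.bias, 0.0)

x = torch.randn(8, 4)      # asymmetric inputs
loss = out(torch.relu(hidden(x))).pow(2).mean()
loss.backward()

# All three rows of the gradient are identical, so the hidden units always
# receive the same update and never differentiate from one another.
print(hidden.weight.grad)
```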
To avoid the symmetry problem in neural networks, at each layer, either the inputs $I$ or the gradients with respect to the outputs $\frac{\partial L}{\partial O}$ must not be symmetric. This is because the gradient with respect to the weights is calculated as $\frac{\partial L}{\partial W} = I^T \cdot \frac{\partial L}{\partial O}$, and asymmetry in either term ensures diverse weight updates.
However, there is no hidden layer inside final_layer.linear or adaLN_modulation: the zero-initialized weights sit in an output-facing linear layer that acts directly on its input. Although the weights and outputs are symmetric (all zeros) at the first step, the inputs are not symmetric, so the very first gradient update already differentiates the weights and breaks the symmetry.
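A quick sanity check of that claim (again a toy example, not the DiT code): a zero-initialized linear layer that receives asymmetric inputs gets a non-degenerate gradient on its very first backward pass, matching $\frac{\partial L}{\partial W} = I^T \cdot \frac{\partial L}{\partial O}$ (transposed to PyTorch's (out, in) weight layout).

```python
import torch
import torch.nn as nn

linear = nn.Linear(6, 4)
nn.init.constant_(linear.weight, 0)   # same zero init as questioned above
nn.init.constant_(linear.bias, 0)

I = torch.randn(16, 6)                # stand-in for asymmetric block outputs
target = torch.randn(16, 4)
O = linear(I)                         # all zeros at step 0
loss = (O - target).pow(2).mean()
loss.backward()

# The weight gradient equals dL/dO^T @ I (PyTorch stores weights as (out, in)),
# and its entries differ because the inputs differ, so the weights become
# distinct after the first optimizer step.
grad_O = (2 * (O - target) / O.numel()).detach()
print(torch.allclose(linear.weight.grad, grad_O.t() @ I))  # True
print(linear.weight.grad)
```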