Replies: 6 comments
-
This is the expected behavior, not an error. TP will add the fc2_output from different TP ranks to get the final result.
-
I understand that TP adds the fc2_output from different TP ranks to get the final result. However, my concern is with the correctness of the intermediate output from the activation layer. If this intermediate output is incorrect, then the reduced final result will also be incorrect. The following is the activation computation func (https://github.com/NVIDIA/Megatron-LM/blob/core_v0.7.0/megatron/core/transformer/moe/experts.py#L46):
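Paraphrased, the referenced code looks like this (see the linked file for the exact implementation; self.config.activation_func is typically SiLU or GELU):

```python
if self.config.gated_linear_unit:

    def glu(x):
        # Chunk the (rank-local) fc1 output in half along the last dimension
        # and gate one half with the activation of the other.
        x = torch.chunk(x, 2, dim=-1)
        return self.config.activation_func(x[0]) * x[1]

    self.activation_func = glu
else:
    self.activation_func = self.config.activation_func
```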
This function yields different results when TP is applied versus when it's not, even after reduction. Consider the following example, with TP degree == 2 (mat2 and mat3 being the inputs on TP ranks 0 and 1):
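A self-contained sketch of what such a comparison might look like (mat1, the shapes, and SiLU are illustrative assumptions; mat2 and mat3 are the two halves of mat1 as seen by TP ranks 0 and 1):

```python
import torch
import torch.nn.functional as F

def glu(x):
    # Per-rank GLU: chunk the local tensor in half along the last dim.
    x = torch.chunk(x, 2, dim=-1)
    return F.silu(x[0]) * x[1]

torch.manual_seed(0)
mat1 = torch.randn(4, 8)                   # full fc1 output, no TP
mat2, mat3 = torch.chunk(mat1, 2, dim=-1)  # the two shards seen by TP ranks 0 and 1

r1 = glu(mat1)   # no-TP result: pairs columns 0-3 with columns 4-7
r2 = glu(mat2)   # TP rank 0: pairs columns 0-1 with columns 2-3
r3 = glu(mat3)   # TP rank 1: pairs columns 4-5 with columns 6-7

# Combining the per-rank outputs does not reproduce the no-TP result,
# because each rank gates a different pairing of column blocks.
print(torch.allclose(r1, torch.cat([r2, r3], dim=-1)))  # False
```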
The reduced results from r2 and r3 do not match r1 because, when the TP degree > 1, each TP rank multiplies incorrect tensor values compared to the non-TP case.
-
The issue with the GLU activation in Tensor Parallel is causing correctness problems that are blocking training. An update on this or any suggestions for moving forward would be greatly appreciated.
-
I see your point. In your example, the results are different because of a different tensor layout: the order of the TP and GLU sharding is shuffled. In practice, this shouldn't affect training, because the linear layers are learned. It might, however, affect training where the parallelism strategy or model architecture is changed mid-training.
-
Thanks for your response. My primary concern is with fine-tuning. If we pretrain using TP=2 and then load the checkpoint to fine-tune with TP=1, or any config other than TP=2, we would see loss issues. Do you have any suggestions for addressing this, or are there any plans on your end to provide a fix?
-
A workaround is to manually convert the tensor layout when you switch to fine-tuning.
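For illustration, a minimal sketch of such a conversion, assuming (hypothetically; verify against your actual checkpoint layout) that each TP rank stores its fc1 weight as [gate_shard; up_shard] along the output dimension, so that gathering two ranks yields [G0; U0; G1; U1] while TP=1 GLU chunking expects [G0; G1; U0; U1]:

```python
import torch

def regroup_glu_fc1(fc1_weight_full: torch.Tensor, old_tp: int) -> torch.Tensor:
    """Hypothetical helper: reorder a gathered fc1 weight from per-rank
    [G_r; U_r] blocks into a global [G; U] layout, so that TP=1 GLU chunking
    pairs the intended gate/up halves.

    Assumes fc1_weight_full has shape [2 * ffn_hidden, hidden_size] and was
    built by concatenating the old_tp rank shards along dim 0.
    """
    two_ffn, hidden = fc1_weight_full.shape
    rows_per_rank = two_ffn // old_tp
    # Split into per-rank blocks, then split each block into its gate / up halves.
    blocks = fc1_weight_full.view(old_tp, rows_per_rank, hidden)
    gate, up = blocks.chunk(2, dim=1)   # each: [old_tp, rows_per_rank // 2, hidden]
    # All gate shards first (in rank order), then all up shards.
    return torch.cat([gate.reshape(-1, hidden), up.reshape(-1, hidden)], dim=0)
```

Under the same assumption, the row-parallel fc2 weight gathered in rank order already lines up with the regrouped GLU output, but both layouts should be double-checked against the actual GroupedMLP checkpoint format before relying on this.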
-
Description:
When training with GroupedMLP and Tensor Parallel (TP) enabled, and gated_linear_unit is activated, the activation function is applied to fc1_output. Assuming a TP degree of 2, this intermediate output contains only half of the information, since it holds the tensor values of a single TP rank. Applying the GLU activation function to this output loses information, because only half of the tensor values take part in the activation. Specifically, in the GLU function (https://github.com/NVIDIA/Megatron-LM/blob/core_v0.7.0/megatron/core/transformer/moe/experts.py#L48):
self.config.activation_func(x[0]) * x[1]
With TP enabled, both self.config.activation_func(x[0]) and x[1] are computed from only the rank-local half of the fc1 output, so the resulting activation does not match the one obtained when training without TP.
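To make this concrete, a small sketch that tracks which feature indices end up in x[0] and x[1] on one rank (the sizes, and the assumption that rank 0 holds the first contiguous half of the fc1 output, are illustrative):

```python
import torch

ffn_hidden, tp = 8, 2
full = torch.arange(2 * ffn_hidden)       # feature indices 0..15 of the full fc1 output
rank0 = torch.chunk(full, tp, dim=-1)[0]  # assume rank 0 holds the first half: indices 0..7

x = torch.chunk(rank0, 2, dim=-1)         # what the per-rank GLU sees on rank 0
print(x[0].tolist())  # [0, 1, 2, 3] -> the activated half uses only rank-0 values
print(x[1].tolist())  # [4, 5, 6, 7] -> so does the multiplier half
# Without TP, chunking `full` would instead pair indices 0..7 with 8..15.
```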
Steps to Reproduce:
Expected Behavior:
The activation function should correctly handle the tensor values across all TP ranks to prevent any loss of information, ensuring consistency with results obtained without TP.
Actual Behavior:
The GLU activation function is applied to tensor values that only represent half of the full tensor due to TP, leading to inconsistent results.