Hello, I really appreciate you sharing this work.
I have a question:
During my De-KD experiments, I noticed that in the line

D_KL = nn.KLDivLoss()(F.log_softmax(outputs / T, dim=1), F.softmax(teacher_outputs / T, dim=1)) * (T * T)

nn.KLDivLoss() is constructed with its default reduction, "mean". Using reduction="batchmean" might be more appropriate for computing this loss: with "mean", the KL value is averaged over all elements (batch size × number of classes) rather than summed over the class dimension and averaged over the batch, so it comes out quite small. This might not be ideal for knowledge distillation (see the sketch below).
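For reference, here is a minimal sketch of the change I have in mind. The function name kd_loss, the default values of T and alpha, and the alpha-weighted combination of the two terms are my own illustration of the standard KD formulation, not necessarily the exact code in this repository; outputs, teacher_outputs, D_KL, and loss_CE follow the names in the snippet above.

```python
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(outputs, teacher_outputs, labels, T=4.0, alpha=0.9):
    # Soft-target term: "batchmean" sums the KL over the class dimension
    # and averages over the batch, which matches the mathematical
    # definition of KL divergence (the default "mean" averages over
    # every element and gives a much smaller value).
    D_KL = nn.KLDivLoss(reduction="batchmean")(
        F.log_softmax(outputs / T, dim=1),
        F.softmax(teacher_outputs / T, dim=1),
    ) * (T * T)

    # Hard-target term: standard cross-entropy with the ground-truth labels.
    loss_CE = F.cross_entropy(outputs, labels)

    # Weighted combination commonly used in knowledge distillation
    # (assumed here; the repository's exact weighting may differ).
    return alpha * D_KL + (1.0 - alpha) * loss_CE
```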
I printed the values of loss_CE and D_KL in the code, and I found that even with the factor alpha, the two losses are not well balanced. Here’s some of the output: