
dot_product should be s_loss_old - s_loss_new, not s_loss_new - s_loss_old? #6

Open

zwd973 opened this issue Mar 3, 2021 · 28 comments

@zwd973 commented Mar 3, 2021

Hello, thanks for your PyTorch implementation of MPL. I think the dot_product should be s_loss_old - s_loss_new, not s_loss_new - s_loss_old, for the reason here:
[image: derivation]
just flip a coin

Am I wrong?

@dagleaves commented Mar 4, 2021

Where is that from?

According to the original code, it is s_loss_new - s_loss_old:
Link
dot_product = cross_entropy['s_on_l_new'] - shadow

where shadow is defined as:
Link

shadow = tf.get_variable(name='cross_entropy_old', shape=[], trainable=False, dtype=tf.float32)
shadow_update = tf.assign(shadow, cross_entropy['s_on_l_old'])
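For comparison, here is a minimal PyTorch sketch of the same bookkeeping (a hypothetical translation, not code from either repo): a detached tensor plays the role of the non-trainable `shadow` variable, holding the previous step's student loss on the labeled batch.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `s_loss_old` plays the role of the TF `shadow`
# variable above, i.e. last step's student loss on the labeled batch.
s_loss_old = torch.tensor(0.0)

def dot_product_step(s_logits_l, targets):
    global s_loss_old
    s_loss_new = F.cross_entropy(s_logits_l, targets)
    dot_product = s_loss_new - s_loss_old  # new - old, the sign the TF code uses
    s_loss_old = s_loss_new.detach()       # mirrors `shadow_update`
    return dot_product
```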

@monney commented Mar 4, 2021

@zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:

First order Taylor:
f(x) = f(a) + f'(x)(x-a)

let f(x) be the cross entropy function, x is the new parameters
let a be x+h, or the old parameters (h is the gradient on the unlabeled data for the old parameters)
f(x) = f(x+h) + f'(x)*(x-x+h)
f(x) - f(x+h) = f'(x) * h

this is new cross entropy minus old.

@zwd973 (Author) commented Mar 4, 2021

> @zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:
>
> First order Taylor:
> f(x) = f(a) + f'(x)(x-a)
>
> let f(x) be the cross entropy function, x is the new parameters
> let a be x+h, or the old parameters (h is the gradient on the unlabeled data for the old parameters)
> f(x) = f(x+h) + f'(x)*(x-x+h)
> f(x) - f(x+h) = f'(x) * h
>
> this is new cross entropy minus old.
Yeah, but f(x) = f(x+h) + f'(x)*(x-(x+h)) = f(x+h) + f'(x)*(x-x-h), not f(x+h) + f'(x)*(x-x+h), isn't it?

@monney commented Mar 4, 2021

@zwd973 You're right, I missed a negative. Interesting. The original author's code is wrong here then

@zwd973 (Author) commented Mar 4, 2021

OK, thanks.

@monney commented Mar 4, 2021

@kekmodel this might be why it got worse? Though I'm not sure how the author was able to replicate the results.

@kekmodel (Owner) commented Mar 4, 2021

I agree. I thought the sign changes in the process of calculating moving_dot_product, but I confirmed that it does not. I will test again. Thanks!

@zwd973 (Author) commented Mar 5, 2021

@kekmodel Hello, how are the new results?

@kekmodel (Owner) commented Mar 6, 2021

Unfortunately, all test accuracies are about 94.4. The MPL loss doesn't seem to have an effect. I'll have to wait for the author's code update.

@monney commented Mar 6, 2021

@kekmodel that's unfortunate to hear, but thank you for all your work thus far. The number of discrepancies in the original code makes things quite difficult.

@dgedon commented Mar 16, 2021

If I am not mistaken, then the first order Taylor expansion goes as
f(x) = f(a) + f'(a)(x-a).
So there is f'(a) instead of f'(x). Then with the same notation as above: f(x) as cross-entropy, x as new parameters, a=x+h as old parameters, we get

f(x) = f(x+h) + f'(x+h)(x-(x+h)) 
     = f(x+h) - f'(x+h) h
f(x+h)-f(x) = f'(x+h) h

where h is described above as the gradient. This already is a problem since on the right hand side it is not the dot product between the gradient on the new parameters and the gradient on the old parameters. It is more the gradient on the old parameters squared, if I understand correctly.

From that perspective the first order Taylor approximation does not make sense. Can you confirm or tell me if and where I am wrong?

@monney commented Mar 16, 2021

> From that perspective the first order Taylor approximation does not make sense. Can you confirm or tell me if and where I am wrong?

You’re correct from what I can see. Sorry, I did the derivation quickly and haphazardly, which is why it’s wrong lol.

This quantity still has meaning, since h is the gradient produced by the loss on the unlabeled target and this is the loss on the labeled data. So we're essentially trying to get the teacher to produce the same loss as if the student were training on labeled data, but this also doesn't seem to be what was derived in the paper. There's supposed to be a time offset.

@hyhieu commented Mar 16, 2021

About your derivations. I do not see anything wrong with @dgedon's derivation.

> f(x) = f(a) + f'(a)(x-a).
> So there is f'(a) instead of f'(x). Then with the same notation as above: f(x) as cross-entropy, x as new parameters, a=x+h as old parameters, we get
>
> f(x) = f(x+h) + f'(x+h)(x-(x+h))
>      = f(x+h) - f'(x+h) h
> f(x+h)-f(x) = f'(x+h) h

Comparing this to my derivation below, it looks like the difference is in the very first place, where you start at f(x+h) while I start at f(x). Details are in the equations, but intuitively, I think Taylor expansion says that locally, functions behave linearly in their gradients' direction. That is why we can start at either f(x+h) which will lead to your derivation, or at f(x) which will lead to my derivation below.


About Taylor. My understanding is as follows. Using your notations, x is the new parameters, x+h is the old parameters, and h is the gradient computed at the old parameters (so h is used in order to go from x+h to x).

       f(x+h) = f(x) + f'(x) * h
f(x+h) - f(x) = f'(x) * h
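In other words (a summary of the two derivations in this thread), both are valid first-order expansions; they differ only in the base point:

$$f(x) - f(x+h) = -f'(x+h)\,h \qquad \text{(expanding around } a = x+h\text{)}$$

$$f(x+h) - f(x) = f'(x)\,h \qquad \text{(expanding around } a = x\text{)}$$

and the two right-hand sides agree up to $O(\lVert h\rVert^2)$.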

[image: Equation 12 from the paper]
Relating this to Equation 12 in the paper, which I copied above: x is \theta'_S (the red box) and h is the blue box (sorry, this is a different h from the scalar h in the screenshot).


About using soft labels. If you use soft labels, you do not even need Taylor or the log-gradient trick, because the entire process is differentiable and you can do some Hessian/Jacobian vector product tricks instead.

In my implementation for this, I created shadow variables that hold the student's parameters, then built a computational graph to compute the gradients of these shadow variables using tf.gradients. Then, I manually computed the derivative with respect to the optimizers (note that everything we are discussing here is still subject to the computations inside optimizers such as RMSProp or Momentum). From these values, you can follow this guide to compute the correct, non-approximated gradient for the teacher.

For ugly reasons (exceeding graph proto size limits, if you are curious), this implementation did not run with GShard, which we used for model parallelism, so we decided to use the approximation instead.
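To make the soft-label route concrete, here is a toy sketch (my own illustration with a linear model and plain SGD, not the released code; a real setup would also have to account for the optimizer internals mentioned above):

```python
import torch
import torch.nn.functional as F

# Toy sketch of the exact, non-approximated teacher gradient: with soft
# pseudo labels the student update stays differentiable end to end.
torch.manual_seed(0)
W_t = torch.randn(3, 5, requires_grad=True)  # teacher weights (linear model)
W_s = torch.randn(3, 5, requires_grad=True)  # student weights
x_u, x_l = torch.randn(4, 5), torch.randn(4, 5)
y_l = torch.randint(0, 3, (4,))
lr = 0.1

pseudo = torch.softmax(x_u @ W_t.t(), dim=-1)                 # soft labels, graph kept
s_loss_u = -(pseudo * F.log_softmax(x_u @ W_s.t(), dim=-1)).sum(-1).mean()
g_s, = torch.autograd.grad(s_loss_u, W_s, create_graph=True)  # grad at old student params
W_s_new = W_s - lr * g_s                                      # one SGD step, still on the graph
s_loss_l_new = F.cross_entropy(x_l @ W_s_new.t(), y_l)
g_t, = torch.autograd.grad(s_loss_l_new, W_t)                 # exact teacher gradient
```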


Update code. I got some internal pushback because I was trying to update the code and release the trained checkpoints at the same time. I apologize for the delay, and will try to push on this more.

@dgedon commented Mar 17, 2021

Thanks @hyhieu.

About Taylor expansion: It works out nicely when you start your way. However, I have two follow-up remarks/questions on it:

  1. With your derivation I think you have student loss old - student loss new, which differs from your implementation here, where you have loss new - loss old, and from this repository:
    dot_product = s_loss_l_new - s_loss_l_old

    which is actually the main discussion point of this issue.
  2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, in this repository's code it is the following. Is this a mistake, or do I misinterpret the equations?
    s_loss_l_old = F.cross_entropy(s_logits_l.detach(), targets)

About soft labels: I have to think this through a bit more. In (10) of your paper, when using soft labels you have, instead of a one-hot encoding with hard labels for \hat{y}_u, just a 'smoothed' version. From this point I don't understand how this changes the derivation.

@zxhuang97 commented

> 2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, in this repository's code it is the following. Is this a mistake, or do I misinterpret the equations?
>
>     s_loss_l_old = F.cross_entropy(s_logits_l.detach(), targets)

@dgedon For the second question, I think we are approximating the red box (gradients of the loss on labeled data w.r.t. the updated parameters). So when using finite differences, we should use the same data (the labeled data) with different parameters (old/new).
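A minimal sketch of that reading (illustrative names, not necessarily the repo's exact code): the same labeled batch is evaluated at the old and the updated student parameters, and only the scalar difference is kept.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: finite difference over the SAME labeled batch,
# before (theta_s) and after (theta_s') the student update.
def taylor_dot_product(s_logits_l_old: torch.Tensor,
                       s_logits_l_new: torch.Tensor,
                       targets: torch.Tensor) -> torch.Tensor:
    s_loss_l_old = F.cross_entropy(s_logits_l_old.detach(), targets)
    s_loss_l_new = F.cross_entropy(s_logits_l_new.detach(), targets)
    return s_loss_l_old - s_loss_l_new  # old - new, the sign this thread converges on
```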

@easonyang1996 commented May 28, 2021

@kekmodel Hi, thanks for the implementation!
So, is it clear now which one is right: loss_new - loss_old or loss_old - loss_new?

@Adamdad commented May 31, 2021

[image: derivation]
This might be a clearer derivation.

@monney commented May 31, 2021

I think the correct formula is old - new, based on the several derivations that have been done here. But I don't think the MPL loss really has an effect either way, judging from the experiments here, my own experiments, and the fact that the reference code has the sign flipped but still replicates the results.

I have a custom implementation I did at work; on our datasets I was able to get good results with it. It beat UDA alone and the other contrastive techniques I tried. As an aside, it only worked if I used a much larger unlabeled batch size (7x multiplier); this is similar to the released code, but the paper claimed it should work 1:1.

I ran an extensive hyperparameter search to see if the MPL loss helps at all; it seemed to make no real difference no matter the settings, at least on the several internal problems I tried it on. They are of comparable size and difficulty to CIFAR-10, and one is much larger and closer to ImageNet. I also tried several networks. The hyperparameter search did not tend toward keeping the loss or not: there was no statistical difference among the temperatures of the loss, including a temperature of 1.0, which disables the loss, and taking the best settings with or without the MPL loss active seems to make no difference. Maybe it helps for ImageNet or CIFAR-10, but the experiments here don't support that. It certainly does not help for the various problems I tried it on. That being said, the procedure itself works quite well, just not, I think, because of the MPL loss.

@kekmodel not sure if you have run experiments with larger unlabeled batch sizes, but it's probably worth trying: I couldn't get it to work without this, and under this setting it performs better than anything else I tried.

@zxhuang97
Copy link

@monney Thank you for your valuable insights. I have some follow-up questions regarding your experiments.

  1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

  2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after a hyper-parameter search (including larger batch sizes)?

Thanks : )

@monney commented May 31, 2021

> 1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

All my experiments for both were done with larger unlabeled batch sizes and similar training. The benefit almost certainly comes from the self-distillation procedure and the unique finetuning phase of MPL.

> 2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after hyper-parameter search (including larger batch size)?

It works, and works better than the other contrastive learning methods I've tried (UDA, BYOL, SimCLR, NoisyStudent). But the actual MPL loss seems to have no major effect on the results, and I think the other differences of this paper are largely responsible for the increased performance. My guess is that in the end the paper ends up being very similar to FixMatch.

cheers

@zxhuang97 commented May 31, 2021

@monney I see. That's a little surprising, as the MPL objective makes a lot of sense to me. Also, Figure 3 in the appendix breaks down the contribution of each component, and it shows that using the MPL loss makes a huge difference.

@monney commented May 31, 2021

@zxhuang97 it makes a lot of sense to me as well, so it's confusing. I'll update if I find bugs or anything, but I've done a lot of testing. The breakdown in Fig. 3 will include the entire MPL procedure, I'm pretty sure, so it's difficult to isolate just the loss contribution. UDA is just the standard UDA procedure.

@zxhuang97 commented

> The breakdown in Fig. 3 will include the entire MPL procedure I'm pretty sure, so it's difficult to isolate just the loss contribution.

I guess you're right. The UDA module in the official implementation doesn't include the teacher & student machinery, so it's not really a fair comparison. Thank you for the information!

@jacobunderlinebenseal commented

When training converges, theoretically both s_loss_old - s_loss_new and s_loss_new - s_loss_old will be zero; is this the way it should be? Has anyone tried computing the dot product without the Taylor approximation? Does it work?

@Jacfger commented Jan 26, 2022

@jacobunderlinebenseal
Wouldn't it make sense for it to be zero, though? The idea of the teacher model is to receive feedback from the performance of the student. When the student is good enough (or "converging", I suppose), shouldn't it get a near-zero update?

@milanlx commented Mar 31, 2022

> [image] This might be a clearer derivation.

One question: in the paper the product is between the supervised and the unsupervised gradient, which is different from the code.

@DaehanKim commented

I read the whole thread and did the derivation again.
I believe the correct implementation is old_s_loss - new_s_loss.

The first-order Taylor expansion goes
$$f(x) = f(a) + f'(a)(x-a)$$

Let $a = x+h$ be the new parameters,
and let $f(\cdot)$ be the cross-entropy loss, as above.

Then old_s_loss - new_s_loss becomes
$$f(x) - f(x+h) = -f'(x+h) \cdot h$$

and by definition

$$h=-\eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s}))$$

and $f'(x+h)$ becomes

$$f'(x+h) = \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

Thus,

$$f(x)-f(x+h) = \eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s})) \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

And this quantity is what we see in the paper:

[image: Equation 12 from the paper]
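A quick numerical sanity check of this derivation (my own toy sketch with a linear model and hard pseudo labels standing in for $\hat{y}_u$; all names are illustrative):

```python
import torch
import torch.nn.functional as F

# Check: old_s_loss - new_s_loss ~= eta * <grad_u(theta_s), grad_l(theta_s')>
torch.manual_seed(0)
W = torch.randn(3, 5, requires_grad=True)          # theta_s (linear model)
x_u, x_l = torch.randn(8, 5), torch.randn(8, 5)
y_u = torch.randint(0, 3, (8,))                    # stands in for hard pseudo labels
y_l = torch.randint(0, 3, (8,))
eta = 1e-3

g_u, = torch.autograd.grad(F.cross_entropy(x_u @ W.t(), y_u), W)
W_new = (W - eta * g_u).detach().requires_grad_()  # theta_s'
g_l, = torch.autograd.grad(F.cross_entropy(x_l @ W_new.t(), y_l), W_new)

lhs = F.cross_entropy(x_l @ W.t(), y_l) - F.cross_entropy(x_l @ W_new.t(), y_l)
rhs = eta * (g_u * g_l).sum()
print(lhs.item(), rhs.item())  # the two should agree to first order for small eta
```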

@mxqmxqmxq commented

> It works, and works better than the other contrastive learning methods I've tried (UDA, BYOL, SimCLR, NoisyStudent). But the actual MPL loss seems to have no major effect on the results, and I think the other differences of this paper are largely responsible for the increased performance.

I am planning to use the MPL loss in my own project, and I wanted to kindly ask for your opinion: in your experience, have you found this method to deliver effective results? I just wanted to make sure I understand its impact accurately before implementing it.
