
dot_product should be s_loss_old - s_loss_new, not s_loss_new - s_loss_old? #6

Open

zwd973 opened this issue Mar 3, 2021 · 28 comments

@zwd973 commented Mar 3, 2021

Hello, thanks for your PyTorch implementation of MPL. I think the dot_product should be s_loss_old - s_loss_new, not s_loss_new - s_loss_old, for the reason here:
[image: derivation]
just flip a coin

Am I wrong?

@dagleaves commented Mar 4, 2021

Where is that from?

According to the original code, it is s_loss_new - s_loss_old:
Link
dot_product = cross_entropy['s_on_l_new'] - shadow

where shadow is defined as:
Link

shadow = tf.get_variable(name='cross_entropy_old', shape=[], trainable=False, dtype=tf.float32)
shadow_update = tf.assign(shadow, cross_entropy['s_on_l_old'])
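For comparison, here is a minimal PyTorch sketch of the same bookkeeping (a hypothetical translation, not code from either repo): a detached tensor plays the role of the non-trainable `shadow` variable, holding the previous step's student loss on the labeled batch.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `s_loss_old` plays the role of the TF `shadow`
# variable above, i.e. last step's student loss on the labeled batch.
s_loss_old = torch.tensor(0.0)

def dot_product_step(s_logits_l, targets):
    global s_loss_old
    s_loss_new = F.cross_entropy(s_logits_l, targets)
    dot_product = s_loss_new - s_loss_old  # new - old, the sign the TF code uses
    s_loss_old = s_loss_new.detach()       # mirrors `shadow_update`
    return dot_product
```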

@monney commented Mar 4, 2021

@zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:

First order Taylor:
f(x) = f(a) + f'(x)(x-a)

let f(x) be the cross entropy function, x is the new parameters
let a be x+h, or the old parameters (h is the gradient on the unlabeled data for the old parameters)
f(x) = f(x+h) + f'(x)*(x-x+h)
f(x) - f(x+h) = f'(x) * h

this is new cross entropy minus old.

@zwd973 (Author) commented Mar 4, 2021

> @zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:
>
> First order Taylor:
> f(x) = f(a) + f'(x)(x-a)
>
> let f(x) be the cross entropy function, x is the new parameters
> let a be x+h, or the old parameters (h is the gradient on the unlabeled data for the old parameters)
> f(x) = f(x+h) + f'(x)*(x-x+h)
> f(x) - f(x+h) = f'(x) * h
>
> this is new cross entropy minus old.
Yeah, but f(x) = f(x+h) + f'(x)*(x-(x+h)) = f(x+h) + f'(x)*(x-x-h), not f(x+h) + f'(x)*(x-x+h), isn't it?

@monney commented Mar 4, 2021

@zwd973 You're right, I missed a negative. Interesting. The original author's code is wrong here then

@zwd973 (Author) commented Mar 4, 2021

OK, thanks.

@monney commented Mar 4, 2021

@kekmodel this might be why it got worse? Though I'm not sure how the author was able to replicate the results.

@kekmodel (Owner) commented Mar 4, 2021

I agree. I thought the sign changes in the process of calculating moving_dot_product, but I confirmed that it does not. I will test again. Thanks!

@zwd973 (Author) commented Mar 5, 2021

@kekmodel Hello, how are the new results?

@kekmodel (Owner) commented Mar 6, 2021

Unfortunately, all test accuracies are about 94.4. The MPL loss doesn't seem to have an effect. I'll have to wait for the author's code update.

@monney commented Mar 6, 2021

@kekmodel that's unfortunate to hear, but thank you for all your work thus far. The number of discrepancies in the original code makes things quite difficult.

@dgedon commented Mar 16, 2021

If I am not mistaken, then the first order Taylor expansion goes as
f(x) = f(a) + f'(a)(x-a).
So there is f'(a) instead of f'(x). Then with the same notation as above: f(x) as cross-entropy, x as new parameters, a=x+h as old parameters, we get

f(x) = f(x+h) + f'(x+h)(x-(x+h)) 
     = f(x+h) - f'(x+h) h
f(x+h)-f(x) = f'(x+h) h

where h is described above as the gradient. This already is a problem since on the right hand side it is not the dot product between the gradient on the new parameters and the gradient on the old parameters. It is more the gradient on the old parameters squared, if I understand correctly.

From that perspective the first order Taylor approximation does not make sense. Can you confirm or tell me if and where I am wrong?

@monney commented Mar 16, 2021

> From that perspective the first order Taylor approximation does not make sense. Can you confirm or tell me if and where I am wrong?

You’re correct from what I can see. Sorry, I did the derivation quickly and haphazardly, which is why it’s wrong lol.

This quantity still has meaning, since h is the gradient produced by the loss on the unlabeled target and this is the loss on the labeled data. So we're essentially trying to get the teacher to produce the same loss as if the student were training on labeled data, but this also doesn't seem to be what was derived in the paper. There's supposed to be a time offset.

@hyhieu commented Mar 16, 2021

About your derivations. I do not see anything wrong with @dgedon's derivation.

> f(x) = f(a) + f'(a)(x-a).
> So there is f'(a) instead of f'(x). Then with the same notation as above: f(x) as cross-entropy, x as new parameters, a=x+h as old parameters, we get
>
> f(x) = f(x+h) + f'(x+h)(x-(x+h))
>      = f(x+h) - f'(x+h) h
> f(x+h)-f(x) = f'(x+h) h

Comparing this to my derivation below, it looks like the difference is in the very first place, where you start at f(x+h) while I start at f(x). Details are in the equations, but intuitively, I think Taylor expansion says that locally, functions behave linearly in their gradients' direction. That is why we can start at either f(x+h) which will lead to your derivation, or at f(x) which will lead to my derivation below.


About Taylor. My understanding is as follows. Using your notations, x is the new parameters, x+h is the old parameters, and h is the gradient computed at the old parameters (so h is used in order to go from x+h to x).

       f(x+h) = f(x) + f'(x) * h
f(x+h) - f(x) = f'(x) * h
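In other words (a summary of the two derivations in this thread), both are valid first-order expansions; they differ only in the base point:

$$f(x) - f(x+h) = -f'(x+h)\,h \qquad \text{(expanding around } a = x+h\text{)}$$

$$f(x+h) - f(x) = f'(x)\,h \qquad \text{(expanding around } a = x\text{)}$$

and the two right-hand sides agree up to $O(\lVert h\rVert^2)$.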

[image: Equation 12 from the paper]
Relating this to Equation 12 in the paper, which I copied above: x is \theta'_S (the red box) and h is the blue box (sorry, this is a different h from the scalar h in the screenshot).


About using soft labels. If you use soft labels, you do not even need Taylor or the log-gradient trick, because the entire process is differentiable and you can do some Hessian/Jacobian vector product tricks instead.

In my implementation for this, I created shadow variables that hold the student's parameters, then built a computational graph to compute the gradients of these shadow variables using tf.gradients. Then, I manually computed the derivative with respect to the optimizers (note that everything we are discussing here is still subject to the computations inside optimizers such as RMSProp or Momentum). From these values, you can follow this guide to compute the correct, non-approximated gradient for the teacher.

For ugly reasons (exceeding graph proto size limits, if you are curious), this implementation did not run with GShard, which we used for model parallelism, so we decided to use the approximation instead.
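To make the soft-label route concrete, here is a toy sketch (my own illustration with a linear model and plain SGD, not the released code; a real setup would also have to account for the optimizer internals mentioned above):

```python
import torch
import torch.nn.functional as F

# Toy sketch of the exact, non-approximated teacher gradient: with soft
# pseudo labels the student update stays differentiable end to end.
torch.manual_seed(0)
W_t = torch.randn(3, 5, requires_grad=True)  # teacher weights (linear model)
W_s = torch.randn(3, 5, requires_grad=True)  # student weights
x_u, x_l = torch.randn(4, 5), torch.randn(4, 5)
y_l = torch.randint(0, 3, (4,))
lr = 0.1

pseudo = torch.softmax(x_u @ W_t.t(), dim=-1)                 # soft labels, graph kept
s_loss_u = -(pseudo * F.log_softmax(x_u @ W_s.t(), dim=-1)).sum(-1).mean()
g_s, = torch.autograd.grad(s_loss_u, W_s, create_graph=True)  # grad at old student params
W_s_new = W_s - lr * g_s                                      # one SGD step, still on the graph
s_loss_l_new = F.cross_entropy(x_l @ W_s_new.t(), y_l)
g_t, = torch.autograd.grad(s_loss_l_new, W_t)                 # exact teacher gradient
```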


Update code. I got some internal pushback because I was trying to update the code and release the trained checkpoints at the same time. I apologize for the delay, and will try to push on this more.

@dgedon commented Mar 17, 2021

Thanks @hyhieu.

About Taylor expansion: It works out nicely when you start your way. However, I have two follow-up remarks/questions on it:

  1. With your derivation I think you have student loss old - student loss new, which differs from your implementation here, where you have loss new - loss old, and from this repository:
    dot_product = s_loss_l_new - s_loss_l_old

    which is actually the main discussion point of this issue.
  2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, in this repository's code it is the following. Is this a mistake, or do I misinterpret the equations?
    s_loss_l_old = F.cross_entropy(s_logits_l.detach(), targets)

About soft labels: I have to think this through a bit more. In (10) of your paper, when using soft labels you have, instead of a one-hot encoding with hard labels for \hat{y}_u, just a 'smoothed' version. From this point I don't understand how this changes the derivation.

@zxhuang97 commented

> 2. When comparing f(x+h) - f(x) = f'(x) * h with (12) from your paper, I assume one should take the pseudo-labeled data (x_u, \hat{y}_u) for the 'student loss old' (and the labeled data (x_l, y_l) for the 'student loss new'). However, in this repository's code it is the following. Is this a mistake, or do I misinterpret the equations?
>
>     s_loss_l_old = F.cross_entropy(s_logits_l.detach(), targets)

@dgedon For the second question, I think we are approximating the red box (gradients of the loss on labeled data w.r.t. the updated parameters). So when using finite differences, we should use the same data (the labeled data) with different parameters (old/new).
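A minimal sketch of that reading (illustrative names, not necessarily the repo's exact code): the same labeled batch is evaluated at the old and the updated student parameters, and only the scalar difference is kept.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: finite difference over the SAME labeled batch,
# before (theta_s) and after (theta_s') the student update.
def taylor_dot_product(s_logits_l_old: torch.Tensor,
                       s_logits_l_new: torch.Tensor,
                       targets: torch.Tensor) -> torch.Tensor:
    s_loss_l_old = F.cross_entropy(s_logits_l_old.detach(), targets)
    s_loss_l_new = F.cross_entropy(s_logits_l_new.detach(), targets)
    return s_loss_l_old - s_loss_l_new  # old - new, the sign this thread converges on
```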

@easonyang1996 commented May 28, 2021

@kekmodel Hi, thanks for the implementation!
So, is it clear now which one is right: loss_new - loss_old or loss_old - loss_new?

@Adamdad commented May 31, 2021

[image: derivation]
This might be a clearer derivation.

@monney commented May 31, 2021

I think the correct formula is old - new, based on the several derivations that have been done here. But I don't think the MPL loss really has an effect either way, judging from the experiments here, my own experiments, and the fact that the reference code has the sign flipped but still replicates the results.

I have a custom implementation I did at work; on our datasets I was able to get good results with it. It beat UDA alone and the other contrastive techniques I tried. As an aside, it only worked if I used a much larger unlabeled batch size (7x multiplier); this is similar to the released code, but the paper claimed it should work 1:1.

I ran an extensive hyperparameter search to see if the MPL loss helps at all; it seemed to make no real difference no matter the settings, at least on the several internal problems I tried it on. They are of comparable size and difficulty to CIFAR-10, and one is much larger and closer to ImageNet. I also tried several networks. The hyperparameter search did not tend toward keeping the loss or not: there was no statistical difference among the temperatures of the loss, including a temperature of 1.0, which disables the loss, and taking the best settings with or without the MPL loss active seems to make no difference. Maybe it helps for ImageNet or CIFAR-10, but the experiments here don't support that. It certainly does not help for the various problems I tried it on. That being said, the procedure itself works quite well, just not, I think, because of the MPL loss.

@kekmodel not sure if you have run experiments with larger unlabeled batch sizes, but it's probably worth trying: I couldn't get it to work without this, and under this setting it performs better than anything else I tried.

@zxhuang97
Copy link

@monney Thank you for your valuable insights. I have some follow-up questions regarding your experiments.

  1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

  2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after a hyper-parameter search (including larger batch sizes)?

Thanks : )

@monney commented May 31, 2021

> 1. When you say "it beat UDA alone", do you mean "MPL+UDA+large unlabeled batch size" beats "UDA+large unlabeled batch size"? Is it possible that the performance gain comes from a larger batch size?

All my experiments for both were done with larger unlabeled batch sizes and similar training. The benefit almost certainly comes from the self-distillation procedure and the unique finetuning phase of MPL.

> 2. In the third paragraph, do you mean that MPL doesn't work for your own problem even after hyper-parameter search (including larger batch size)?

It works, and works better than the other contrastive learning methods I've tried (UDA, BYOL, SimCLR, NoisyStudent). But the actual MPL loss seems to have no major effect on the results, and I think the other differences of this paper are largely responsible for the increased performance. My guess is that in the end the paper ends up being very similar to FixMatch.

cheers

@zxhuang97 commented May 31, 2021

@monney I see. That's a little surprising, as the MPL objective makes a lot of sense to me. Also, Figure 3 in the appendix breaks down the contribution of each component, and it shows that using the MPL loss makes a huge difference.

@monney commented May 31, 2021

@zxhuang97 it makes a lot of sense to me as well, so it's confusing. I'll update if I find bugs or anything, but I've done a lot of testing. The breakdown in Fig. 3 will include the entire MPL procedure, I'm pretty sure, so it's difficult to isolate just the loss contribution. UDA is just the standard UDA procedure.

@zxhuang97 commented

> The breakdown in Fig. 3 will include the entire MPL procedure I'm pretty sure, so it's difficult to isolate just the loss contribution.

I guess you're right. The UDA module in the official implementation doesn't include the teacher & student machinery, so it's not really a fair comparison. Thank you for the information!

@jacobunderlinebenseal commented

When training converges, theoretically both s_loss_old - s_loss_new and s_loss_new - s_loss_old will be zero; is this the way it should be? Has anyone tried computing the dot product without the Taylor approximation? Does it work?

@Jacfger commented Jan 26, 2022

@jacobunderlinebenseal
Wouldn't it make sense for it to be zero, though? The idea of the teacher model is to receive feedback from the performance of the student. When the student is good enough (or "converging", I suppose), shouldn't it get a near-zero update?

@milanlx commented Mar 31, 2022

> [image] This might be a clearer derivation.

One question: in the paper the product is between the supervised and the unsupervised gradient, which is different from the code.

@DaehanKim commented

I read the whole thread and did the derivation again.
I believe the correct implementation is old_s_loss - new_s_loss.

The first-order Taylor expansion goes
$$f(x) = f(a) + f'(a)(x-a)$$

Let $a = x+h$ be the new parameters,
and let $f(\cdot)$ be the cross-entropy loss, as above.

Then old_s_loss - new_s_loss becomes
$$f(x) - f(x+h) = -f'(x+h) \cdot h$$

and by definition

$$h=-\eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s}))$$

and $f'(x+h)$ becomes

$$f'(x+h) = \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

Thus,

$$f(x)-f(x+h) = \eta_{s}\nabla_{\theta_s}CE(\hat{y}_u, S(x_u,\theta_{s})) \nabla_{\theta_{s}^{'}} CE(y_l, S(x_l,\theta_{s}^{'}))$$

And this quantity is what we see in the paper:

[image: Equation 12 from the paper]
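A quick numerical sanity check of this derivation (my own toy sketch with a linear model and hard pseudo labels standing in for $\hat{y}_u$; all names are illustrative):

```python
import torch
import torch.nn.functional as F

# Check: old_s_loss - new_s_loss ~= eta * <grad_u(theta_s), grad_l(theta_s')>
torch.manual_seed(0)
W = torch.randn(3, 5, requires_grad=True)          # theta_s (linear model)
x_u, x_l = torch.randn(8, 5), torch.randn(8, 5)
y_u = torch.randint(0, 3, (8,))                    # stands in for hard pseudo labels
y_l = torch.randint(0, 3, (8,))
eta = 1e-3

g_u, = torch.autograd.grad(F.cross_entropy(x_u @ W.t(), y_u), W)
W_new = (W - eta * g_u).detach().requires_grad_()  # theta_s'
g_l, = torch.autograd.grad(F.cross_entropy(x_l @ W_new.t(), y_l), W_new)

lhs = F.cross_entropy(x_l @ W.t(), y_l) - F.cross_entropy(x_l @ W_new.t(), y_l)
rhs = eta * (g_u * g_l).sum()
print(lhs.item(), rhs.item())  # the two should agree to first order for small eta
```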

@mxqmxqmxq commented

> It works, and works better than the other contrastive learning methods I've tried (UDA, BYOL, SimCLR, NoisyStudent). But the actual MPL loss seems to have no major effect on the results, and I think the other differences of this paper are largely responsible for the increased performance.

I am planning to use the MPL loss in my own project, and I wanted to kindly ask for your opinion: in your experience, have you found this method to deliver effective results? I just wanted to make sure I understand its impact accurately before implementing it.
