Replies: 1 comment 1 reply
-
Thanks for the heads up! I've actually been sent this paper multiple times, but only after we had already done our training runs. It is an interesting method that bears some similarities to DisTrO, and it's possible to learn a lot from it. Is there a reference implementation somewhere that we can use? It's also important that it works with Adam/AdamW, because other optimizers converge much more slowly when training LLMs and aren't even worth comparing against. I don't see any mention of Adam in that paper, and I don't think it would be fair to put an SGD-Momentum optimizer against AdamW or DisTrO-AdamW... SGD-Momentum would be at like 5.5 loss after 100B tokens while all the Adam derivatives are at like 2.5 or something... On a side note, did anyone ever test PowerSGD at 1B+ scales? That would help us a lot, since we can't run tests against every optimizer out there; that would be too expensive.
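For anyone who lands here: PyTorch ships a PowerSGD gradient-compression hook for DDP that is optimizer-agnostic, so in principle it can run underneath AdamW. A minimal sketch of wiring it up (the model, rank, and warm-up values are illustrative placeholders, not something tuned or validated at 1B+ scale):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes torchrun has set up the environment variables for distributed training.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])

# MyTransformer is a stand-in for whatever model is actually being trained.
model = DDP(MyTransformer().to(local_rank), device_ids=[local_rank])

# PowerSGD is a DDP communication hook: it low-rank-compresses each gradient bucket
# before the all-reduce, so the optimizer on top (here AdamW) is left unchanged.
state = powerSGD.PowerSGDState(
    process_group=None,            # default process group
    matrix_approximation_rank=4,   # illustrative rank; would need tuning at scale
    start_powerSGD_iter=1000,      # run vanilla all-reduce for the first 1000 steps
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```

Since the hook only touches gradient communication, it composes with Adam-family optimizers rather than replacing them.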
-
https://arxiv.org/abs/1905.13727
Not sure if you guys are aware of this paper (the references contain only one mention of the word "rank", from a 2024 paper, and no mention of this one).
A quick back-of-the-envelope calculation based on table 3 looks very much like what PowerSGD's low-rank gradient approximations would give when scaled up to larger/squarer matrices, and the authors even comment on this in the conclusion:
Even if DisTrO isn't related to this method, it's likely a more useful comparison point than table 3's 1.2B x fp16 x 32-node baseline.
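For context, here is the rough arithmetic behind that back-of-the-envelope estimate: a rank-r PowerSGD approximation of an n x m gradient sends roughly r(n+m) values instead of nm. The shapes below are illustrative stand-ins for large/square transformer layers, not numbers taken from table 3:

```python
# Rough compression-ratio arithmetic for rank-r PowerSGD on an n x m gradient matrix.
def powersgd_compression_ratio(n: int, m: int, rank: int) -> float:
    dense = n * m                  # values in the full gradient
    compressed = rank * (n + m)    # values in the rank-r factors P (n x r) and Q (m x r)
    return dense / compressed

for n, m, r in [(4096, 4096, 4), (8192, 8192, 4), (8192, 8192, 32)]:
    print(f"{n}x{m} at rank {r}: ~{powersgd_compression_ratio(n, m, r):.0f}x fewer values sent")
```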