Replies: 1 comment 1 reply
-
Thanks for the heads up! I've actually been sent this paper multiple times, but only after we had already done our training runs. It is an interesting method that bears some similarities to DisTrO, and it's possible to learn a lot from it. Is there a reference implementation somewhere that we can use? It's also important that it works with Adam/AdamW, because other optimizers converge much more slowly when training LLMs and aren't even worth comparing against. I don't see any mention of Adam in that paper, and I don't think it would be fair to put an SGD-Momentum optimizer against AdamW or DisTrO-AdamW... SGD-Momentum would be at like 5.5 loss after 100B tokens while all the Adam derivatives are at like 2.5 or something... On a side note, did anyone ever test PowerSGD at 1B+ scales? That would help us a lot, since we can't run tests against every optimizer out there; that would be too expensive.
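For anyone who lands here: PyTorch ships a PowerSGD gradient-compression hook for DDP that is optimizer-agnostic, so in principle it can run underneath AdamW. A minimal sketch of wiring it up (the model, rank, and warm-up values are illustrative placeholders, not something tuned or validated at 1B+ scale):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes torchrun has set up the environment variables for distributed training.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])

# MyTransformer is a stand-in for whatever model is actually being trained.
model = DDP(MyTransformer().to(local_rank), device_ids=[local_rank])

# PowerSGD is a DDP communication hook: it low-rank-compresses each gradient bucket
# before the all-reduce, so the optimizer on top (here AdamW) is left unchanged.
state = powerSGD.PowerSGDState(
    process_group=None,            # default process group
    matrix_approximation_rank=4,   # illustrative rank; would need tuning at scale
    start_powerSGD_iter=1000,      # run vanilla all-reduce for the first 1000 steps
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```

Since the hook only touches gradient communication, it composes with Adam-family optimizers rather than replacing them.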
-
https://arxiv.org/abs/1905.13727
Not sure if you guys are aware of this paper (the references contain only one mention of the word "rank", from a 2024 paper, and no mention of this one).
A quick back-of-the-envelope calculation based on table 3 looks very much like what PowerSGD's low-rank gradient approximations would give when scaled up to larger/squarer matrices, and the authors even comment on this in the conclusion:
Even if DisTrO isn't related to this method, it's likely a more useful comparison point than table 3's 1.2B x fp16 x 32-node baseline.
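For context, here is the rough arithmetic behind that back-of-the-envelope estimate: a rank-r PowerSGD approximation of an n x m gradient sends roughly r(n+m) values instead of nm. The shapes below are illustrative stand-ins for large/square transformer layers, not numbers taken from table 3:

```python
# Rough compression-ratio arithmetic for rank-r PowerSGD on an n x m gradient matrix.
def powersgd_compression_ratio(n: int, m: int, rank: int) -> float:
    dense = n * m                  # values in the full gradient
    compressed = rank * (n + m)    # values in the rank-r factors P (n x r) and Q (m x r)
    return dense / compressed

for n, m, r in [(4096, 4096, 4), (8192, 8192, 4), (8192, 8192, 32)]:
    print(f"{n}x{m} at rank {r}: ~{powersgd_compression_ratio(n, m, r):.0f}x fewer values sent")
```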