
Zero-grad more aggressively to save memory #71

Merged (1 commit) on Jan 20, 2023

Conversation

@cchan (Contributor) commented Jan 20, 2023

Same as karpathy/minGPT#106 - sorry for the spam, am actually using both of these repos (they've been insanely useful <3)

@karpathy (Owner)

@cchan can you share intuition on why this makes so much difference?

@cchan (Contributor, Author) commented Jan 20, 2023

@karpathy it's just that after you run backward() and before you run zero_grad(set_to_none=True), your gradient tensors are still taking up space in memory. Importantly, the gradients are still around at the moment you've finished the forward pass, which is the most memory-intensive point because of all the activations that need to be kept around for the gradient computation later.

Here's a bad diagram to mull over :) this is actually almost exactly what the real memory profile for minGPT looks like.

[Diagram: memory usage over a training step in minGPT]

As a side note, if you think about it you can also eliminate the purple memory almost entirely by running the optimizer update on each gradient tensor immediately as it's produced during the backward pass (of course that doesn't work with gradient accumulation). But that doesn't improve peak memory usage, which is usually what matters.
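The timeline in the diagram can be sketched with a toy accounting (the unit costs below are made up purely for illustration, not profiler numbers): if stale gradients survive until the next forward pass, they add to the activation peak; if they are freed right after the optimizer step, they don't.

```python
# Toy accounting of resident memory at the end of the forward pass,
# which is the usual peak. Sizes are hypothetical, in arbitrary units.
PARAMS, ACTS, GRADS = 4, 6, 4

def peak_during_forward(grads_already_freed: bool) -> int:
    """Resident memory at the end of the forward pass."""
    resident = PARAMS + ACTS        # weights plus saved activations
    if not grads_already_freed:
        resident += GRADS           # stale gradients from the previous step
    return resident

# zero_grad delayed until after the next forward: grads overlap the peak.
late = peak_during_forward(grads_already_freed=False)
# zero_grad(set_to_none=True) right after optimizer.step(): no overlap.
early = peak_during_forward(grads_already_freed=True)
assert (early, late) == (10, 14)
```

The absolute numbers are meaningless; the point is only that the gradient term drops out of the peak when the grads are released before, rather than after, the next forward pass.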

@karpathy (Owner)

I'm not sure if I follow because my original code zero_grads right before the forward pass, so those gradients should be gone during the forward pass.

@cchan (Contributor, Author) commented Jan 20, 2023

Ah, for nanoGPT you're right. In minGPT the zero_grad sits right after the forward pass, but not here.

My habit is just to zero_grad(set_to_none=True) immediately after the optimizer, since you might as well get rid of gradients ASAP.
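That ordering can be shown with a minimal PyTorch sketch (a hypothetical toy model, not nanoGPT's actual training loop):

```python
# Sketch: free gradient memory immediately after the optimizer step,
# rather than waiting until just before the next forward pass.
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()                   # gradient tensors allocated here
opt.step()                        # gradients consumed by the update...
opt.zero_grad(set_to_none=True)   # ...and released right away

# All .grad fields are now None, so that memory is already free
# during the next forward pass (the peak-memory moment).
assert all(p.grad is None for p in model.parameters())
```

With `set_to_none=True` the `.grad` fields are actually deallocated instead of being filled with zeros, which is what makes the memory available during the next forward pass.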

@karpathy (Owner)

ok that makes more sense :) yes I agree with you, it feels much better and safer to free it right away once it's no longer needed.

@karpathy merged commit 3611338 into karpathy:master on Jan 20, 2023
@vgoklani

Hey @cchan, where did you get that diagram from? Could you please share the URL? Thanks!

@cchan (Contributor, Author) commented Jan 21, 2023

@vgoklani I drew it on an MS Paint-like website in about 10 minutes lol, why?

@vgoklani

@cchan it's a pretty professional-looking drawing!

I was looking for some info about optimizing memory usage during training...

3 participants