
Zero-grad more aggressively to save memory #71

Merged (1 commit) on Jan 20, 2023

Conversation

@cchan (Contributor) commented Jan 20, 2023

Same as karpathy/minGPT#106 - sorry for the spam, am actually using both of these repos (they've been insanely useful <3)

@karpathy (Owner)

@cchan can you share intuition on why this makes so much difference?

@cchan (Contributor, Author) commented Jan 20, 2023

@karpathy it's just that after you run backward() and before you run zero_grad(set_to_none=True), your gradient tensors are still taking up space in memory. Importantly, the gradients are still around at the moment you've finished the forward pass, which is the most memory-intensive point because of all the activations that need to be kept around for the gradient computation later.

Here's a bad diagram to mull over :) this is actually almost exactly what the real memory profile for minGPT looks like.

[Diagram: memory usage over a training step in minGPT]

As a side note, if you think about it you can also eliminate the purple memory almost entirely by running the optimizer update on each gradient tensor immediately as it's produced during the backward pass (of course that doesn't work with gradient accumulation). But that doesn't improve peak memory usage, which is usually what matters.
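The timeline in the diagram can be sketched with a toy accounting (the unit costs below are made up purely for illustration, not profiler numbers): if stale gradients survive until the next forward pass, they add to the activation peak; if they are freed right after the optimizer step, they don't.

```python
# Toy accounting of resident memory at the end of the forward pass,
# which is the usual peak. Sizes are hypothetical, in arbitrary units.
PARAMS, ACTS, GRADS = 4, 6, 4

def peak_during_forward(grads_already_freed: bool) -> int:
    """Resident memory at the end of the forward pass."""
    resident = PARAMS + ACTS        # weights plus saved activations
    if not grads_already_freed:
        resident += GRADS           # stale gradients from the previous step
    return resident

# zero_grad delayed until after the next forward: grads overlap the peak.
late = peak_during_forward(grads_already_freed=False)
# zero_grad(set_to_none=True) right after optimizer.step(): no overlap.
early = peak_during_forward(grads_already_freed=True)
assert (early, late) == (10, 14)
```

The absolute numbers are meaningless; the point is only that the gradient term drops out of the peak when the grads are released before, rather than after, the next forward pass.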

@karpathy (Owner)

I'm not sure if I follow because my original code zero_grads right before the forward pass, so those gradients should be gone during the forward pass.

@cchan (Contributor, Author) commented Jan 20, 2023

Ah, for nanoGPT you're right. In minGPT the zero_grad sits right after the forward pass, but not here.

My habit is just to zero_grad(set_to_none=True) immediately after the optimizer, since you might as well get rid of gradients ASAP.
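That ordering can be shown with a minimal PyTorch sketch (a hypothetical toy model, not nanoGPT's actual training loop):

```python
# Sketch: free gradient memory immediately after the optimizer step,
# rather than waiting until just before the next forward pass.
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()                   # gradient tensors allocated here
opt.step()                        # gradients consumed by the update...
opt.zero_grad(set_to_none=True)   # ...and released right away

# All .grad fields are now None, so that memory is already free
# during the next forward pass (the peak-memory moment).
assert all(p.grad is None for p in model.parameters())
```

With `set_to_none=True` the `.grad` fields are actually deallocated instead of being filled with zeros, which is what makes the memory available during the next forward pass.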

@karpathy (Owner)

ok that makes more sense :) yes I agree with you, it feels much better and safer to free it right away once it's no longer needed.

@karpathy merged commit 3611338 into karpathy:master on Jan 20, 2023
@vgoklani

Hey @cchan, where did you get that diagram from? Could you please share the URL? Thanks!

@cchan (Contributor, Author) commented Jan 21, 2023

@vgoklani I drew it on an MS Paint-like website in about 10 minutes lol, why?

@vgoklani

@cchan it's a pretty professional-looking drawing!

I was looking for some info about optimizing memory usage during training...

3 participants