Gradient accumulate optimizer #2260
@tomerk please bring this up in ecosystem review, though I don't expect any conflicts. For the future, do you need us to tag anyone, or is the
@tomerk @bhack @seanpmorgan any update? :D
@bhack just a reminder :D
Check #2196 (comment)
Notes from ecosystem review: Rather than an optimizer that wraps another optimizer, we think this might actually make sense as an object to use as a gradient_transformer in optimizers (a new feature after an optimizer refactoring earlier this year): https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer Looping in @omalleyt12, who might have more insight / suggestions on how to do this. Depending on how it comes out, it might make sense in core, but addons seems like a good initial spot.
@omalleyt12 can you take a look?
@tomerk @bhack @omalleyt12 a gentle ping in case you missed my previous comment :D
We have a near-duplicate ticket at tensorflow/tensorflow#32176
@bhack it's been a long time; is there any plan for this feature in TF2?
It would be great to have such a param:
IMO, this feature is not needed, as we can implement gradient accumulation in a custom training loop.
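For reference, the custom-training-loop variant mentioned above can be sketched roughly as follows (an illustrative sketch, not code from this thread; the toy model, data shapes, and `accum_steps` value are arbitrary assumptions, and it ignores sparse gradients and distribution strategies):

```python
import tensorflow as tf

# Illustrative sketch of gradient accumulation in a custom training loop.
# Gradients from `accum_steps` micro-batches are averaged before a single
# optimizer update, approximating a batch `accum_steps` times larger.
inputs = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = tf.keras.Model(inputs, outputs)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()
accum_steps = 4

# One non-trainable accumulator variable per trainable weight.
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

def train_step(x, y, step):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum_grads, grads):
        acc.assign_add(g / accum_steps)  # running average of micro-batch grads
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        for acc in accum_grads:
            acc.assign(tf.zeros_like(acc))  # reset for the next cycle
    return loss

for step in range(8):
    x = tf.random.normal((2, 4))
    y = tf.random.normal((2, 1))
    train_step(x, y, step)
```

Here the effective batch size is `accum_steps` times the micro-batch size, while peak memory stays at the micro-batch level.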
@innat we need it in case we want to use the tf.keras model fit function :)
That should also be possible to achieve by overriding the train_step method.
overriding
If you want to plug and play, then try this: https://github.com/CyberZHG/keras-gradient-accumulation
Overriding the train_step method
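To illustrate the train_step-override approach discussed here, a minimal sketch (illustrative only, not this thread's exact code; the class name `GAModel`, the `accum_steps` default, and the loss handling are my assumptions, and it ignores compiled metrics, sparse gradients, and distribution strategies):

```python
import tensorflow as tf

# Illustrative sketch: subclass tf.keras.Model and override train_step so
# gradients are accumulated over `accum_steps` batches before one
# optimizer update. `GAModel` is a hypothetical name.
class GAModel(tf.keras.Model):
    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self._accum_grads = None
        self._accum_counter = tf.Variable(0, dtype=tf.int64, trainable=False)

    def train_step(self, data):
        x, y = data
        # Create one accumulator per trainable variable on the first step.
        if self._accum_grads is None:
            self._accum_grads = [
                tf.Variable(tf.zeros_like(v), trainable=False)
                for v in self.trainable_variables
            ]
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        for acc, g in zip(self._accum_grads, grads):
            acc.assign_add(g / self.accum_steps)  # average over accum_steps
        self._accum_counter.assign_add(1)
        if self._accum_counter % self.accum_steps == 0:
            self.optimizer.apply_gradients(
                zip(self._accum_grads, self.trainable_variables))
            for acc in self._accum_grads:
                acc.assign(tf.zeros_like(acc))  # reset accumulators
        return {"loss": loss}
```

A functional model can then be built as `GAModel(inputs, outputs, accum_steps=4)` and compiled and fit as usual.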
@innat I know
@dathudeptrai understood. It sounds great then. However, I'm facing some issues with implementing GA by customizing the update step. Solved: https://gist.github.com/innat/ba6740293e7b7b227829790686f2119c
The above gradient accumulation implementation doesn't work on TF 2.5 with a multi-GPU distribution strategy.
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import tensorflow as tf
from tensorflow_addons.utils import types
from typeguard import typechecked
class GradientAccumulator(tf.keras.optimizers.Optimizer):
"""Optimizer wrapper for gradient accumulation."""
@typechecked
def __init__(
self,
optimizer: types.Optimizer,
accum_steps: types.TensorLike = 4,
name: str = "GradientAccumulator",
**kwargs,
):
r"""Construct a new GradientAccumulator optimizer.
Args:
optimizer: str or `tf.keras.optimizers.Optimizer` that will be
used to compute and apply gradients.
accum_steps: int > 0. Update gradient in every accumulation steps.
name: Optional name for the operations created when applying
gradients. Defaults to "GradientAccumulator".
**kwargs: keyword arguments. Allowed to be {`clipnorm`,
`clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by
norm; `clipvalue` is clip gradients by value, `decay` is
included for backward compatibility to allow time inverse
decay of learning rate. `lr` is included for backward
compatibility, recommended to use `learning_rate` instead.
"""
super().__init__(name, **kwargs)
self._optimizer = tf.keras.optimizers.get(optimizer)
self._gradients = []
self._accum_steps = accum_steps
def _create_slots(self, var_list):
self._optimizer._create_slots(var_list=var_list)
for var in var_list:
self.add_slot(var, "ga")
self._gradients = [self.get_slot(var, "ga") for var in var_list]
@property
def gradients(self):
"""The accumulated gradients on the current replica."""
if not self._gradients:
raise ValueError(
"The accumulator should be called first to initialize the gradients"
)
return list(
gradient.read_value() if gradient is not None else gradient
for gradient in self._gradients
)
def apply_gradients(self, grads_and_vars, name=None, **kwargs):
self._optimizer._iterations = self.iterations
return super().apply_gradients(grads_and_vars, name, **kwargs)
def _resource_apply_dense(self, grad, var, apply_state=None):
accum_gradient = self.get_slot(var, "ga")
if accum_gradient is not None and grad is not None:
accum_gradient.assign_add(
grad, use_locking=self._use_locking, read_value=False
)
def _apply():
if "apply_state" in self._optimizer._dense_apply_args:
train_op = self._optimizer._resource_apply_dense(
accum_gradient.read_value(), var, apply_state=apply_state
)
else:
train_op = self._optimizer._resource_apply_dense(
accum_gradient.read_value(), var
)
reset_op = accum_gradient.assign(
tf.zeros_like(accum_gradient),
use_locking=self._use_locking,
read_value=False,
)
return tf.group(train_op, reset_op)
apply_op = tf.cond(
(self.iterations + 1) % self._accum_steps == 0, _apply, lambda: tf.no_op()
)
return apply_op
def _resource_apply_sparse(self, grad: types.TensorLike, var, indices, apply_state):
accum_gradient = self.get_slot(var, "ga")
if accum_gradient is not None and grad is not None:
self._resource_scatter_add(accum_gradient, indices, grad)
def _apply():
if "apply_state" in self._optimizer._sparse_apply_args:
train_op = self._optimizer._resource_apply_sparse(
accum_gradient.sparse_read(indices),
var,
indices,
apply_state=apply_state,
)
else:
train_op = self._optimizer._resource_apply_sparse(
accum_gradient.sparse_read(indices), var, indices
)
reset_op = accum_gradient.assign(
tf.zeros_like(accum_gradient),
use_locking=self._use_locking,
read_value=False,
)
return tf.group(train_op, reset_op)
apply_op = tf.cond(
(self.iterations + 1) % self._accum_steps == 0, _apply, lambda: tf.no_op()
)
return apply_op
def reset(self):
"""Resets the accumulated gradients on the current replica."""
assign_ops = []
if not self._gradients:
return assign_ops
for gradient in self._gradients:
if gradient is not None:
assign_ops.append(
gradient.assign(
tf.zeros_like(gradient),
use_locking=self._use_locking,
read_value=False,
)
)
return tf.group(assign_ops)
@property
def lr(self):
return self._optimizer._get_hyper("learning_rate")
@lr.setter
def lr(self, lr):
self._optimizer._set_hyper("learning_rate", lr)
@property
def learning_rate(self):
return self._optimizer._get_hyper("learning_rate")
@learning_rate.setter
def learning_rate(self, learning_rate):
self._optimizer._set_hyper("learning_rate", learning_rate)
def get_config(self):
config = {"accum_steps": self._accum_steps}
base_config = super().get_config()
return {**base_config, **config}
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=GradientAccumulator(tf.keras.optimizers.Adam(), accum_steps=4),
loss=loss_fn,
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)

Here is my implementation.
@fsx950223 Many thanks. If it's stable, could you make a pull request to support this feature for
I'm not sure; maybe you could test it.
When running this code on TF 2.5 Keras (run_eagerly=False) with MultiWorkerMirroredStrategy on 8 GPUs, the time to train is ~2 times slower than running without gradient accumulation (using the class with accum_steps=1). Do you know the reason for this 2x slowdown? The training time should have stayed the same.
Could you provide TensorFlow profiles?
I found the default setting is faster than
You could test the code in my PR, where I have fixed several bugs.
This seems to work just fine for training, but I see some strange behaviour when loading the trained model using tf.keras.models.load_model with compile=True. After training, I just load the model like so:
This results in:
Probably something silly I am missing here? Any ideas, @fsx950223? I just trained a simple CNN classifier, nothing fancy. It works fine if regular Adam is used. Tested using TF 2.8 and Python 3.8.10.
I managed to get it working by making the following modifications. It seems the optimizer argument is missing from get_config. I rewrote the config variable to:
and added Then, when feeding the Adam optimizer, I got that:
I believe that is because of the type hint. Lastly, in my example, the custom object should be I could post the final, clean version, if of interest. If you are planning on merging this into TF-addons, I could contribute to the PR.
Hi @andreped, I just saw your implementation here (https://github.com/andreped/GradientAccumulator). Great work! I hope it will work fine with a multi-GPU strategy :D
@dathudeptrai I figured it was time to make a solution available in TF 2 for myself and others who have been using similar solutions for accumulated gradients in TF1. My implementation is a derived version of @fsx950223's, which again is a modified version of the one mentioned above. Hence, all credit to him and the people who contributed to the PR #2525. His PR seems to have been closed without the current solution being made available for people to test, debug, and further expand upon. I took on the challenge of working on this project further, to see if I can get it working how I want. I would love for this to be added to TF/TF-addons in the future (when we have a working solution). Accumulated gradients are something I use in almost all my projects. EDIT: Currently, I am seeing some strange behaviour, where I am getting worse results with accumulated gradients (mini batch size=2, accum_steps=8 vs mini batch size 16) compared to regular batch training. I will try to solve that first, before I try to get multi-GPU training working properly. Also, a BatchNormalization layer that works with accumulated gradients would be a great contribution. Maybe it is possible to make a wrapper around BatchNormalization similar to what is done here.
Is there any known reason why we are not asking for this feature directly in TensorFlow or Keras? It is a very useful and well-recognized feature. It was asked HERE before, but closed without any valid reason.
@innat Good question. I think they concluded that Keras is the best place to add it; however, to add it there, it should work as intended, for all relevant use cases, and be stable. I'm not sure that is the case for the current version. AFAIK, for one, it does not work in multi-GPU strategy scenarios. However, that's why I believe TF-addons is a good alternative. Here one can place somewhat experimental solutions, which at least satisfy the largest group of users. I believe there are a lot of people using gradient accumulation in single-GPU scenarios: just working with large images/volumes/data in general, or working on low-end hardware. We have created our own wrapper-like implementations since TF 1.13.1, and it has worked wonders for our use cases. I have already asked them previously why they closed the PR #2525 (comment). No response yet. Why that PR was closed in the end, I believe, is because they were looking into a different way of solving the problem, which avoids the wrap-an-optimizer solution (even if it appears to work in single-GPU scenarios). But at least for now, I have made the current implementation more easily available to the public here, at least until a proper solution is integrated into TF-addons/Keras.
I still believe this technique is a good fit for the core Keras API. It would be nice to create a ticket in Keras, HERE, and discuss it further.
@innat I observed that you have been asking a question regarding GA on Stack Overflow and were given an answer. I just tested this implementation, and I get identical results using the proposed overload-train_step approach and the GradientAccumulator wrapping approach (which should be a lot easier to use). However, in my benchmark experiments I am unable to get the same test performance for the same number of updates with and without accumulated gradients, so I believe something is fundamentally wrong, AFAIK. I made a comment about this in a different thread (see here), with a link on how to reproduce this weird issue. Perhaps you could take a look? Any idea what is wrong? EDIT: See this thread to track this issue. I will update you if I manage to solve it.
@innat I have gotten the solution you were given on Stack Overflow working as expected. It yields the expected behaviour compared to the gradient wrapping solution (but I might be close to getting that one working as well). I have published a wheel you could try, if GA is still of interest to you. I will continue to expand upon this idea to make it more generic. See here for more information: #2525 (comment) and https://github.com/andreped/GradientAccumulator
@andreped I will. Thanks. (following)
Let's create a ticket on Keras if you want to contribute, or provide a starter file for an interested contributor.
Have you tested it on a multi-GPU setup as well? Is it working properly? cc @bhack: what are your thoughts on adding this (GA) technique to core Keras?
I was looking at some API design at:
As Keras is undergoing a quite important optimizer refactoring (#2706), it is better to open a ticket about GA in Keras. Probably Graphcore could be interested in contributing something there... who knows?
The optimizer wrapper solution does not, but I have not gotten around to testing the train_step overload approach. It might work; I can report what I find.
Yeah, I think so too. I can open a ticket soon to track this feature request.
I have opened a ticket now: keras-team/tf-keras#107. Let's move the discussion over there.
@dathudeptrai Just saw that you mentioned this, which is exactly what I did here. Did you have any problems with using it? Currently, it seems to work quite fine for my applications, but I have not checked very advanced situations and edge cases. It can also be noted that our opened ticket at Keras has gotten positive feedback, and it seems like they will add an API for doing gradient accumulation very soon :)
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision. Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA:
I would like to try gradient accumulation on a codebase that is not yet updated to Keras 3. I have been looking at this PR and #2525, and other places; it looked like this got merged into tensorflow_addons, but I don't see it in the last release.
@davidsc-unity3d Nothing ever got merged into tf2 or tf-addons. Only in Keras 3 was an official implementation for gradient accumulation ever added. I would recommend trying out this tool I made to do gradient accumulation in tf2: https://github.com/andreped/GradientAccumulator
Thanks Andre, that looks really simple to use!
best,
David Schneider
Describe the feature and the current behavior/state.
Hi, I think it would be good if someone could support a gradient accumulation optimizer for this repo; this feature is really helpful for those who train large models, such as BERT, with low resources. The usage should be similar to
tfa.optimizers.SWA
. There is an implementation of a gradient accumulator, but for a
custom training loop
rather than Keras model fit; see the link here.

Relevant information
custom training loop.
Which API type would this fall under (layer, metric, optimizer, etc.)?
optimizer
Who will benefit from this feature?
All TensorFlow users.
Any other info.