Summary:
When utilizing MultiWorkerMirroredStrategy in a distributed training setup, an IndexError is encountered during the execution of optimizer.apply_gradients(), specifically within the cross_device_ops component.
Issue Description:
During the training process, specifically at the point of executing optimizer.apply_gradients(), an IndexError is raised from the cross_device_ops component. This error disrupts the training workflow, preventing successful completion of the training process across multiple nodes.
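For orientation, the failing call pattern (as it appears in the traceback below) is roughly the following. This is a minimal hypothetical sketch, not the actual resnet_runnable.py code; the model, optimizer, and function names are placeholders.

import tensorflow as tf

# Hypothetical minimal sketch of the failing call pattern; the real code lives in
# _apply_grads_and_clear in resnet_runnable.py, and all names below are placeholders.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.build(input_shape=(None, 32))
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def apply_grads(grads):
    def step_fn(replica_grads):
        # apply_gradients() aggregates the gradients across replicas through
        # cross_device_ops; the reported IndexError surfaces inside that all-reduce.
        optimizer.apply_gradients(zip(replica_grads, model.trainable_variables))

    strategy.extended.call_for_each_replica(step_fn, args=(grads,))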
Reproduction Steps:
1. Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
2. Initialize MultiWorkerMirroredStrategy within the training script.
3. Execute the training script, which defines a model, compiles it, and calls model.fit() on the distributed dataset (a minimal sketch of steps 1-3 follows below).
4. Observe the IndexError raised during the optimizer.apply_gradients() call.
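For reference, the steps above correspond roughly to the following minimal sketch (toy model, placeholder hostnames, current tf.distribute API); the actual failing reproduction uses the MLPerf ResNet script shown next.

import json
import os

import tensorflow as tf

# Step 1: per-node TF_CONFIG. Hostnames are placeholders; each worker sets its own index.
# With unreachable hostnames, each worker will block waiting for the other to join.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host0:12345", "host1:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Step 2: initialize the strategy before building the model.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Step 3: define and compile the model under the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy in-memory dataset standing in for the distributed input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((64, 28, 28)),
     tf.random.uniform((64,), maxval=10, dtype=tf.int64))
).batch(32)

# Step 4: the IndexError is observed when gradients are applied during training.
model.fit(dataset, epochs=1)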
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$((128*$num_gpus*$num_workers)) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
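With num_gpus=8 and num_workers=2, the --batch_size flag above expands to 128 * 8 * 2 = 2048, i.e. a per-GPU batch of 128 across the 16 GPUs in the cluster.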
Expected Behavior:
The optimizer.apply_gradients() should execute without errors, allowing the training process to proceed correctly across all nodes in the cluster.
Observed Behavior:
An IndexError is raised during the optimizer.apply_gradients() call, originating from the cross_device_ops, which disrupts the training process.
Impact:
This issue prevents the successful execution of distributed training with MultiWorkerMirroredStrategy, hindering the scalability and efficiency of the training process across multiple nodes.
Traceback (most recent call last):
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in<module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
stats = run(flags.FLAGS)
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:
/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn *
step_fn(iterator)
/home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:350 _apply_grads_and_clear *
distribution.extended.call_for_each_replica(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica **
return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
return _call_for_each_replica(strategy, fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
coord.join(threads)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:389 join
six.reraise(*self._exc_info_to_raise)
/usr/local/lib/python3.6/dist-packages/six.py:703 reraise
raise value
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
yield
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
**merge_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:152 _all_reduce_sum_fn **
grads_and_vars)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2374 batch_reduce_to
return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:697 _batch_reduce_to
options=self._communication_options.merge(options))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:426 batch_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 batch_reduce_implementation
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 <listcomp>
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1050 reduce_implementation
options)[0]
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1103 _batch_all_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1142 _do_batch_all_reduce_dense
values_by_device[i].append(per_replica.values[i])
IndexError: tuple index out of range
Hello mlcommons team!
When using MultiWorkerMirroredStrategy, it has been observed that cross_device_ops raises an IndexError during optimizer.apply_gradients().
The call site in this repository is training/image_classification/tensorflow2/resnet_runnable.py, lines 323 to 324 in 87405ce, and the TensorFlow frame that raises is:
https://github.com/tensorflow/tensorflow/blob/64918868e2154b06c7479347a59a4230f785e9fa/tensorflow/python/distribute/cross_device_ops.py#L1140-L1142
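The frame that raises indexes per_replica.values[i] while filling a per-device list; the error message ("tuple index out of range") suggests the PerReplica gradient carries fewer components than there are local devices. A purely illustrative, standalone sketch of that indexing failure (this is not the TensorFlow source, only the indexing pattern from the linked lines) is:

# Hypothetical stand-alone illustration of the failure mode; NOT the TensorFlow source.
local_devices = ["/gpu:%d" % i for i in range(8)]  # 8 local replicas on one worker
per_replica_values = (0.5,)  # a gradient that carries only 1 component

values_by_device = [[] for _ in local_devices]
for i, _ in enumerate(local_devices):
    # Raises IndexError: tuple index out of range once i >= len(per_replica_values),
    # i.e. when a gradient has fewer components than there are local devices.
    values_by_device[i].append(per_replica_values[i])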
Refs:
local_replica_id with MultiWorkerMirroredStrategy #739