Hello mlcommons team!

Summary:
Distributed training works correctly with OneDeviceStrategy and MirroredStrategy. With MultiWorkerMirroredStrategy, however, local_replica_id is None, and the tensor ("while/cond/replica_id_in_sync_group:0") carries no concrete value.
Details:
Environment:
TensorFlow Version: 2.4.0
Cluster Setup: 2 nodes, 8 GPUs per node
Strategies Tested:
OneDeviceStrategy: Successful execution
MirroredStrategy: Successful execution
MultiWorkerMirroredStrategy: Fails with None for local_replica_id
Issue Description:
When using MultiWorkerMirroredStrategy, local_replica_id is never assigned and ends up as None, while the tensor ("while/cond/replica_id_in_sync_group:0") remains empty. This breaks synchronous training across the workers. The failing code is in training/image_classification/tensorflow2/resnet_runnable.py, lines 312 to 314 at commit 87405ce.
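The behavior can be seen outside the mlperf script with a minimal sketch (my own reduction, not the repository code; it assumes TF_CONFIG is already exported on both hosts as in the command further below):

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

@tf.function
def train_step():
    def step_fn():
        ctx = tf.distribute.get_replica_context()
        # Under OneDeviceStrategy/MirroredStrategy this id can be read as a
        # concrete value; under MultiWorkerMirroredStrategy it is only a
        # symbolic graph tensor ("replica_id_in_sync_group:0").
        tf.print("replica id:", ctx.replica_id_in_sync_group)
    strategy.run(step_fn)

train_step()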
Steps to Reproduce:
1. Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
2. Initialize MultiWorkerMirroredStrategy.
3. Execute the distributed training script (command below).
4. Observe that no valid local_replica_id is assigned and the tensor value is empty.
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$((128*$num_gpus*$num_workers)) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
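For readability, the same TF_CONFIG written from Python instead of the shell (an equivalent sketch; WORKER_ID is assumed to be exported as above):

import json
import os

# Must be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host0:12345", "host1:12345"]},
    "task": {"type": "worker", "index": int(os.environ["WORKER_ID"])},
})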
Expected Behavior:
The local_replica_id should be correctly assigned for each worker in the cluster, enabling proper synchronization and distributed training.
Observed Behavior:
The local_replica_id is None, and the tensor "while/cond/replica_id_in_sync_group:0" carries no usable value; indexing with it raises the TypeError shown in the logs below.
INFO:tensorflow:Error reported to Coordinator: tuple indices must be integers or slices, not NoneType
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 323, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 146, in _apply_grads_and_clear_for_each_replica
ag__.for_stmt(ag__.converted_call(ag__.ld(zip), (ag__.ld(self).accum_grads, ag__.ld(self).training_vars), None, fscope_3), None, loop_body, get_state_5, set_state_5, (), {'iterate_names': '(accum_grad, var)'})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 444, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 473, in _py_for_stmt
body(target)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 459, in protected_body
original_body(protected_iter)
File "/tmp/tmpsgvrk_5a.py", line 139, in loop_body
replica_accum_grad = ag__.ld(local_accum_grad)[ag__.ld(local_replica_id)]
TypeError: tuple indices must be integers or slices, not NoneType
INFO:tensorflow:Error reported to Coordinator: tuple indices must be integers or slices, not NoneType
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 228, in _call_for_each_replica
**merge_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 186, in _maybe_apply_grads_and_clear
ag__.converted_call(ag__.ld(tf).cond, (ag__.converted_call(ag__.ld(tf).equal, ((ag__.ld(self).optimizer.iterations % ag__.ld(self).num_accumulation_steps), (ag__.ld(self).num_accumulation_steps -1)), None, fscope_2), ag__.ld(_apply_grads_and_clear), ag__.ld(_advance_iteration)), None, fscope_2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 396, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 479, in _call_unconverted
return f(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1396, in cond_for_tf_v2
return cond(pred, true_fn=true_fn, false_fn=false_fn, strict=True, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1180, in cond
return cond_v2.cond_v2(pred, true_fn, false_fn, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/cond_v2.py", line 89, in cond_v2
op_return_value=pred)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 165, in _apply_grads_and_clear
ag__.converted_call(ag__.ld(distribution).extended.call_for_each_replica, (ag__.ld(_apply_grads_and_clear_for_each_replica),), dict(args=()), fscope_4)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 396, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 478, in _call_unconverted
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 629, in _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 93, in call_for_each_replica
return _call_for_each_replica(strategy, fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 234, in _call_for_each_replica
coord.join(threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 323, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 146, in _apply_grads_and_clear_for_each_replica
ag__.for_stmt(ag__.converted_call(ag__.ld(zip), (ag__.ld(self).accum_grads, ag__.ld(self).training_vars), None, fscope_3), None, loop_body, get_state_5, set_state_5, (), {'iterate_names': '(accum_grad, var)'})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 444, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 473, in _py_for_stmt
body(target)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 459, in protected_body
original_body(protected_iter)
File "/tmp/tmpsgvrk_5a.py", line 139, in loop_body
replica_accum_grad = ag__.ld(local_accum_grad)[ag__.ld(local_replica_id)]
TypeError: tuple indices must be integers or slices, not NoneType
tf.distribute.get_replica_context().replica_id_in_sync_group: Tensor("while/cond/replica_id_in_sync_group:0", shape=(), dtype=int32, device=/job:worker/replica:0/task:0/device:GPU:0)
local_replica_id: None
Traceback (most recent call last):
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in<module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
stats = run(flags.FLAGS)
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:
/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn *
step_fn(iterator)
/home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:328 _apply_grads_and_clear_for_each_replica *
replica_accum_grad = local_accum_grad[local_replica_id]
TypeError: tuple indices must be integers or slices, not NoneType
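A possible workaround would be to keep the id as a tensor and index with tf.gather instead of Python tuple indexing (a sketch only; num_local_replicas, the worker-major replica layout, and the accumulator structure are my assumptions about resnet_runnable.py, not verified):

import tensorflow as tf

def replica_slice(local_accum_grad, num_local_replicas):
    # local_accum_grad: tuple of same-shaped per-replica tensors for one
    # variable (assumed from the failing line above).
    ctx = tf.distribute.get_replica_context()
    # Global replica id modulo the per-worker replica count gives a local
    # index without ever materializing a Python int.
    local_id = ctx.replica_id_in_sync_group % num_local_replicas
    # tf.gather accepts a tensor index, unlike tuple indexing.
    return tf.gather(tf.stack(local_accum_grad), local_id)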
Impact:
This issue prevents successful distributed training with MultiWorkerMirroredStrategy, limiting the ability to scale training across multiple nodes.