Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Op type not registered 'HbGetNcclId' in binary #159

Open
ZhuYuJin opened this issue Mar 28, 2024 · 0 comments
Open

Op type not registered 'HbGetNcclId' in binary #159

ZhuYuJin opened this issue Mar 28, 2024 · 0 comments

Comments

@ZhuYuJin
Copy link

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Op type not registered 'HbGetNcclId' in binary running on deeprec_with_io-hb. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. while building NodeDef 'collective_id_rpc_broadcast/replicas/0/collective_id_rpc_broadcast/replicas/0/GetCollectiveId'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "clk_model/clk_model_v4_gpu_debug.py", line 586, in
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "clk_model/clk_model_v4_gpu_debug.py", line 572, in main
model.run()
File "clk_model/clk_model_v4_gpu_debug.py", line 498, in run
classifier.train_and_evaluate(train_spec, eval_spec)
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 334, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
getattr(self, task_to_run)()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 652, in run_chief
return self._start_distributed_training()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 829, in _start_distributed_training
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 211, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 373, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1229, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1554, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 129, in HybridBackendMonitoredTrainingSession
sess = fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 660, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 64, in init
session_creator, hooks, should_recover=True, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 802, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1290, in init
_WrappedSession.init(self, self._create_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1295, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 955, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 724, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Op type not registered 'HbGetNcclId' in binary running on deeprec_with_io-hb. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. while building NodeDef 'collective_id_rpc_broadcast/replicas/0/collective_id_rpc_broadcast/replicas/0/GetCollectiveId'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant