Turn off debugger hooks in PyTorch? #401
Comments
Uninstalling smdebug from the image worked. Upgrading smdebug to the latest version did not work.
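To confirm whether smdebug is still importable inside a container, for example after uninstalling it from the image, a standard-library check along these lines may help. This is only a minimal sketch; the helper name `module_available` is hypothetical and not part of smdebug or SageMaker.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when no top-level module with this name is found.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run inside the training image to see whether the package is still present.
    print("smdebug importable:", module_available("smdebug"))
```

Running this in the training container before launching a job shows whether the package was actually removed.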
Hi @austinmw. Thanks for reporting this. Can you please provide an example script to reproduce it?
I used the custom framework paradigm, so it's a package of about 30 Python files that I'm unable to share. I could try to put together a minimal reproducible example, but it would take a while. The underlying model is partially based on https://github.com/ifzhang/FairMOT (using an HRNet-18 backbone).
I have the same problem with PyTorch in SageMaker. My code works correctly on my local machine, but the problem appears in SageMaker.

AttributeError Traceback (most recent call last)
~/face_estimation_deployment/Invoke_AI.py in __init__(self, img_path, cp_path, iSpath, isCuda)
~/face_estimation_deployment/Invoke_AI.py in __check_value(self)
~/face_estimation_deployment/segment_Face.py in checker(self)
~/face_estimation_deployment/segment_Face.py in __start_PROCESS(self)
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
/opt/conda/lib/python3.6/site-packages/torch/utils/smdebug.py in get_smdebug_hook()
/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/__init__.py in <module>
/opt/conda/lib/python3.6/site-packages/smdebug/trials/__init__.py in <module>
/opt/conda/lib/python3.6/site-packages/smdebug/trials/local_trial.py in <module>
/opt/conda/lib/python3.6/site-packages/smdebug/core/index_reader.py in <module>
/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/tensor_reader.py in <module>
/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_reader.py in <module>
AttributeError: module 'smdebug' has no attribute 'core'
Hi, I'm training with SageMaker using a custom Docker image that I created by extending the PyTorch 1.6 training image with additional SSH settings for Horovod. I'm also using a custom estimator, which I created by subclassing the PyTorch estimator and adding a distribution parameter and a configuration method.
When I launch single-node training everything works fine, but when I attempt to launch multi-node training, I get smdebug hook errors, even though I did not set any debugger rules. How can I turn off this functionality altogether?
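For reference, estimators in the SageMaker Python SDK accept a `debugger_hook_config` argument, and passing `False` asks SageMaker not to set up the Debugger hook for the job. The sketch below only builds the keyword arguments; the helper name `with_debugger_disabled`, the entry point, and the instance settings are hypothetical placeholders, and the thread does not confirm whether this setting alone avoids the multi-node errors described here.

```python
from typing import Any, Dict


def with_debugger_disabled(base: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the estimator arguments with the Debugger hook turned off."""
    kwargs = dict(base)  # copy so the caller's dict is not modified
    kwargs["debugger_hook_config"] = False
    return kwargs


# Hypothetical job settings; replace with your own values.
# (Older SDK versions use train_instance_count / train_instance_type instead.)
job_args = with_debugger_disabled(
    {
        "entry_point": "train.py",
        "instance_count": 2,
        "instance_type": "ml.p3.16xlarge",
    }
)

# The arguments would then be passed to the estimator, for example:
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(role=..., framework_version="1.6.0", py_version="py3", **job_args)
```

A subclassed estimator like the one described above should accept the same argument, provided it forwards its keyword arguments to the parent constructor.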