duplicated calculation in gradient #91
Comments
@MaryAhn hey Keonhee, actually not too sure. do you want to give 0.7.1 a quick try? if it still does not work, i'll debug it once i get access to my multi-GPU setup
Yes, I would like to try 0.7.1. My email is [email protected].
@MaryAhn just
I installed 0.7.1 with the command you gave; however, it still does not work. If you find the problem and a solution while debugging, please let me know.
after setting
I enabled anomaly detection, and the error message is:
I think the backward pass through the online encoder or the online predictor does not work properly, but I'm not sure.
want to try 0.7.2? it may or may not do anything. i can try to debug this once i get back on my multi-gpu machine
Yes, I want to try.
@MaryAhn both should work
After installation, the same error still occurs. Once you've had a chance to debug, please let me know what you find. Thank you.
@MaryAhn i see you are using a custom script. are you not using pytorch lightning? there's a setting in there to replace the batchnorms in your resnet with sync batchnorms
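for a custom script without lightning, the same swap can be done manually with torch.nn.SyncBatchNorm.convert_sync_batchnorm; a minimal sketch (not code from this repo, the torchvision resnet is just a stand-in backbone):

```python
import os

import torch
from torchvision.models import resnet50

# minimal sketch, not code from this repo: convert every BatchNorm layer in the
# backbone to SyncBatchNorm *before* wrapping it in DistributedDataParallel, so
# batch statistics are synchronized across processes. assumes the process group
# is already initialized and torchrun-style env vars are set.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
backbone = resnet50()
backbone = torch.nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
backbone = torch.nn.parallel.DistributedDataParallel(
    backbone.cuda(local_rank), device_ids=[local_rank]
)
```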
otherwise, this could also be related to an ongoing pytorch issue, and you could try setting
you shouldn't need to write the training loop as in your initial comment if you just modify this file and run it. try using that lightning script as is with your resnet, and if the issue persists, that would tell me a lot
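roughly, the lightning route boils down to the sketch below (argument names vary across lightning versions, and `learner` here stands in for the LightningModule wrapper in that example script):

```python
import pytorch_lightning as pl

# hedged sketch, not the exact example script: the Trainer sets up DDP and can
# swap BatchNorm layers for SyncBatchNorm via the sync_batchnorm flag
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,             # number of GPUs, placeholder
    strategy="ddp",
    sync_batchnorm=True,   # replace BatchNorm with SyncBatchNorm across processes
    max_epochs=100,        # placeholder
)
trainer.fit(learner, train_loader)  # learner: the LightningModule wrapper around BYOL
```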
would it help if i offered a huggingface accelerate version? i find accelerate much more hackable
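for a rough idea, an accelerate version of your loop would look something like this sketch (untested; learner, opt and train_loader are the objects from your original post):

```python
from accelerate import Accelerator

# hedged sketch of an accelerate-based loop, not an official example from this
# repo; learner / opt / train_loader come from the custom loop in the original post
accelerator = Accelerator()
learner, opt, train_loader = accelerator.prepare(learner, opt, train_loader)

for images in train_loader:
    loss = learner(images)           # accelerate already placed the batch on the right device
    opt.zero_grad()
    accelerator.backward(loss)       # replaces loss.backward()
    opt.step()
    # unwrap in case prepare() wrapped the learner in DDP
    accelerator.unwrap_model(learner).update_moving_average()
```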
Hi, I have tried to run the code following the Usage section of this repo:
```python
import os

import torch
import torch.backends.cudnn as cudnn

# parse_args, synchronize, get_loader, get_model and SelfSupervisedLearner
# are defined elsewhere in my custom script (train_custom.py)

args = parse_args()
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.num_gpus = num_gpus
args.distributed = num_gpus > 1

if torch.cuda.is_available():
    cudnn.benchmark = False
    args.device = "cuda"
else:
    args.distributed = False
    args.device = "cpu"

if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()

train_loader = get_loader(args=args)
model = get_model(args)

learner = SelfSupervisedLearner(
    model,
    image_size=480,
    hidden_layer='module.avgpool',
    projection_size=256,
    projection_hidden_size=args.hidden_size,
    moving_average_decay=0.99
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir):
    os.makedirs(args.model_dir)

for _ in range(args.epochs):
    for idx, images in enumerate(train_loader):
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)
        loss = learner(images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        learner.update_moving_average()  # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './improved-net.pt')
```
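(The `module.` prefix in `hidden_layer='module.avgpool'` suggests that `get_model` returns the backbone already wrapped in `DistributedDataParallel`, which prefixes submodule names with `module.`. As a hedged guess, not the actual helper, `get_model` does something like the following.)

```python
import torch
from torchvision.models import resnet50

def get_model(args):
    # hedged guess at the helper, inferred from hidden_layer='module.avgpool';
    # DistributedDataParallel prefixes submodule names with 'module.'
    backbone = resnet50().cuda(args.local_rank)
    if args.distributed:
        backbone = torch.nn.parallel.DistributedDataParallel(
            backbone, device_ids=[args.local_rank]
        )
    return backbone
```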
However, after running this code with distributed training, I got the following error message repeated during backward():
```
Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
(The same traceback is printed once per distributed process.)
I used detach().clone() instead of detach() in byol_pytorch.py, but I got the same error. Even after setting torch.autograd.set_detect_anomaly(True), I could not figure out the cause. Would you let me know what part of this code invokes the problem? Thanks in advance.
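For reference, anomaly detection can be enabled globally or scoped to the failing step; a minimal sketch using the standard PyTorch API (not my exact code):

```python
import torch

# enable anomaly detection globally ...
torch.autograd.set_detect_anomaly(True)

# ... or scope it to the suspect region, so the forward pass that created the
# offending tensor is recorded and reported alongside the backward traceback
with torch.autograd.detect_anomaly():
    loss = learner(images)
    loss.backward()
```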