duplicated calculation in gradient #91

MaryAhn · 2023-11-15T14:36:08Z

Hi, I have tried to run the code according to Usage in this repo:
```python
args = parse_args()
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.num_gpus = num_gpus
args.distributed = num_gpus > 1
if torch.cuda.is_available():
    cudnn.benchmark = False
    args.device = "cuda"
else:
    args.distributed = False
    args.device = "cpu"
if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()

train_loader = get_loader(args=args)

model = get_model(args)
learner = SelfSupervisedLearner(
    model,
    image_size=480,
    hidden_layer='module.avgpool',
    projection_size=256,
    projection_hidden_size=args.hidden_size,
    moving_average_decay=0.99
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir):
    os.makedirs(args.model_dir)

for _ in range(args.epochs):
    for idx, images in enumerate(train_loader):
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)
        loss = learner(images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        learner.update_moving_average()  # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './improved-net.pt')
```

However, when I run this code with distributed training, I get the following error message repeatedly during backward():

Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I also tried detach().clone() instead of detach() in byol_pytorch.py, but I got the same error. Even after setting torch.autograd.set_detect_anomaly(True), I could not work out the cause. Could you tell me which part of this code triggers the problem? Thanks in advance.


lucidrains · 2023-11-15T15:05:34Z

@MaryAhn hey Keonhee, actually not too sure

do you want to give 0.7.1 a quick try? if it still does not work, i'll debug it once i get access to my multi-GPU setup

MaryAhn · 2023-11-15T15:07:57Z

Yes, I would like to try 0.7.1. My email is [email protected].

lucidrains · 2023-11-15T15:08:48Z

@MaryAhn just pip install byol-pytorch -U for 0.7.1

MaryAhn · 2023-11-16T03:17:45Z

I installed 0.7.1 with the command you gave, but it still does not work. If you find the problem and a solution while debugging, please let me know.

lucidrains · 2023-11-16T03:20:51Z

after setting torch.autograd.set_detect_anomaly(True), do you see a different error trace? could you paste that trace if so?
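For readers unfamiliar with this class of error, here is a minimal CPU-only sketch that triggers the same RuntimeError. It is not taken from byol-pytorch; the tensors `w` and `h` are made up for illustration. With anomaly detection enabled, PyTorch additionally prints a warning containing the forward-pass stack of the operation whose saved tensor was changed, which is the kind of trace being asked for here.

```python
import torch

# Anomaly mode records the forward-pass stack of every autograd node.
torch.autograd.set_detect_anomaly(True)

w = torch.ones(3, requires_grad=True)
h = w.exp()   # exp() keeps its output h for use in the backward pass
h.add_(1)     # in-place update bumps h's version counter

message = ""
try:
    h.sum().backward()  # exp's backward finds h at an unexpected version
except RuntimeError as err:
    message = str(err)

torch.autograd.set_detect_anomaly(False)
print(message)
```

The printed message names the shape and version of the modified tensor, which is how the `[2048]` tensor in the report above can be matched to a specific layer.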

MaryAhn · 2023-11-16T03:25:52Z

I set detect anomaly, and the error message is:
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 129, in <module>
    loss = learner(images)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 264, in forward
    online_proj_one, _ = self.online_encoder(image_one)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 165, in forward
    representation = self.get_representation(x)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 157, in get_representation
    _ = self.net(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 276, in _forward_impl
    x = self.layer4(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 155, in forward
    out = self.bn3(out)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/functional.py", line 2478, in batch_norm
    return torch.batch_norm(
 (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I think the backward process of online encoder or online predictor does not work appropriately, but I'm not sure.

lucidrains · 2023-11-16T03:51:16Z

want to try 0.7.2? it may or may not do anything

i can try to debug this once i get back on my multi-gpu machine

MaryAhn · 2023-11-16T03:53:54Z

Yes, I want to try. Which one should I use: pip install byol-pytorch==0.7.2 or pip install byol-pytorch -U?

lucidrains · 2023-11-16T03:54:20Z

@MaryAhn both should work

MaryAhn · 2023-11-16T03:55:50Z

After installing it, the same error still occurs. Please let me know what you find once you have debugged this issue. Thank you.

lucidrains · 2023-11-16T04:02:55Z

@MaryAhn i see you are using a custom script

are you not using pytorch lightning? there's a setting in there to replace batchnorms in your resnet with sync batchnorms

lucidrains · 2023-11-16T04:04:32Z

otherwise, this could also be related to an ongoing pytorch issue, and you could try setting broadcast_buffers = False for DistributedDataParallel
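In a custom script, the two suggestions above (sync batchnorm and `broadcast_buffers = False`) could look roughly like the sketch below. The helper names `to_sync_batchnorm` and `wrap_ddp` are hypothetical, and the toy network only stands in for the torchvision ResNet from the trace. `wrap_ddp` is defined but not called here, since it assumes `torch.distributed.init_process_group()` has already run and a GPU is available. Whether either change resolves the error in this thread is not confirmed.

```python
import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def to_sync_batchnorm(model: nn.Module) -> nn.Module:
    """Replace every BatchNorm*d layer with SyncBatchNorm, keeping its weights."""
    return nn.SyncBatchNorm.convert_sync_batchnorm(model)


def wrap_ddp(model: nn.Module, local_rank: int) -> DDP:
    """Wrap a model in DDP; assumes the process group is already initialized."""
    return DDP(
        model.cuda(local_rank),
        device_ids=[local_rank],
        # Skip the per-forward buffer broadcast, which writes in place to
        # buffers such as the batchnorm running statistics.
        broadcast_buffers=False,
    )


# Toy stand-in for the ResNet (hypothetical, for illustration only).
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
toy = to_sync_batchnorm(toy)
print(type(toy[1]).__name__)  # SyncBatchNorm
```

The conversion has to happen before the DDP wrap. In PyTorch Lightning, the equivalent switch is the `sync_batchnorm=True` argument of `Trainer`.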

lucidrains · 2023-11-16T04:07:26Z

you shouldn't need to write the training loop as in your initial comment if you just modify this file and run trainer.fit()

try using that lightning script as is with your resnet, and if the issue persists, then that would tell me a lot

lucidrains · 2023-11-16T15:22:30Z

would it help if i offered a huggingface accelerate version? i find accelerate much more hackable

lucidrains · 2023-11-16T19:20:56Z

@MaryAhn try 0.8.0 following the instructions here
