duplicated calculation in gradient #91

Open
MaryAhn opened this issue Nov 15, 2023 · 15 comments

Comments

@MaryAhn

MaryAhn commented Nov 15, 2023

Hi, I have tried to run the code according to Usage in this repo:
```python
import os

import torch
import torch.backends.cudnn as cudnn

# parse_args, get_loader, get_model, synchronize and SelfSupervisedLearner
# are defined elsewhere in my script (train_custom.py) and are not shown here.

args = parse_args()
num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
args.num_gpus = num_gpus
args.distributed = num_gpus > 1
if torch.cuda.is_available():
    cudnn.benchmark = False
    args.device = "cuda"
else:
    args.distributed = False
    args.device = "cpu"
if args.distributed:
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()

train_loader = get_loader(args=args)

model = get_model(args)
learner = SelfSupervisedLearner(
    model,
    image_size=480,
    hidden_layer='module.avgpool',
    projection_size=256,
    projection_hidden_size=args.hidden_size,
    moving_average_decay=0.99
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

if not os.path.exists(args.model_dir):
    os.makedirs(args.model_dir)

for _ in range(args.epochs):
    for idx, images in enumerate(train_loader):
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)
        loss = learner(images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        learner.update_moving_average()  # update moving average of target encoder

# save your improved network
torch.save(model.state_dict(), './improved-net.pt')
```

However, after running this code with distributed training, I repeatedly get the following error during backward():

Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

(The same traceback appears three times in the output.)

I also tried detach().clone() instead of detach() in byol_pytorch.py, but got the same error. Even with torch.autograd.set_detect_anomaly(True), I could not work out the cause. Could you let me know which part of this code triggers the problem? Thanks in advance.
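
For readers unfamiliar with this class of error, the following is a minimal CPU-only sketch that triggers the same RuntimeError on a toy tensor. It is not the code path in byol_pytorch.py, and the function name `inplace_error_message` is made up for illustration. It only shows the general mechanism: an operation saves a tensor for its backward pass, and a later in-place edit to that tensor makes the saved copy stale.

```python
import torch


def inplace_error_message() -> str:
    """Trigger the in-place autograd error on a toy tensor and return its message."""
    # Anomaly mode has to cover the forward pass so autograd can record
    # where each operation was created.
    with torch.autograd.set_detect_anomaly(True):
        x = torch.ones(3, requires_grad=True)
        y = x.exp()   # exp() saves its output for use in backward
        y.add_(1)     # in-place edit makes that saved output stale
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return ""


message = inplace_error_message()
print(message)
```

With anomaly detection enabled, PyTorch also emits a warning containing the forward-pass stack of the operation whose backward failed, which is the kind of extra trace requested later in this thread.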

@lucidrains
Owner

lucidrains commented Nov 15, 2023

@MaryAhn hey Keonhee, actually not too sure

do you want to give 0.7.1 a quick try? if it still does not work, i'll debug it once i get access to my multi-GPU setup

@MaryAhn
Author

MaryAhn commented Nov 15, 2023

Yes, I would like to try 0.7.1. My email is [email protected].

@lucidrains
Owner

@MaryAhn just pip install byol-pytorch -U for 0.7.1

@MaryAhn
Author

MaryAhn commented Nov 16, 2023

I installed 0.7.1 with the command you gave, but it still does not work. If you find the problem and a solution through debugging, please let me know.

@lucidrains
Owner

after setting torch.autograd.set_detect_anomaly(True), do you see a different error trace? could you paste that trace if so?

@MaryAhn
Author

MaryAhn commented Nov 16, 2023

I enabled anomaly detection, and the error message is:
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 129, in <module>
    loss = learner(images)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 264, in forward
    online_proj_one, _ = self.online_encoder(image_one)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 165, in forward
    representation = self.get_representation(x)
  File "/home/cvlab/keonhee/byol-pytorch/byol_pytorch/byol_pytorch.py", line 157, in get_representation
    _ = self.net(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 276, in _forward_impl
    x = self.layer4(x)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torchvision/models/resnet.py", line 155, in forward
    out = self.bn3(out)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/nn/functional.py", line 2478, in batch_norm
    return torch.batch_norm(
 (Triggered internally at /opt/conda/conda-bld/pytorch_1695392035629/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Traceback (most recent call last):
  File "/home/cvlab/keonhee/byol-pytorch/train_custom.py", line 131, in <module>
    loss.backward()
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/cvlab/anaconda3/envs/sc1/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 7; expected version 6 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

(The same traceback appears twice in the output.)

I think the backward pass of the online encoder or online predictor is not working properly, but I'm not sure.
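
As background on the "is at version 7; expected version 6" wording: autograd keeps a per-tensor version counter that is bumped by every in-place operation, and it compares the counter at backward time with the value recorded when the tensor was saved. The sketch below illustrates the counter on a plain CPU tensor. It reads the internal `_version` attribute, which is useful for debugging but is not a stable public API, and the variable names are only illustrative.

```python
import torch

buf = torch.zeros(4)
v_start = buf._version          # counter right after creation

buf.add_(1.0)                   # one in-place op
v_after_one = buf._version

buf.mul_(0.9).add_(0.1)         # two more in-place ops
v_after_three = buf._version

_ = buf + 1.0                   # out-of-place op leaves buf untouched
v_after_out_of_place = buf._version

print(v_start, v_after_one, v_after_three, v_after_out_of_place)
```

A mismatch of exactly one, as in the trace above, suggests a single extra in-place write to the [2048]-sized tensor between the forward pass that saved it and the backward pass.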

@lucidrains
Owner

want to try 0.7.2? it may or may not do anything

i can try to debug this once i get back on my multi-gpu machine

@MaryAhn
Author

MaryAhn commented Nov 16, 2023

Yes, I want to try it. Which should I use: pip install byol-pytorch==0.7.2 or pip install byol-pytorch -U?

@lucidrains
Owner

@MaryAhn both should work

@MaryAhn
Author

MaryAhn commented Nov 16, 2023

After installing it, the same error still occurs. Once you have debugged it, please let me know what you find. Thank you.

@lucidrains
Owner

@MaryAhn i see you are using a custom script

are you not using pytorch lightning? there's a setting in there to replace batchnorms in your resnet with sync batchnorms
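
For a custom script without Lightning, a rough equivalent of that setting is PyTorch's own `torch.nn.SyncBatchNorm.convert_sync_batchnorm`. The sketch below applies it to a toy network standing in for the resnet. The helper name `to_sync_batchnorm` and the toy model are assumptions for illustration, and whether this resolves the error in this issue is not confirmed anywhere in the thread.

```python
import torch.nn as nn


def to_sync_batchnorm(model: nn.Module) -> nn.Module:
    """Replace every BatchNorm layer in the model with SyncBatchNorm."""
    return nn.SyncBatchNorm.convert_sync_batchnorm(model)


# Toy stand-in for the resnet backbone used in the issue.
toy = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU())
converted = to_sync_batchnorm(toy)
print(converted)
```

The conversion itself needs no process group, but running a converted model in training mode does, so the call would normally go after `init_process_group` and before wrapping the model in DistributedDataParallel.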

@lucidrains
Owner

otherwise, this could also be related to an ongoing pytorch issue, and you could try setting broadcast_buffers = False for DistributedDataParallel
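
A minimal sketch of that second suggestion, assuming the wrapping happens in the user's own script (the model construction in `get_model` is not shown in this issue). The helper name `wrap_for_ddp` is hypothetical. `broadcast_buffers` is a real keyword argument of DistributedDataParallel. With it set to False, DDP no longer re-broadcasts buffers such as BatchNorm running statistics at the start of each forward call.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(module: torch.nn.Module, local_rank: int) -> DDP:
    """Wrap a module that already lives on cuda:<local_rank>.

    Must be called after torch.distributed.init_process_group().
    """
    return DDP(
        module,
        device_ids=[local_rank],
        output_device=local_rank,
        broadcast_buffers=False,  # skip per-forward buffer synchronization
    )
```

One trade-off to keep in mind: with buffer broadcasting off, each rank keeps its own BatchNorm running statistics, so the ranks can drift apart unless SyncBatchNorm or a manual sync is used.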

@lucidrains
Owner

lucidrains commented Nov 16, 2023

you shouldn't need to write the training loop as in your initial comment if you just modify this file and run trainer.fit()

try using that lightning script as is with your resnet, and if the issue persists, then that would tell me a lot

@lucidrains
Owner

would it help if i offered a huggingface accelerate version? i find accelerate much more hackable

@lucidrains
Owner

@MaryAhn try 0.8.0 following the instructions here
