CUDA error during training #5

Open
Facebear-ljx opened this issue Dec 2, 2023 · 0 comments


Hi,

Thanks for sharing the code and the model!

I ran into a CUDA error when finetuning on the realrobot dataset. The error is intermittent: some runs fail with the traceback below, while other runs with the same overrides do not hit it.

Error executing job with overrides: ['training=finetune', 'dataset=realrobot']
Traceback (most recent call last):
  File "/home/dodo/ljx/LIV/liv/train_liv.py", line 194, in main
    workspace.train()
  File "/home/dodo/ljx/LIV/liv/train_liv.py", line 100, in train
    metrics, st = trainer.update(self.model, batch, self.global_step)
  File "/home/dodo/ljx/LIV/liv/trainer.py", line 122, in update
    model.module.encoder_opt.step()
  File "/home/dodo/miniconda3/envs/liv-env/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/dodo/miniconda3/envs/liv-env/lib/python3.9/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/dodo/miniconda3/envs/liv-env/lib/python3.9/site-packages/torch/optim/adam.py", line 218, in step
    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
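
The last line of the log suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the traceback is more likely to point at the call that actually failed. A minimal sketch of such a rerun follows, assuming the script is launched as `train_liv.py` with the two overrides shown in the log. The helper names `blocking_cuda_env` and `build_command` are hypothetical and not part of the LIV repository.

```python
import os
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set before the process initializes CUDA,
    # so it is passed through the child environment here.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_command(script="train_liv.py"):
    """Build the launch command; the script path is an assumption."""
    return [sys.executable, script, "training=finetune", "dataset=realrobot"]


# Possible usage (not executed here):
#   import subprocess
#   subprocess.run(build_command(), env=blocking_cuda_env())
```

Synchronous launches slow training down, so this is probably only worth enabling while chasing the error. Because the failure is intermittent, several runs may be needed before it reproduces.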