The program crashes when I use the argument "--load_optimizer" #18

Open
nixonjin opened this issue Jun 26, 2022 · 2 comments

@nixonjin

I wanted to continue training the model with the saved optimizer, but the program crashed. The traceback is as follows:
Traceback (most recent call last):
  File "lgesql/text2sql.py", line 105, in
    optimizer.step()
  File "lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "lgesql/utils/optimization.py", line 220, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
RuntimeError: The size of tensor a (768) must match the size of tensor b (2) at non-singleton dimension 0

Have you encountered this problem, and how can I fix it?

@nixonjin
Author

More information:
When I comment out the line "optimizer.load_state_dict(check_point['optim'])", the program no longer crashes, but the training loss is much larger than the loss in the last epoch of the saved model.
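Roughly, the save/resume path I am describing looks like the self-contained sketch below. Only the checkpoint keys 'model' and 'optim' come from the actual code; the model, optimizer, and file name are illustrative:

```python
import torch

# Illustrative save/resume pattern; a tiny model stands in for the real one.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ... train for some epochs, then save ...
torch.save({'model': model.state_dict(), 'optim': optimizer.state_dict()},
           'checkpoint.bin')

# Resume: skipping the optimizer line below avoids the crash, but Adam then
# restarts with empty moment estimates, which is why the training loss jumps
# right after resuming.
check_point = torch.load('checkpoint.bin', map_location='cpu')
model.load_state_dict(check_point['model'])
optimizer.load_state_dict(check_point['optim'])
```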

@rhythmcao
Collaborator

Thanks a lot for pointing out this bug.

We also encountered this problem when loading from checkpoints. Honestly, we never used this interface to resume training from checkpoints in our experiments, so the bug slipped through. The crash is caused by mismatched key-value pairs in the optimizer's self.state: the set() operations over the parameters in the set_optimizer function enumerate the parameters in a different order on different runs, so the self.state remapping performed by the optimizer's load_state_dict attaches the saved states to the wrong parameters. (See load_state_dict in the PyTorch Optimizer for more details.)
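To illustrate the failure mode with two toy parameters of sizes 768 and 2, mirroring the shapes in the traceback (this is not the repository code, only a standalone sketch):

```python
import torch

# torch.optim.Optimizer.load_state_dict re-keys the saved per-parameter state
# purely by position within param_groups, so the i-th saved state is attached
# to the i-th current parameter, whatever its shape.
w = torch.nn.Parameter(torch.zeros(768))   # e.g. a hidden-size-768 bias
b = torch.nn.Parameter(torch.zeros(2))     # e.g. a 2-class classifier bias

# Run 1: parameters registered as [w, b]; one step creates exp_avg buffers
# of shapes (768,) and (2,) in that order.
opt_save = torch.optim.Adam([w, b])
(w.sum() + b.sum()).backward()
opt_save.step()
state = opt_save.state_dict()

# Run 2: if a set() enumerates the same parameters as [b, w] instead, the
# (768,)-shaped exp_avg gets assigned to the (2,)-sized tensor.
opt_load = torch.optim.Adam([b, w])
opt_load.load_state_dict(state)
(w.sum() + b.sum()).backward()
opt_load.step()  # raises a size-mismatch RuntimeError like the one above
```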

We have fixed this bug by removing all set() operations from the set_optimizer function in utils/optimization.py. Training from scratch and then resuming from a checkpoint now works as expected.
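For reference, a deterministic grouping looks roughly like the sketch below. This is not the exact patch in utils/optimization.py (which keeps its own AdamW implementation); it only illustrates replacing set() with order-preserving lists built from model.named_parameters():

```python
import torch

def set_optimizer(model, lr=5e-4, weight_decay=1e-4):
    # Iterate named_parameters() in its stable insertion order and split by
    # name with lists, so repeated runs build identical param_groups and
    # Optimizer.load_state_dict maps the saved state to the right tensors.
    no_decay = ('bias', 'LayerNorm.weight')
    decay_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and not any(nd in n for nd in no_decay)]
    plain_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and any(nd in n for nd in no_decay)]
    grouped = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': plain_params, 'weight_decay': 0.0},
    ]
    return torch.optim.AdamW(grouped, lr=lr)
```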

Thanks again for pointing out this problem.
