Crash while attempting restart #382
-
Hello, our cluster has a 1-week wall-time limit and my model is large, so training halted before it converged. I tried to restart the training job to continue training, but when I set "append: True" in the input yaml file and reran the training, I got a crash in the train.err file:
How can I get out of this situation? I'm training on forces, energies, and stresses.
-
Has the learning rate dropped yet? If all the settings are as they were, you can do this as a "pretraining" in the style of #235. In general, `append: True` should always be set if you want to be able to restart... I should probably change the default to True.
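For reference, a minimal sketch of where `append` sits in the training yaml; only `append: true` comes from this thread, and the remaining keys and values are illustrative placeholders rather than the poster's actual settings:

```yaml
# Restart-related portion of the training yaml (illustrative sketch).
# Only `append: true` is discussed in this thread; `root`, `run_name`,
# and everything else stand in for the original, unchanged settings.
root: results/
run_name: my_model
append: true     # reuse the existing run directory and keep appending instead of starting fresh
# ... all other settings identical to the original run ...
```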
-
My restart attempt used the exact same parameters; I didn't change anything in the yaml file except adding "append: true". Maybe I misunderstood the documentation about restarting: I thought the continuation run should have the "append: true" line, not the first run. IMO, documentation on continuing a training run, both with and without changed parameters, would be very helpful, because many clusters have wall-time limits and people sometimes train on large input geometries. Let me try the "pretraining" method. May I ask which files are required to continue a stopped training job if I want to continue the training from a separate folder?
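One possible workflow, sketched below under the assumption that the run's training state (checkpoints, trainer state, config) lives in a single results directory; the paths and the `nequip-train` invocation are assumptions for illustration, not something confirmed in this thread:

```bash
# Hedged sketch: copy the whole results directory so the checkpoints and
# trainer state travel together, then rerun the same yaml (with append: true)
# from the new location. All paths and file names are placeholders.
cp -r /old/workdir/results /new/workdir/results
cd /new/workdir
nequip-train my_config.yaml   # same yaml as the original run, with append: true set
```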