Crash while attempting restart #382
-
Hello, our cluster has a 1-week wall-time limit and my model is large, so training halted before it converged. I tried to restart the training job to continue training, but when I set "append: True" in the input yaml file and reran the training, I got a crash in the train.err file:
How can I get out of this situation? I'm training on forces, energies, and stresses.
-
Has the learning rate dropped yet? If all the settings are as they were, you can do this as a "pretraining" in the style of #235. In general, `append: True` should always be set if you want to be able to restart... I should probably change the default to True.
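For reference, a minimal sketch of where `append` sits in the training yaml; only `append: true` comes from this thread, and the remaining keys and values are illustrative placeholders rather than the poster's actual settings:

```yaml
# Restart-related portion of the training yaml (illustrative sketch).
# Only `append: true` is discussed in this thread; `root`, `run_name`,
# and everything else stand in for the original, unchanged settings.
root: results/
run_name: my_model
append: true     # reuse the existing run directory and keep appending instead of starting fresh
# ... all other settings identical to the original run ...
```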
-
My restart attempt used the exact same parameters; I didn't change anything in the yaml file except adding "append: true". Maybe I misunderstood the documentation about restarting: I thought the continuation run should have the "append: true" line, not the first run. IMO, documentation on continuing a training run, both with and without changed parameters, would be very helpful, because many clusters have wall-time limits and people sometimes train on large input geometries. Let me try the "pretraining" method. May I ask which files are required to continue a stopped training job if I want to continue the training from a separate folder?
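One possible workflow, sketched below under the assumption that the run's training state (checkpoints, trainer state, config) lives in a single results directory; the paths and the `nequip-train` invocation are assumptions for illustration, not something confirmed in this thread:

```bash
# Hedged sketch: copy the whole results directory so the checkpoints and
# trainer state travel together, then rerun the same yaml (with append: true)
# from the new location. All paths and file names are placeholders.
cp -r /old/workdir/results /new/workdir/results
cd /new/workdir
nequip-train my_config.yaml   # same yaml as the original run, with append: true set
```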