Huge bumps in learning curves #74
-
Dear developers,

I have a problem while training the MLP: I get large bumps in the MAE of the forces. I know it is not unusual to see some bumps during training, but usually the error returns to its pre-bump value quite fast. That does not happen here: it takes a few hours, and in some cases a new bump occurs before the old optimal error is reached again. See for example the figure below. I know I could change the learning rate or the batch size, or restart the training from the optimal value before the bump, but I was wondering whether this is something you have seen while training MLPs, or whether you know what might cause it.

I ask because I am using almost the default settings of the "full.yaml" you provided (I only changed r_max to 6.0), so I would expect the settings to be quite good already. However, I got this strange behavior for two different systems (CsPbI3 and FAPbI3) and two different training set sizes (300 and 15000 structures). In the zip file you can find the full.yaml file and the training logs belonging to the figure.

Kind regards,
Tom
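For concreteness, the changed setting relative to the provided defaults would look something like this in the YAML config (a sketch: `r_max` is the only value taken from the description above; the `learning_rate` and `batch_size` values shown are assumed defaults, not confirmed ones):

```yaml
# Sketch of the relevant part of full.yaml as described above.
# Only r_max was changed from the provided defaults; the other
# values are assumptions about what those defaults might be.
r_max: 6.0            # cutoff radius, changed from the default
learning_rate: 0.005  # assumed default; see the replies below
batch_size: 5         # assumed default
```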
-
In general I believe this is a sign that your learning rate is too aggressive, but @simonbatzner should be able to confirm this. Side note: in general we recommend starting with …
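If the learning rate is indeed the culprit, a quick experiment would be to lower it and/or increase the batch size in the config, since both damp gradient noise. A minimal sketch (the values here are illustrative assumptions, not an official recommendation):

```yaml
# Illustrative: a smaller learning rate and/or a larger batch size
# should both smooth the learning curve; exact values are assumptions.
learning_rate: 0.001   # e.g. down from 0.005
batch_size: 10         # e.g. up from 5
```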
-
Forwarding some comments:
Out of curiosity, what units is this in?
-
Thanks for the tips. These results are in ASE units, so eV/Angstrom. EDIT: it will take a week before I know whether this resolves the issue.
-
Hi @tbraeckevelt, sorry for the late reply; a few additional notes to add to Alby's great answer: …
-
Thanks for the answer. I tested your suggestions: learning rate (LR) 0.005 and 0.001, and batch size (BS) 10 and 15. I came to the same conclusions as you did.
I will do some more tests on other systems, but for now I will use LR=0.005 as the default.
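In config terms, the settled choice would simply be (a sketch; all other keys left at their previous values):

```yaml
# Final choice after the LR/BS tests described above (sketch).
learning_rate: 0.005
```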