How to resume training with correct learning rate #326
-
I'm training a FastPitch model from scratch on custom datasets. The initial learning rate (lr) set in the config before training was 0.0001.
But the learning rate that got printed at the start of training was quite different: 2.5000000000000002e-08.

`> EPOCH: 0/1500 > TRAINING (2025-03-03 16:52:55) --> TIME: 2025-03-03 16:53:00 -- STEP: 0/686 -- GLOBAL_STEP: 0`

I assume this is happening because of the warmup steps, but I'm not sure. Later, the training stopped after 100 epochs due to an out-of-memory exception. I would like to resume training with the updated learning rate that was in use at epoch 100, i.e. 6.1e-6. However, when I tried executing this command, the learning rate printed in the latest epoch was again 2.5000000000000002e-08. Can I get any help on how to resume training with the updated learning rate?
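For what it's worth, the printed value matches what a Noam-style warmup schedule would produce at step 1. A minimal sketch of the formula, assuming the default `warmup_steps=4000` in `lr_scheduler_params` (check your config; the exact default is an assumption here):

```python
# Noam-style warmup as used by Coqui TTS's NoamLR scheduler (sketch;
# warmup_steps=4000 is an assumption -- check lr_scheduler_params).
base_lr = 1e-4       # the lr set in the config
warmup_steps = 4000

def noam_lr(step: int) -> float:
    step = max(step, 1)  # guard against step 0
    scale = warmup_steps**0.5 * min(step * warmup_steps**-1.5, step**-0.5)
    return base_lr * scale

print(noam_lr(1))     # ~2.5e-08, matching the value in the training log
print(noam_lr(4000))  # reaches base_lr once warmup completes
```

At step 1 the factor reduces to `1 / warmup_steps`, so `1e-4 / 4000 = 2.5e-8`, which is exactly the logged value.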
-
To continue a training run, you should use `--continue_path path/` instead of `--restore_path path/model.pth`; that should correctly restore the parameters.
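For reference, a minimal sketch of the same distinction through the Trainer API (the `trainer` package that 🐸TTS training scripts build on; the paths below are placeholders, and the described scheduler behavior is an assumption consistent with the log above):

```python
from trainer import TrainerArgs

# continue_path resumes the run in place: model weights, optimizer and
# scheduler state, and the global step are all restored, so the learning
# rate picks up where it left off (e.g. ~6.1e-6 at epoch 100).
args = TrainerArgs(continue_path="path/to/previous_run/")

# restore_path, by contrast, starts a *new* run from the given weights:
# the step counter resets to 0, so the warmup begins again -- which is
# consistent with 2.5e-08 being printed after resuming.
# args = TrainerArgs(restore_path="path/to/previous_run/model.pth")
```

The same fields are exposed as the `--continue_path` / `--restore_path` CLI flags mentioned above.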