
When using the pre-trained MT-DNN ckpt, the training loss does not converge #212

heshenghuan opened this issue May 12, 2021
The init_checkpoint was trained with scripts/run_mt_dnn.sh. Only the random seed was changed, and the training loss looks like this:

05/12/2021 10:37:28 Total number of params: 110085136
05/12/2021 10:37:28 At epoch 0
05/12/2021 10:37:29 Task [ 1] updates[     1] train loss[0.11215] remaining[6:25:06]
05/12/2021 10:39:55 Task [ 1] updates[   500] train loss[0.07299] remaining[2:22:14]
05/12/2021 10:42:20 Task [ 4] updates[  1000] train loss[0.07097] remaining[2:19:29]
05/12/2021 10:44:45 Task [ 1] updates[  1500] train loss[0.07218] remaining[2:16:38]
05/12/2021 10:47:10 Task [ 4] updates[  2000] train loss[0.07355] remaining[2:14:01]
05/12/2021 10:49:37 Task [ 3] updates[  2500] train loss[0.07262] remaining[2:11:59]
05/12/2021 10:52:06 Task [ 1] updates[  3000] train loss[0.07277] remaining[2:10:00]
05/12/2021 10:54:33 Task [ 3] updates[  3500] train loss[0.07201] remaining[2:07:40]
05/12/2021 10:57:00 Task [ 1] updates[  4000] train loss[0.07130] remaining[2:05:15]
05/12/2021 10:59:25 Task [ 4] updates[  4500] train loss[0.07161] remaining[2:02:43]
05/12/2021 11:01:50 Task [ 4] updates[  5000] train loss[0.07407] remaining[2:00:12]
05/12/2021 11:04:19 Task [ 4] updates[  5500] train loss[0.09848] remaining[1:57:56]
05/12/2021 11:06:44 Task [ 6] updates[  6000] train loss[0.23665] remaining[1:55:25]
05/12/2021 11:09:10 Task [ 1] updates[  6500] train loss[0.51240] remaining[1:52:58]
05/12/2021 11:11:37 Task [ 4] updates[  7000] train loss[0.82062] remaining[1:50:33]
05/12/2021 11:14:05 Task [ 1] updates[  7500] train loss[1.12256] remaining[1:48:10]
05/12/2021 11:16:29 Task [ 4] updates[  8000] train loss[1.40271] remaining[1:45:38]
05/12/2021 11:18:53 Task [ 1] updates[  8500] train loss[1.67136] remaining[1:43:05]
05/12/2021 11:21:19 Task [ 4] updates[  9000] train loss[1.93972] remaining[1:40:39]
05/12/2021 11:23:46 Task [ 1] updates[  9500] train loss[2.20053] remaining[1:38:15]
05/12/2021 11:26:12 Task [ 4] updates[ 10000] train loss[2.43337] remaining[1:35:49]
05/12/2021 11:28:38 Task [ 1] updates[ 10500] train loss[2.67121] remaining[1:33:21]
05/12/2021 11:31:01 Task [ 4] updates[ 11000] train loss[2.90422] remaining[1:30:51]
05/12/2021 11:33:27 Task [ 4] updates[ 11500] train loss[3.13861] remaining[1:28:24]
05/12/2021 11:35:52 Task [ 1] updates[ 12000] train loss[3.35614] remaining[1:25:57]
05/12/2021 11:38:16 Task [ 4] updates[ 12500] train loss[3.56738] remaining[1:23:28]
05/12/2021 11:40:42 Task [ 4] updates[ 13000] train loss[3.78074] remaining[1:21:03]
05/12/2021 11:43:07 Task [ 1] updates[ 13500] train loss[3.98890] remaining[1:18:35]
05/12/2021 11:45:33 Task [ 1] updates[ 14000] train loss[4.19893] remaining[1:16:09]
05/12/2021 11:47:57 Task [ 6] updates[ 14500] train loss[4.38943] remaining[1:13:42]
05/12/2021 11:50:25 Task [ 1] updates[ 15000] train loss[4.59797] remaining[1:11:18]
05/12/2021 11:52:51 Task [ 4] updates[ 15500] train loss[4.79801] remaining[1:08:52]
05/12/2021 11:55:15 Task [ 1] updates[ 16000] train loss[4.99510] remaining[1:06:25]
05/12/2021 11:57:41 Task [ 1] updates[ 16500] train loss[5.18970] remaining[1:03:59]
05/12/2021 12:00:07 Task [ 3] updates[ 17000] train loss[5.38753] remaining[1:01:33]
namisan (Owner) commented May 12, 2021

Is the learning rate too big?

heshenghuan (Author) commented May 13, 2021

> Is the learning rate too big?

It seems to be a random seed problem. With the default random seed, the training loss starts increasing (even with a smaller learning rate); choosing another random seed works around the problem.
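
For reference, a minimal sketch of how one might pin every RNG before training, assuming the usual Python/NumPy/PyTorch stack; the `set_seed` helper and the `--seed` override mentioned below are only illustrations, not the repo's actual interface:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Seed every RNG the training loop may touch so runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


# Hypothetical usage: pick a seed other than the default before building the model,
# e.g. when launching scripts/run_mt_dnn.sh with a different seed argument.
set_seed(2021)
```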

heshenghuan (Author) commented
Another issue:

mt-dnn/train.py, line 398 (commit 471f717):

opt.update(config)

This line seems to overwrite the training params when a pre-trained mt-dnn model is used.
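
A minimal sketch of one possible workaround, assuming `opt` is the dict of command-line arguments and `config` is the dict loaded from the pre-trained checkpoint; the key names below are illustrative, not the repo's actual argument list:

```python
# Hypothetical guard: keep this run's own optimization settings instead of
# letting the checkpoint's stored config silently overwrite them.
TRAINING_KEYS = {"learning_rate", "batch_size", "epochs", "grad_clipping"}  # illustrative names

preserved = {k: opt[k] for k in TRAINING_KEYS if k in opt}
opt.update(config)      # merge the architecture/config stored with the checkpoint
opt.update(preserved)   # then restore the training hyperparameters passed on the CLI
```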

namisan (Owner) commented Nov 13, 2021

@heshenghuan, I reran the script with different random seeds and did not hit the bug you mentioned.
I'm wondering which pre-trained model was used in your experiments.

Yes, the config in mt-dnn should be the same as the pre-trained config. If I remember correctly, I removed the other unrelated args.
