Can you share the training log of t2m_trans #7
Open · CDLCHOI opened this issue Apr 18, 2024 · 6 comments

CDLCHOI commented Apr 18, 2024

Can you share the training log of t2m_trans?
I found it difficult to train t2m_trans.

@exitudio (Owner)

Hi,
Here is the log from before the code cleanup. I am re-training to verify it again.

2023-10-12 10:11:15,494 INFO {
    "batch_size": 512,
    "block_size": 51,
    "clip_dim": 512,
    "code_dim": 32,
    "dataname": "t2m",
    "decay_option": "all",
    "depth": 3,
    "dilation_growth_rate": 3,
    "down_t": 2,
    "drop_out_rate": 0.1,
    "embed_dim_gpt": 1024,
    "eval_iter": 5000,
    "exp_name": "HML3D_45_crsAtt1lyr_40breset",
    "ff_rate": 4,
    "fps": [
        20
    ],
    "gamma": 0.05,
    "if_maxtest": false,
    "lr": 0.0001,
    "lr_scheduler": [
        37500
    ],
    "mu": 0.99,
    "n_head_gpt": 16,
    "nb_code": 8192,
    "num_layers": 9,
    "optimizer": "adamw",
    "out_dir": "/home/epinyoan/git/MaskText2Motion/T2M-BD/output/t2m/2023-10-12-10-11-15_HML3D_45_crsAtt1lyr_40breset/",
    "output_emb_width": 512,
    "pkeep": 0.5,
    "print_iter": 200,
    "quantbeta": 1.0,
    "quantizer": "ema_reset",
    "resume_pth": "/home/epinyoan/git/MaskText2Motion/T2M-BD/output/vq/2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32/net_last.pth",
    "resume_trans": null,
    "seed": 123,
    "seq_len": 64,
    "stride_t": 2,
    "total_iter": 75000,
    "vq_act": "relu",
    "vq_dir": "/home/epinyoan/git/MaskText2Motion/T2M-BD/output/vq/2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32",
    "vq_name": "2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32",
    "warm_up_iter": 1000,
    "weight_decay": 1e-06,
    "width": 512
}
2023-10-12 12:17:31,567 INFO --> 	 Eva. Iter 5000 :, 
                FID. 0.7736 , 
                Diversity Real. 9.4222, 
                Diversity. 9.4605, 
                R_precision_real. [0.51396277 0.70146277 0.78856383], 
                R_precision. [0.48271277 0.67021277 0.76462766], 
                matching_score_real. 2.903815122360879, 
                matching_score_pred. 3.2019664936877312, 
                multimodality. 0.0000
2023-10-12 12:17:31,568 INFO --> --> 	 FID Improved from 1000.00000 to 0.77361 !!!
2023-10-12 12:17:31,568 INFO --> --> 	 matching_score Improved from 100.00000 to 3.20197 !!!
2023-10-12 12:17:31,569 INFO --> --> 	 Diversity Improved from 100.00000 to 9.46046 !!!
2023-10-12 12:17:31,569 INFO --> --> 	 Top1 Improved from 0.0000 to 0.4827 !!!
2023-10-12 12:17:31,569 INFO --> --> 	 Top2 Improved from 0.0000 to 0.6702 !!!
2023-10-12 12:17:31,569 INFO --> --> 	 Top3 Improved from 0.0000 to 0.7646 !!!
2023-10-12 14:23:16,358 INFO --> 	 Eva. Iter 10000 :, 
                FID. 0.2061 , 
                Diversity Real. 9.2730, 
                Diversity. 9.6214, 
                R_precision_real. [0.52859043 0.71476064 0.80851064], 
                R_precision. [0.49069149 0.72074468 0.81515957], 
                matching_score_real. 2.8495587592429303, 
                matching_score_pred. 2.8929001118274447, 
                multimodality. 0.0000
2023-10-12 14:23:16,358 INFO --> --> 	 FID Improved from 0.77361 to 0.20613 !!!
2023-10-12 14:23:16,358 INFO --> --> 	 matching_score Improved from 3.20197 to 2.89290 !!!
2023-10-12 14:23:16,359 INFO --> --> 	 Top1 Improved from 0.4827 to 0.4907 !!!
2023-10-12 14:23:16,359 INFO --> --> 	 Top2 Improved from 0.6702 to 0.7207 !!!
2023-10-12 14:23:16,359 INFO --> --> 	 Top3 Improved from 0.7646 to 0.8152 !!!
2023-10-12 16:28:48,441 INFO --> 	 Eva. Iter 15000 :, 
                FID. 0.1951 , 
                Diversity Real. 9.5439, 
                Diversity. 9.8876, 
                R_precision_real. [0.51396277 0.70212766 0.81117021], 
                R_precision. [0.50797872 0.71409574 0.79720745], 
                matching_score_real. 2.8992625551020845, 
                matching_score_pred. 2.8141087217533842, 
                multimodality. 0.0000
2023-10-12 16:28:48,442 INFO --> --> 	 FID Improved from 0.20613 to 0.19505 !!!
2023-10-12 16:28:48,443 INFO --> --> 	 matching_score Improved from 2.89290 to 2.81411 !!!
2023-10-12 16:28:48,443 INFO --> --> 	 Top1 Improved from 0.4907 to 0.5080 !!!
2023-10-12 18:34:06,733 INFO --> 	 Eva. Iter 20000 :, 
                FID. 0.1696 , 
                Diversity Real. 9.7836, 
                Diversity. 9.7076, 
                R_precision_real. [0.52726064 0.71476064 0.79920213], 
                R_precision. [0.5206117  0.71276596 0.81316489], 
                matching_score_real. 2.8873865046399705, 
                matching_score_pred. 2.8751501681956837, 
                multimodality. 0.0000
2023-10-12 18:34:06,734 INFO --> --> 	 FID Improved from 0.19505 to 0.16957 !!!
2023-10-12 18:34:06,734 INFO --> --> 	 Diversity Improved from 9.46046 to 9.70763 !!!
2023-10-12 18:34:06,734 INFO --> --> 	 Top1 Improved from 0.5080 to 0.5206 !!!
2023-10-12 20:39:25,504 INFO --> 	 Eva. Iter 25000 :, 
                FID. 0.1105 , 
                Diversity Real. 9.6543, 
                Diversity. 9.7581, 
                R_precision_real. [0.50930851 0.69946809 0.78523936], 
                R_precision. [0.51795213 0.70545213 0.79654255], 
                matching_score_real. 2.9209979138475783, 
                matching_score_pred. 2.8628295431745814, 
                multimodality. 0.0000
2023-10-12 20:39:25,505 INFO --> --> 	 FID Improved from 0.16957 to 0.11046 !!!
2023-10-12 22:44:55,688 INFO --> 	 Eva. Iter 30000 :, 
                FID. 0.1699 , 
                Diversity Real. 9.3835, 
                Diversity. 9.3433, 
                R_precision_real. [0.5099734  0.70013298 0.79853723], 
                R_precision. [0.52393617 0.70146277 0.80851064], 
                matching_score_real. 2.86837160333674, 
                matching_score_pred. 2.8496022833154555, 
                multimodality. 0.0000
2023-10-12 22:44:55,689 INFO --> --> 	 Diversity Improved from 9.70763 to 9.34333 !!!
2023-10-12 22:44:55,689 INFO --> --> 	 Top1 Improved from 0.5206 to 0.5239 !!!
2023-10-13 00:50:39,065 INFO --> 	 Eva. Iter 35000 :, 
                FID. 0.1504 , 
                Diversity Real. 9.4453, 
                Diversity. 9.8359, 
                R_precision_real. [0.5099734  0.70079787 0.80984043], 
                R_precision. [0.50132979 0.69946809 0.7918883 ], 
                matching_score_real. 2.9201048739412996, 
                matching_score_pred. 2.9048643112182617, 
                multimodality. 0.0000
2023-10-13 02:56:09,232 INFO --> 	 Eva. Iter 40000 :, 
                FID. 0.1146 , 
                Diversity Real. 9.6758, 
                Diversity. 9.6304, 
                R_precision_real. [0.52460106 0.70611702 0.79654255], 
                R_precision. [0.54787234 0.71010638 0.80851064], 
                matching_score_real. 2.9239063770213027, 
                matching_score_pred. 2.7912252811675375, 
                multimodality. 0.0000
2023-10-13 02:56:09,233 INFO --> --> 	 matching_score Improved from 2.81411 to 2.79123 !!!
2023-10-13 02:56:09,233 INFO --> --> 	 Diversity Improved from 9.34333 to 9.63041 !!!
2023-10-13 02:56:09,233 INFO --> --> 	 Top1 Improved from 0.5239 to 0.5479 !!!
2023-10-13 05:01:32,842 INFO --> 	 Eva. Iter 45000 :, 
                FID. 0.1257 , 
                Diversity Real. 9.6170, 
                Diversity. 9.7833, 
                R_precision_real. [0.51462766 0.69082447 0.79787234], 
                R_precision. [0.53125    0.73271277 0.82978723], 
                matching_score_real. 2.914106272636576, 
                matching_score_pred. 2.757748162492793, 
                multimodality. 0.0000
2023-10-13 05:01:32,843 INFO --> --> 	 matching_score Improved from 2.79123 to 2.75775 !!!
2023-10-13 05:01:32,843 INFO --> --> 	 Top2 Improved from 0.7207 to 0.7327 !!!
2023-10-13 05:01:32,843 INFO --> --> 	 Top3 Improved from 0.8152 to 0.8298 !!!
2023-10-13 07:06:57,419 INFO --> 	 Eva. Iter 50000 :, 
                FID. 0.1295 , 
                Diversity Real. 9.4078, 
                Diversity. 9.3965, 
                R_precision_real. [0.51595745 0.70013298 0.79454787], 
                R_precision. [0.52526596 0.7287234  0.81981383], 
                matching_score_real. 2.8886835473649044, 
                matching_score_pred. 2.776045175308877, 
                multimodality. 0.0000
2023-10-13 07:06:57,420 INFO --> --> 	 Diversity Improved from 9.63041 to 9.39650 !!!
2023-10-13 09:12:30,226 INFO --> 	 Eva. Iter 55000 :, 
                FID. 0.1354 , 
                Diversity Real. 9.4323, 
                Diversity. 9.9881, 
                R_precision_real. [0.50797872 0.70545213 0.79454787], 
                R_precision. [0.52726064 0.71941489 0.81183511], 
                matching_score_real. 2.9032556199012918, 
                matching_score_pred. 2.848728185004376, 
                multimodality. 0.0000
2023-10-13 11:18:25,270 INFO --> 	 Eva. Iter 60000 :, 
                FID. 0.1618 , 
                Diversity Real. 9.4185, 
                Diversity. 9.8101, 
                R_precision_real. [0.51728723 0.71143617 0.80518617], 
                R_precision. [0.53856383 0.73404255 0.81981383], 
                matching_score_real. 2.9068394011639533, 
                matching_score_pred. 2.769266184340132, 
                multimodality. 0.0000
2023-10-13 11:18:25,270 INFO --> --> 	 Top2 Improved from 0.7327 to 0.7340 !!!
2023-10-13 13:24:16,032 INFO --> 	 Eva. Iter 65000 :, 
                FID. 0.1228 , 
                Diversity Real. 9.4212, 
                Diversity. 9.4895, 
                R_precision_real. [0.51928191 0.6974734  0.79321809], 
                R_precision. [0.52726064 0.73138298 0.82579787], 
                matching_score_real. 2.9037675553179803, 
                matching_score_pred. 2.7725830686853286, 
                multimodality. 0.0000
2023-10-13 15:29:53,293 INFO --> 	 Eva. Iter 70000 :, 
                FID. 0.0969 , 
                Diversity Real. 9.9374, 
                Diversity. 9.5377, 
                R_precision_real. [0.53457447 0.71343085 0.80385638], 
                R_precision. [0.52526596 0.73271277 0.82845745], 
                matching_score_real. 2.8475664473594504, 
                matching_score_pred. 2.8103004364257163, 
                multimodality. 0.0000
2023-10-13 15:29:53,294 INFO --> --> 	 FID Improved from 0.11046 to 0.09693 !!!
2023-10-13 15:29:53,294 INFO --> --> 	 Diversity Improved from 9.39650 to 9.53765 !!!
2023-10-13 18:34:33,146 INFO --> 	 Eva. Iter 75000 :, 
                FID. 0.0914 , 
                Diversity Real. 9.6473, 
                Diversity. 9.7987, 
                R_precision_real. [0.50560345 0.70150862 0.79935345], 
                R_precision. [0.51198276 0.70967672 0.80681034], 
                matching_score_real. 2.9873770861790097, 
                matching_score_pred. 2.909082286440093, 
                multimodality. 1.1684
2023-10-13 18:34:33,147 INFO --> --> 	 FID Improved from 0.09693 to 0.09137 !!!
2023-10-13 18:34:34,787 INFO Train. Iter 75000 : FID. 0.09137, Diversity. 9.5377, TOP1. 0.5479, TOP2. 0.7340, TOP3. 0.8298
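For reference, the hyperparameters at the top of this log correspond roughly to a command like the one below. This is only a sketch, not the exact command of the original run: the flag names are assumed from the train_t2m_trans.py invocation posted later in this thread, the output directory is shortened, and anything not listed keeps the value shown in the config header.

# sketch: hyperparameter values copied from the config header above; flag names assumed
python3 train_t2m_trans.py \
    --exp-name HML3D_45_crsAtt1lyr_40breset \
    --batch-size 512 \
    --vq-name 2023-07-19-04-17-17_12_VQVAE_20batchResetNRandom_8192_32 \
    --out-dir output/t2m \
    --total-iter 75000 \
    --lr-scheduler 37500 \
    --dataname t2m \
    --eval-iter 5000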

CDLCHOI (Author) commented Apr 19, 2024

Thanks.
I loaded your trans.pth and fine-tuned it with lr=1e-4; at the beginning of training, loss ≈ 2.2 and acc ≈ 60%.

  1. As I kept training, the loss eventually dropped below 0.5 and the accuracy rose above 80%. Did you choose an intermediate checkpoint as trans.pth?
  2. When I evaluate that loss ≈ 0.5 / acc > 80% checkpoint, its FID and R_precision are clearly worse than the trans.pth you provided. Why does this happen?

@exitudio (Owner)

  1. No, it's saved from the last epoch.
  2. If you intend to train a Transformer, only load the pretrained vqvae.pth model. Perhaps your model has been overtrained.

CDLCHOI (Author) commented Apr 19, 2024

But I think my model is not overtrained.
I trained from scratch again with your code from #6:
python3 train_t2m_trans.py \
--exp-name validation_train \
--batch-size 128 \
--vq-name pretrain \
--out-dir output/test \
--total-iter 300000 \
--lr-scheduler 150000 \
--dataname t2m \
--eval-iter 20000

Here is part of my log (the code had not run evaluation yet):
2024-04-19 09:52:15,403 INFO Train. Iter 10 : Loss. 8.96595,acc:0.58339
2024-04-19 10:13:31,970 INFO Train. Iter 4000 : Loss. 5.02665,acc:10.64897
2024-04-19 10:34:56,329 INFO Train. Iter 8000 : Loss. 3.43890,acc:25.89595
2024-04-19 10:56:53,287 INFO Train. Iter 12000 : Loss. 2.06721,acc:43.09161
2024-04-19 11:46:31,417 INFO Train. Iter 20000 : Loss. 1.05492,acc:66.60687
2024-04-19 12:07:59,340 INFO Train. Iter 24000 : Loss. 1.05645,acc:68.19553
2024-04-19 12:40:16,859 INFO Train. Iter 30000 : Loss. 0.82608,acc:77.09264
2024-04-19 15:22:51,580 INFO Train. Iter 32290 : Loss. 0.75970,acc:69.74814

And here is part of the log from fine-tuning with "resume_trans": "pretrain/trans.pth":
"resume_pth": "./output/vq/vq_name/net_last.pth",
"resume_trans": "pretrain/trans.pth",
"root_dist_loss": false,
"root_loss_no_vel_rot": false,
"save_iter": 2000,
"seed": 123,
"seq_len": 64,
"stride_t": 2,
"temporal_complete": 0.0,
"text": null,
"total_iter": 30000,
"traj_supervise": false,
"vq_act": "relu",
"vq_dir": "./output/vq/vq_name",
"vq_name": "output/vq/vq_name",
"warm_up_iter": 1000,
"weight_decay": 1e-06,
"width": 512,
"xyz_type": "all"
}
2024-04-18 12:05:16,756 INFO Train. Iter 20 : Loss. 2.88080
2024-04-18 12:05:20,199 INFO Train. Iter 40 : Loss. 2.79312
2024-04-18 12:05:23,651 INFO Train. Iter 60 : Loss. 2.79212
2024-04-18 12:08:05,232 INFO Train. Iter 1000 : Loss. 2.09924
2024-04-18 12:10:58,100 INFO Train. Iter 2000 : Loss. 1.39754
2024-04-18 12:16:39,970 INFO Train. Iter 3980 : Loss. 1.06009
2024-04-18 12:16:43,554 INFO Train. Iter 4000 : Loss. 1.02272
2024-04-18 12:22:30,369 INFO Train. Iter 6000 : Loss. 0.94934
2024-04-18 12:28:35,686 INFO Train. Iter 8100 : Loss. 0.77994
2024-04-18 12:35:36,789 INFO Train. Iter 10380 : Loss. 0.64359
2024-04-18 12:35:40,204 INFO Train. Iter 10400 : Loss. 0.63709
2024-04-18 12:43:06,749 INFO Train. Iter 13000 : Loss. 0.47693
2024-04-18 12:45:58,573 INFO Train. Iter 14000 : Loss. 0.44967

I'm confused by these two results.
Do you know the reason? Or could you try fine-tuning from trans.pth for a short while with lr=1e-4?

Sorry to bother you again.

@exitudio (Owner)

Can you try training longer? It seems you're using a batch size of 128 for only 30,000 iterations.

I use a batch size of 512 for 75,000 iterations. I also tried a batch size of 128 with more iterations, 300,000, so that the total amount of training data is the same (see the quick check below). These two settings give similar results.
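(A quick arithmetic check of the "total data the same" point, using the two settings above:)

# both settings process the same total number of training samples
echo $((512 * 75000))     # 38400000
echo $((128 * 300000))    # 38400000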

If you have any further questions, please don't hesitate to ask. I'll do my best to assist you.

CDLCHOI (Author) commented Apr 20, 2024

Thank you so much. I will try again.
