I reproduced the data preprocessing and then trained the model with the electra-large-discriminator PLM and the msde local_and_nonlocal strategy.
I found that it takes around 50 minutes per epoch on a Tesla V100 32G with the same hyper-parameters as in the paper.
Besides, I made some modifications to use DDP with 4 GPUs (roughly the standard recipe, see the sketch below), but the time only dropped to 40 minutes per epoch.
Is this consistent with your training time?
I want to run some experiments with the LGESQL base model, but the time consumption is... [SAD]
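For context, my change is essentially the standard PyTorch DDP setup; this is a simplified sketch with a hypothetical helper name (`wrap_for_ddp`), not the actual LGESQL code:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def wrap_for_ddp(model, dataset, batch_size):
    # one process per GPU, e.g. launched via: torchrun --nproc_per_node=4 train.py
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    # DistributedSampler shards the data so each of the 4 GPUs sees 1/4 of each epoch
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```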
Sorry for the late reply. I just checked my experiment logs. Training with a large-series PLM takes roughly 1200 seconds per epoch, using the script run/run_train_lgesql_plm.sh. Sadly, your experiment does seem a little slower. But 200 epochs is not a necessity; actually, 100 epochs is enough for comparable performance. We trained the large-series PLM for more epochs just for more stable performance, according to this work.
If you just want to verify your ideas, why not experiment with GloVe embeddings or a base-series PLM? That is much faster. We did not spend much time on hyper-parameter tuning with the large-series PLM, and most ablation studies were conducted with GloVe embeddings. As for setting the grad_accumulate hyper-parameter (the actual mini-batch size for each forward pass is batch_size / grad_accumulate; see the sketch below), you can run the model and check the CUDA memory usage on your GPU device.
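To illustrate what grad_accumulate does, here is a minimal toy sketch (hypothetical names, not the repo's actual training loop):

```python
import torch

# effective batch_size = 8, grad_accumulate = 4 -> mini-batch of 2 per forward pass
batch_size, grad_accumulate = 8, 4
mini_batch = batch_size // grad_accumulate

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
data = torch.randn(batch_size, 10)

optimizer.zero_grad()
for step in range(grad_accumulate):
    chunk = data[step * mini_batch:(step + 1) * mini_batch]
    # scale the loss so the accumulated gradient matches one full batch
    loss = model(chunk).pow(2).mean() / grad_accumulate
    loss.backward()
optimizer.step()  # one parameter update per grad_accumulate forward passes
```

A larger grad_accumulate lowers peak memory at the same effective batch size, which is why checking CUDA memory usage is the practical way to pick it.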
Attention: there is a mistake about the learning rate in the original paper. The learning rate should be 1e-4 for the large-series PLM and 2e-4 for the base-series PLM (e-4, not e-5).