Distributed training for the RCL task #24

Open
JadinTredupLP opened this issue Jan 25, 2022 · 2 comments

@JadinTredupLP

Hello, I am trying to pretrain ToD-BERT on my own dataset, but because of the dataset's size I need to distribute training to speed up computation. Distributed training appears to be built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task ourselves, but our training results show little to no improvement in the RS loss compared with the single-GPU case. Is there a specific reason you decided not to distribute the RCL task over multiple GPUs, or a problem you ran into, or is it more likely just a bug in our code?
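
In case it helps, here is a minimal sketch of one common pitfall when distributing an in-batch contrastive loss like RCL (this is not the repository's code; `gather_keep_grad`, `rcl_loss`, and the tensor names are assumptions). Under DistributedDataParallel each worker only sees its own shard of the batch, so the pool of in-batch negatives shrinks by a factor of the world size unless the response embeddings are gathered across GPUs before the similarity matrix is computed:

```python
# Hypothetical sketch of an RCL-style in-batch contrastive loss whose
# negatives are gathered from every GPU. Names are placeholders, not
# ToD-BERT's actual API.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def gather_keep_grad(t):
    """all_gather a tensor and keep the autograd path for the local shard."""
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # all_gather drops gradients; restore the local slot
    return torch.cat(gathered, dim=0)


def rcl_loss(context_emb, response_emb, temperature=1.0):
    """Cross-entropy over a (global_batch x global_batch) similarity matrix."""
    all_ctx = gather_keep_grad(context_emb)        # (world_size * B, H)
    all_rsp = gather_keep_grad(response_emb)       # (world_size * B, H)
    logits = all_ctx @ all_rsp.t() / temperature   # positive for row i is column i
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

One caveat: with this pattern gradients only flow through each worker's local shard of the gathered tensors, so it is only an approximation of the single-GPU gradient; some implementations use an autograd-aware all_gather instead.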

@jasonwu0731 (Owner)

Hi,

Can you share the error you get when running RCL training? We did not focus much on parallel training at the time and used the Hugging Face implementation for that.

@JadinTredupLP (Author)

I am not actually getting an error; the RS loss just does not decrease once training is distributed. On a single GPU it converges fine (for a small amount of data), but with the same amount of data the distributed run did not converge at all.
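
A quick sanity check in that situation (a hypothetical snippet, not from this repo; `logits` stands for whatever per-worker similarity matrix the response-selection loss is computed over) is to log how many in-batch negatives each worker actually sees, since sharding the batch without gathering embeddings silently shrinks that pool:

```python
# Hypothetical diagnostic: compare the per-worker negative pool with what a
# single-GPU run over the same global batch would have seen.
import torch.distributed as dist

def log_negative_pool(logits):
    local_batch = logits.size(0)
    negs_per_worker = local_batch - 1
    negs_single_gpu = local_batch * dist.get_world_size() - 1
    if dist.get_rank() == 0:
        print(f"in-batch negatives: {negs_per_worker} per worker "
              f"vs {negs_single_gpu} if the global batch were on one GPU")
```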
