Distributed training for the RCL task #24

Open
JadinTredupLP opened this issue Jan 25, 2022 · 2 comments

@JadinTredupLP

Hello, I am trying to pretrain ToD-BERT on my own dataset, but because of the dataset's size I need to distribute training to speed up computation. Distributed training appears to be built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task ourselves, but our training results show little to no improvement in the RS loss compared with the single-GPU case. Is there a specific reason you decided not to distribute the RCL task over multiple GPUs, or a problem you ran into, or is it more likely just a bug in our code?
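
In case it helps, here is a minimal sketch of one common pitfall when distributing an in-batch contrastive loss like RCL (this is not the repository's code; `gather_keep_grad`, `rcl_loss`, and the tensor names are assumptions). Under DistributedDataParallel each worker only sees its own shard of the batch, so the pool of in-batch negatives shrinks by a factor of the world size unless the response embeddings are gathered across GPUs before the similarity matrix is computed:

```python
# Hypothetical sketch of an RCL-style in-batch contrastive loss whose
# negatives are gathered from every GPU. Names are placeholders, not
# ToD-BERT's actual API.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def gather_keep_grad(t):
    """all_gather a tensor and keep the autograd path for the local shard."""
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # all_gather drops gradients; restore the local slot
    return torch.cat(gathered, dim=0)


def rcl_loss(context_emb, response_emb, temperature=1.0):
    """Cross-entropy over a (global_batch x global_batch) similarity matrix."""
    all_ctx = gather_keep_grad(context_emb)        # (world_size * B, H)
    all_rsp = gather_keep_grad(response_emb)       # (world_size * B, H)
    logits = all_ctx @ all_rsp.t() / temperature   # positive for row i is column i
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

One caveat: with this pattern gradients only flow through each worker's local shard of the gathered tensors, so it is only an approximation of the single-GPU gradient; some implementations use an autograd-aware all_gather instead.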

@jasonwu0731 (Owner)

Hi,

Can you share the error you get when running RCL training? We did not focus much on parallel training at the time and used the Hugging Face implementation for that.

@JadinTredupLP (Author)

I am not actually getting an error; the RS loss just does not decrease once training is distributed. On a single GPU it converges fine (for a small amount of data), but with the same amount of data the distributed run did not converge at all.
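
A quick sanity check in that situation (a hypothetical snippet, not from this repo; `logits` stands for whatever per-worker similarity matrix the response-selection loss is computed over) is to log how many in-batch negatives each worker actually sees, since sharding the batch without gathering embeddings silently shrinks that pool:

```python
# Hypothetical diagnostic: compare the per-worker negative pool with what a
# single-GPU run over the same global batch would have seen.
import torch.distributed as dist

def log_negative_pool(logits):
    local_batch = logits.size(0)
    negs_per_worker = local_batch - 1
    negs_single_gpu = local_batch * dist.get_world_size() - 1
    if dist.get_rank() == 0:
        print(f"in-batch negatives: {negs_per_worker} per worker "
              f"vs {negs_single_gpu} if the global batch were on one GPU")
```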
