Optimizing Batch-size and Learning-rate schedules for distributed computing #1611
sck-at-ucy started this conversation in Ideas
I am finding (at least in the case of my model) that to get good scaling of distributed training performance when the number of nodes is 3 or more, I have to use schedules for both the batch size and the learning rate. Starting with a smaller batch size during the first few epochs, when the gradients are changing faster, helps reduce the loss quickly; I then double the batch size every n epochs up to a predefined maximum batch size and keep it fixed from that point on.
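For concreteness, here is a minimal sketch of the kind of batch-size schedule I mean (the base size, cap, and doubling interval below are placeholders, not the values I actually use):

```python
def batch_size_for_epoch(epoch, base=32, max_batch=512, double_every=5):
    """Double the batch size every `double_every` epochs, capped at `max_batch`.

    The concrete numbers are placeholders, not my actual settings.
    """
    return min(base * 2 ** (epoch // double_every), max_batch)

# With the placeholder settings: epochs 0-4 use 32, epochs 5-9 use 64,
# and so on until the cap of 512 is reached.
for epoch in (0, 4, 5, 10, 40):
    print(epoch, batch_size_for_epoch(epoch))
```

The learning-rate schedule is tuned alongside this, which is the other half of the trial-and-error work.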
This seems to work well, but I am looking for ways to avoid a trial-and-error approach. If the model or the dataset changes, I would likely have to repeat the process all over again, and the time spent finding near-optimal schedules for the batch size and learning rate largely negates the benefit of distributed training.
Thus, I was wondering whether tools based on Bayesian optimization, for example Optuna, could be used to streamline this process. Before investing time in this direction, I would be happy to hear the advice of @awni and @angeloskath. Is this worthwhile? Are there better solutions to this issue?
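To make the question concrete, this is roughly what I have in mind with Optuna; the search ranges are placeholders, and `train_and_evaluate` is a hypothetical stand-in for a short distributed training run (replaced here by a synthetic score so the sketch is self-contained):

```python
import math
import optuna

def train_and_evaluate(base_batch_size, double_every, max_batch_size, peak_lr):
    # Hypothetical stand-in for a real (short) distributed training run that
    # returns the validation loss; a synthetic score is used so this runs as-is.
    return (math.log10(peak_lr) + 3.5) ** 2 + 0.1 * abs(double_every - 5)

def objective(trial):
    # Search over the schedule parameters and the peak learning rate.
    base_batch_size = trial.suggest_categorical("base_batch_size", [16, 32, 64])
    double_every = trial.suggest_int("double_every", 2, 10)
    max_batch_size = trial.suggest_categorical("max_batch_size", [256, 512, 1024])
    peak_lr = trial.suggest_float("peak_lr", 1e-5, 1e-2, log=True)
    return train_and_evaluate(base_batch_size, double_every, max_batch_size, peak_lr)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```

The obvious concern is that each trial needs its own (possibly truncated) training run, so a small per-trial epoch budget or Optuna's pruners would probably be needed to keep the search cheaper than hand-tuning.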