Optimizing Batch-size and Learning-rate schedules for distributed computing #1611
sck-at-ucy started this conversation in Ideas
I am finding (at least in the case of my model) that to get good scaling of distributed training performance when the number of nodes is 3 or more, I have to use schedules for both the batch size and the learning rate. Starting with a smaller batch size during the first few epochs, when the gradients are changing faster, helps reduce the loss quickly; I then double the batch size every n epochs up to a predefined maximum batch size and keep it fixed from that point on.
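For concreteness, here is a minimal sketch of the kind of batch-size schedule I mean (the base size, cap, and doubling interval below are placeholders, not the values I actually use):

```python
def batch_size_for_epoch(epoch, base=32, max_batch=512, double_every=5):
    """Double the batch size every `double_every` epochs, capped at `max_batch`.

    The concrete numbers are placeholders, not my actual settings.
    """
    return min(base * 2 ** (epoch // double_every), max_batch)

# With the placeholder settings: epochs 0-4 use 32, epochs 5-9 use 64,
# and so on until the cap of 512 is reached.
for epoch in (0, 4, 5, 10, 40):
    print(epoch, batch_size_for_epoch(epoch))
```

The learning-rate schedule is tuned alongside this, which is the other half of the trial-and-error work.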
This seems to work well, but I am looking for ways to avoid a trial-and-error approach. If the model or the dataset changes, I would likely have to repeat the process all over again, and the time spent finding near-optimal schedules for the batch size and learning rate largely negates the benefit of distributed training.
Thus, I was wondering whether tools based on Bayesian optimization, for example Optuna, could be used to streamline this process. Before investing time in this direction, I would be happy to hear the advice of @awni and @angeloskath. Is this worthwhile? Are there better solutions to this issue?
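To make the question concrete, this is roughly what I have in mind with Optuna; the search ranges are placeholders, and `train_and_evaluate` is a hypothetical stand-in for a short distributed training run (replaced here by a synthetic score so the sketch is self-contained):

```python
import math
import optuna

def train_and_evaluate(base_batch_size, double_every, max_batch_size, peak_lr):
    # Hypothetical stand-in for a real (short) distributed training run that
    # returns the validation loss; a synthetic score is used so this runs as-is.
    return (math.log10(peak_lr) + 3.5) ** 2 + 0.1 * abs(double_every - 5)

def objective(trial):
    # Search over the schedule parameters and the peak learning rate.
    base_batch_size = trial.suggest_categorical("base_batch_size", [16, 32, 64])
    double_every = trial.suggest_int("double_every", 2, 10)
    max_batch_size = trial.suggest_categorical("max_batch_size", [256, 512, 1024])
    peak_lr = trial.suggest_float("peak_lr", 1e-5, 1e-2, log=True)
    return train_and_evaluate(base_batch_size, double_every, max_batch_size, peak_lr)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```

The obvious concern is that each trial needs its own (possibly truncated) training run, so a small per-trial epoch budget or Optuna's pruners would probably be needed to keep the search cheaper than hand-tuning.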