start_index not getting reset in data loader when moving to new epoch #650

leon-g-xu · 2024-07-10T16:16:25Z

🐛 Describe the bug

When a training job resumes from a checkpoint, it resumes from the epoch and start_index saved in the checkpoint.
The start_index is set on the data loader.
However, start_index is not reset to 0 when the current epoch finishes and the next epoch starts, so the new epoch still reads data beginning at the old start_index and skips everything before it.

start_index loaded from checkpoint: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L377
how start_index is used in the data loader (and is never reset): https://github.com/allenai/OLMo/blob/main/olmo/data/iterable_dataset.py#L133-L135
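
To make the symptom concrete, here is a minimal sketch of the behavior described above. It does not use the OLMo code: `ToyDataset` and `run_epochs` are hypothetical stand-ins, and the only assumption is that the dataset skips every item before `start_index` each time it is iterated.

```python
class ToyDataset:
    """Hypothetical stand-in for an iterable dataset with a resume offset."""

    def __init__(self, num_items, start_index=0):
        self.num_items = num_items
        self.start_index = start_index

    def __iter__(self):
        # Skip everything before start_index, as when resuming mid-epoch.
        return iter(range(self.start_index, self.num_items))


def run_epochs(dataset, num_epochs):
    # Iterate the dataset once per epoch without ever touching start_index.
    return [list(dataset) for _ in range(num_epochs)]


# Resume at index 4 of 6, then continue into a second epoch.
seen = run_epochs(ToyDataset(num_items=6, start_index=4), num_epochs=2)
print(seen)  # [[4, 5], [4, 5]]
```

The second epoch sees only items 4 and 5 instead of all six, which is the skipped data this issue is about.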

Versions

olmo 0.3.0

leon-g-xu · 2024-07-10T23:39:53Z

One solution is to reset the start index to 0 when the next epoch begins. I am not sure whether there is a setting that I missed.
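
A rough sketch of that idea, again with hypothetical names (`ToyDataset`, `run_epochs_with_reset`) rather than the actual OLMo classes: the training loop clears the offset once the resumed epoch has been consumed, so every later epoch starts from the beginning.

```python
class ToyDataset:
    """Hypothetical stand-in for an iterable dataset with a resume offset."""

    def __init__(self, num_items, start_index=0):
        self.num_items = num_items
        self.start_index = start_index

    def __iter__(self):
        # Skip everything before start_index, as when resuming mid-epoch.
        return iter(range(self.start_index, self.num_items))


def run_epochs_with_reset(dataset, num_epochs):
    seen = []
    for _ in range(num_epochs):
        seen.append(list(dataset))
        # The resume offset only applies to the epoch that was interrupted,
        # so clear it before the next epoch begins.
        dataset.start_index = 0
    return seen


seen = run_epochs_with_reset(ToyDataset(num_items=6, start_index=4), num_epochs=2)
print(seen)  # [[4, 5], [0, 1, 2, 3, 4, 5]]
```

Only the first (resumed) epoch honors the checkpointed offset; the second reads the full dataset.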

AkshitaB · 2024-07-29T16:20:12Z

@epwalsh I believe you already fixed this. Can you confirm?

leon-g-xu · 2024-07-29T16:37:41Z

If this is already fixed, can you share the commit/PR that fixes this?

epwalsh · 2024-07-29T17:25:45Z

Yeup, fixed here: a3e2ea7

leon-g-xu added the type/bug label · Jul 10, 2024
