🐛 Describe the bug
When a training job resumes from a checkpoint, it resumes from the epoch and start_index saved in the checkpoint.
The start_index is set on the data loader's dataset.
However, start_index is not reset to 0 when the current epoch finishes and the next epoch starts, so the new epoch still reads data starting from the old start_index and skips everything before it.
Where start_index is loaded from the checkpoint: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L377
How start_index is used in the data loader (it is never reset): https://github.com/allenai/OLMo/blob/main/olmo/data/iterable_dataset.py#L133-L135
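To illustrate the behaviour described above, here is a minimal sketch that does not use OLMo itself. `ToyIterableDataset` and `run_epochs` are hypothetical names made up for this example; the class only assumes that the dataset slices its indices by `start_index` on every iteration, which is a simplification of the linked code. The `reset=True` branch shows one possible fix, not necessarily the one the maintainers would choose.

```python
from typing import Iterator, List


class ToyIterableDataset:
    """Hypothetical stand-in for a resumable dataset; not OLMo's actual class."""

    def __init__(self, num_items: int, start_index: int = 0) -> None:
        self.num_items = num_items
        self.start_index = start_index

    def __iter__(self) -> Iterator[int]:
        indices = list(range(self.num_items))
        # Skip items that were already consumed before the checkpoint.
        if self.start_index > 0:
            indices = indices[self.start_index:]
        return iter(indices)


def run_epochs(dataset: ToyIterableDataset, num_epochs: int, reset: bool) -> List[List[int]]:
    """Iterate for several epochs, optionally resetting start_index after each one."""
    seen = []
    for _ in range(num_epochs):
        seen.append(list(dataset))
        if reset:
            # Possible fix: the offset should only apply to the resumed epoch.
            dataset.start_index = 0
    return seen


# Resume mid-epoch at index 3 of a 5-item dataset and run two epochs.
buggy = run_epochs(ToyIterableDataset(5, start_index=3), num_epochs=2, reset=False)
fixed = run_epochs(ToyIterableDataset(5, start_index=3), num_epochs=2, reset=True)
print(buggy)  # [[3, 4], [3, 4]]
print(fixed)  # [[3, 4], [0, 1, 2, 3, 4]]
```

Without the reset, the second epoch repeats the truncated range and items 0 to 2 are never seen again. With the reset, only the resumed epoch is shortened.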
Versions
olmo 0.3.0