For medium-to-large models, if the user doesn't have enough disk space (or, more commonly, has accidentally specified a path on a volume without enough disk space), we train for a full "epoch" and then crash while saving the checkpoint. It would be nice to do one of the following:
Option 1: Save a dummy checkpoint at the very start, before training. If this succeeds, assume that future checkpoints will also succeed when `--delete-previous-checkpoint` is specified. Additionally, we could check whether `num_checkpoints * size(initial checkpoint)` bytes of disk space remain when `--delete-previous-checkpoint` is not specified, but this is not necessary.
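A minimal sketch of what Option 1 could look like, assuming a PyTorch-style `model`/`optimizer` and a hypothetical `verify_checkpoint_writable` helper (names are illustrative, not the project's actual API):

```python
import os
import torch

def verify_checkpoint_writable(model, optimizer, checkpoint_dir):
    """Write a throwaway checkpoint before training so disk-space problems
    surface immediately instead of after a full epoch."""
    probe_path = os.path.join(checkpoint_dir, "checkpoint_probe.pt")
    try:
        torch.save(
            {"state_dict": model.state_dict(), "optimizer": optimizer.state_dict()},
            probe_path,
        )
    except OSError as e:
        raise RuntimeError(
            f"Unable to write a checkpoint to {checkpoint_dir}: {e}. "
            "Check free disk space on the checkpoint volume."
        ) from e
    probe_size = os.path.getsize(probe_path)
    os.remove(probe_path)
    # The probe size could feed the optional num_checkpoints * size check.
    return probe_size
```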
Option 2: Estimate the checkpoint size from the number of parameters and check whether there is enough disk space for the number of checkpoints requested.
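And a rough sketch of Option 2, assuming the checkpoint holds the weights plus Adam-style optimizer state (the `optimizer_factor` multiplier is an assumption and would need tuning for other optimizers):

```python
import shutil
import torch

def check_disk_space(model, checkpoint_dir, num_checkpoints=1, optimizer_factor=2.0):
    """Estimate total checkpoint footprint and compare it to free space."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    # Weights plus estimated optimizer state, per checkpoint.
    estimated_checkpoint_bytes = int(param_bytes * (1.0 + optimizer_factor))
    required = estimated_checkpoint_bytes * num_checkpoints
    free = shutil.disk_usage(checkpoint_dir).free
    if free < required:
        raise RuntimeError(
            f"Estimated checkpoint footprint {required / 1e9:.1f} GB exceeds "
            f"free space {free / 1e9:.1f} GB on {checkpoint_dir}."
        )
```

Option 1 is simpler and exact but costs one extra save; Option 2 avoids the write but depends on how well the estimate matches the real serialized size.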