Thank you for developing such a useful package. I have encountered an issue with rerunning a cancelled experiment (XP). Specifically, I'm using dora grid baseline --clear with the expectation that the experiment would start from scratch. Initially, it appears to work as the history.json file is deleted. However, as the process continues, the previous history reappears before the current training starts, causing the training to resume from the last cancellation point.
This issue does not occur if the previous experiment completes successfully; in those cases, the --clear option works as expected. Could you advise on how to ensure that a cancelled experiment restarts completely from scratch when rerun?
Thank you for your assistance.
This might happen if the previous experiment is still running and was not properly cancelled. Could you double check that the previous XP is not still running on the cluster? It is normally cancelled, but depending on the Slurm configuration, the experiment may continue running for 1 or 2 minutes.
One solution would be to first run dora grid baseline -C to cancel all the experiments, wait until they are indeed gone from the cluster, and then run the --clear command.
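The cancel, wait, then clear sequence could be scripted roughly as follows. This is only a sketch under a few assumptions: the helper name `cancel_wait_clear` is made up for illustration, `squeue -h -u <user>` is used to detect remaining jobs (which lists all of that user's jobs, not only those of this grid), and the polling interval and timeout are arbitrary. Only the two `dora grid baseline` commands come from the discussion above.

```python
import getpass
import subprocess
import time
from typing import Callable, Optional, Sequence


def default_run(cmd: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return result.stdout


def cancel_wait_clear(
    grid: str = "baseline",
    user: Optional[str] = None,
    poll_seconds: float = 10.0,   # arbitrary choice
    timeout: float = 300.0,       # arbitrary choice
    run: Callable[[Sequence[str]], str] = default_run,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Hypothetical helper: cancel the grid, wait for the queue to drain, then clear."""
    user = user or getpass.getuser()

    # Step 1: cancel all experiments of the grid.
    run(["dora", "grid", grid, "-C"])

    # Step 2: poll Slurm until no job of this user is listed any more.
    # Assumption: the user has no unrelated jobs in the queue.
    waited = 0.0
    while run(["squeue", "-h", "-u", user]).strip():
        if waited >= timeout:
            raise TimeoutError("jobs are still listed by squeue after the timeout")
        sleep(poll_seconds)
        waited += poll_seconds

    # Step 3: only now relaunch from scratch.
    run(["dora", "grid", grid, "--clear"])
```

Calling `cancel_wait_clear("baseline")` would then run the three steps in order. Since the sketch captures the commands' output, it may be preferable to run the final `--clear` command directly in a terminal if you want to see what it prints.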