Describe the bug
Training step time increases linearly with each training step.
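(Step time here means the wall-clock time of one full training iteration; a minimal sketch of the measurement, where `train_step` and `data_iter` are hypothetical stand-ins for the actual training loop:)

```python
# Sketch: time each training iteration to make the linear growth visible.
# `train_step` and `data_iter` are placeholders, not the real loop.
import time

def timed_train_steps(train_step, data_iter, num_steps):
    for step in range(num_steps):
        start = time.perf_counter()
        train_step(next(data_iter))
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print(f"iteration {step}: {elapsed_ms:.1f} ms")
```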
To Reproduce
Steps to reproduce the behavior:
My config for running training was: config.yml.txt
Slurm was used for launching; the script run by sbatch was: launch.sh.txt
Expected behavior
Runtime per train step should stay roughly constant; instead, it increases linearly with the training step.
[Plot: optimizer step time (ms) vs. iteration, data parsed from the training logs]
[Plot: forward step time (ms) vs. iteration]
[Plot: samples per second vs. iteration]
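For reference, a minimal sketch of how such timings can be extracted from the log and plotted (the regex keys `forward` and `optimizer_step` are assumptions about the log line format and may need adjusting to match slurm3.txt):

```python
# Sketch: pull per-iteration timings out of a training log and plot them.
# Assumes timing lines containing fields like "forward: 56.78" and
# "optimizer_step: 12.34" (in ms); adjust the patterns to the real format.
import re
import matplotlib.pyplot as plt

forward_ms, optimizer_ms = [], []
with open("slurm3.txt") as log:
    for line in log:
        m = re.search(r"forward:\s*([\d.]+)", line)
        if m:
            forward_ms.append(float(m.group(1)))
        m = re.search(r"optimizer_step:\s*([\d.]+)", line)
        if m:
            optimizer_ms.append(float(m.group(1)))

plt.plot(forward_ms, label="forward (ms)")
plt.plot(optimizer_ms, label="optimizer step (ms)")
plt.xlabel("iteration")
plt.ylabel("time (ms)")
plt.legend()
plt.savefig("step_times.png")  # samples/sec can be plotted the same way
```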
Environment:
Running code on the LUMI supercomputer.
GPUs: Single node run with 8x AMD MI250X GPUs.
Pip list: pip-list.txt
Python version: 3.6.15
DeeperSpeed was also modified to wrap each launched training process in a Singularity container via srun.
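(A minimal illustration of the kind of wrapper meant here; the container path and command are hypothetical, and the real change lives inside DeeperSpeed's launcher:)

```python
# Sketch: wrap a launched training process in a Singularity container
# under srun. All paths and flags below are assumptions, not the real code.
import subprocess

CONTAINER_IMAGE = "/path/to/container.sif"  # hypothetical image path

def launch_in_container(train_cmd):
    # Prefix the command so it runs inside the container via srun.
    return ["srun", "--ntasks=1", "singularity", "exec",
            CONTAINER_IMAGE] + list(train_cmd)

subprocess.run(launch_in_container(["python", "train.py"]), check=True)
```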
Update: the issue did not reproduce with the same config on a different server with NVIDIA L40s, so it appears to be environment-specific.
Additional info:
Full training log: slurm3.txt
Any ideas what might be the issue?