Describe the bug
Training step time increases linearly with each training step.
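(Step time here means the wall-clock time of one full training iteration; a minimal sketch of the measurement, where `train_step` and `data_iter` are hypothetical stand-ins for the actual training loop:)

```python
# Sketch: time each training iteration to make the linear growth visible.
# `train_step` and `data_iter` are placeholders, not the real loop.
import time

def timed_train_steps(train_step, data_iter, num_steps):
    for step in range(num_steps):
        start = time.perf_counter()
        train_step(next(data_iter))
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print(f"iteration {step}: {elapsed_ms:.1f} ms")
```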
To Reproduce
Steps to reproduce the behavior:
My config for running training was: config.yml.txt
Slurm was used for launching; the script run by sbatch was: launch.sh.txt
Expected behavior
Runtime per train step should stay roughly constant; instead, it increases linearly with the training step.
[Plot: optimizer step time (ms) vs. iteration, data parsed from the training logs]
[Plot: forward step time (ms) vs. iteration]
[Plot: samples per second vs. iteration]
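For reference, a minimal sketch of how such timings can be extracted from the log and plotted (the regex keys `forward` and `optimizer_step` are assumptions about the log line format and may need adjusting to match slurm3.txt):

```python
# Sketch: pull per-iteration timings out of a training log and plot them.
# Assumes timing lines containing fields like "forward: 56.78" and
# "optimizer_step: 12.34" (in ms); adjust the patterns to the real format.
import re
import matplotlib.pyplot as plt

forward_ms, optimizer_ms = [], []
with open("slurm3.txt") as log:
    for line in log:
        m = re.search(r"forward:\s*([\d.]+)", line)
        if m:
            forward_ms.append(float(m.group(1)))
        m = re.search(r"optimizer_step:\s*([\d.]+)", line)
        if m:
            optimizer_ms.append(float(m.group(1)))

plt.plot(forward_ms, label="forward (ms)")
plt.plot(optimizer_ms, label="optimizer step (ms)")
plt.xlabel("iteration")
plt.ylabel("time (ms)")
plt.legend()
plt.savefig("step_times.png")  # samples/sec can be plotted the same way
```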
Environment:
Running code on the LUMI supercomputer.
GPUs: Single node run with 8x AMD MI250X GPUs.
Pip list: pip-list.txt
Python version: 3.6.15
DeeperSpeed was also modified to wrap each launched training process in a Singularity container via srun.
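(A minimal illustration of the kind of wrapper meant here; the container path and command are hypothetical, and the real change lives inside DeeperSpeed's launcher:)

```python
# Sketch: wrap a launched training process in a Singularity container
# under srun. All paths and flags below are assumptions, not the real code.
import subprocess

CONTAINER_IMAGE = "/path/to/container.sif"  # hypothetical image path

def launch_in_container(train_cmd):
    # Prefix the command so it runs inside the container via srun.
    return ["srun", "--ntasks=1", "singularity", "exec",
            CONTAINER_IMAGE] + list(train_cmd)

subprocess.run(launch_in_container(["python", "train.py"]), check=True)
```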
Update: the issue did not reproduce with the same config on a different server with NVIDIA L40s, so it appears to be environment-specific.
Additional info:
Full training log: slurm3.txt
Any ideas what might be the issue?