
Encountering "Expected to have finished reduction in the prior iteration before starting a new one" Error During Training #32

Open
LemonWei111 opened this issue Feb 14, 2025 · 4 comments

Comments

@LemonWei111 commented Feb 14, 2025

I'm encountering an issue while attempting to fine-tune a model on my own dataset. I run the following command:
torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_l_coco.yml --use-amp --seed 42 -t deim_dfine_hgnetv2_l_coco_50e.pth

The error message I receive is as follows:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/env1/DEIM/train.py", line 95, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/env1/DEIM/train.py", line 65, in main
[rank0]:     solver.fit()
[rank0]:   File "/home/env1/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank0]:     train_stats = train_one_epoch(
[rank0]:   File "/home/env1/DEIM/engine/solver/det_engine.py", line 58, in train_one_epoch
[rank0]:     outputs = model(samples, targets=targets)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank0]: making sure all `forward` function outputs participate in calculating loss. 
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 384
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
E0214 10:18:44.597429 2431079 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2431105) of binary: /home/anaconda3/envs/env1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-14_10:18:44
  host      : llmserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2431105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I've tried setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL to get more information, but haven't found a clear solution.

I suspect this might be due to the complexity of the DEIM model's forward pass return values, which may prevent DDP from correctly tracking gradient updates for all parameters. If anyone has encountered a similar problem or has any suggestions, your help would be greatly appreciated!
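
For reference, the workaround suggested by the error message itself would look roughly like the sketch below. wrap_model is a placeholder helper, since the exact place where DEIM wraps the model in DistributedDataParallel isn't shown here; this is not necessarily the root-cause fix.

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets the DDP reducer tolerate parameters that
    # do not contribute to the loss in a given iteration, at the cost of an
    # extra traversal of the autograd graph every step.
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)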

@LemonWei111 (Author)

Meanwhile, on this custom dataset, the model began converging to a local optimum around the 3rd epoch, and training progress almost stopped.

@LemonWei111 (Author)

A specific parameter (decoder.denoising_class_embed.weight) does not receive gradients.
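
A minimal way to confirm this (with `model` and `loss` as placeholders for the corresponding objects in the training loop) is to inspect `.grad` after `backward()` on a single training step, before the optimizer update:

loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        # Parameters listed here did not participate in the loss this step,
        # e.g. decoder.denoising_class_embed.weight above.
        print("no grad:", name)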

@mirza298

@LemonWei111 Have you solved the issue? I have the same problem now.

@ShihuaHuang95 (Owner) commented Mar 4, 2025

Hi, thank you so much for your interest in our work!
In my experience, this kind of issue typically arises because some parameters are not involved in gradient computation.
Please confirm two things:

  1. Have you made any modifications to the DEIM model, including adding new modules?
  2. Is the number of classes aligned? For example, you may be fine-tuning on a custom dataset with only 10 categories while still using the COCO default of 80 classes from the COCO config (a quick way to check this is sketched after this comment).

If you've confirmed that no changes have been made in either area, could you try training on COCO and see whether the issue persists?
We've never encountered this problem with this code. If it still occurs, please provide more details so we can work together to resolve it.
