
Encountering "Expected to have finished reduction in the prior iteration before starting a new one" Error During Training #32

Open
LemonWei111 opened this issue Feb 14, 2025 · 4 comments

Comments

@LemonWei111 commented Feb 14, 2025

I'm encountering an issue while attempting to fine-tune a model on my own dataset. I run the following command:
torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_l_coco.yml --use-amp --seed 42 -t deim_dfine_hgnetv2_l_coco_50e.pth

The error message I receive is as follows:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/env1/DEIM/train.py", line 95, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/env1/DEIM/train.py", line 65, in main
[rank0]:     solver.fit()
[rank0]:   File "/home/env1/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank0]:     train_stats = train_one_epoch(
[rank0]:   File "/home/env1/DEIM/engine/solver/det_engine.py", line 58, in train_one_epoch
[rank0]:     outputs = model(samples, targets=targets)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]:     if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
[rank0]: making sure all `forward` function outputs participate in calculating loss. 
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 384
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
E0214 10:18:44.597429 2431079 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2431105) of binary: /home/anaconda3/envs/env1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/env1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-14_10:18:44
  host      : llmserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2431105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I've tried setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL to get more information, but haven't found a clear solution.

I suspect this might be due to the complexity of the DEIM model's forward pass return values, which may prevent DDP from correctly tracking gradient updates for all parameters. If anyone has encountered a similar problem or has any suggestions, your help would be greatly appreciated!
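
For reference, the workaround suggested by the error message itself would look roughly like the sketch below. wrap_model is a placeholder helper, since the exact place where DEIM wraps the model in DistributedDataParallel isn't shown here; this is not necessarily the root-cause fix.

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets the DDP reducer tolerate parameters that
    # do not contribute to the loss in a given iteration, at the cost of an
    # extra traversal of the autograd graph every step.
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)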

@LemonWei111 (Author)

Meanwhile, on this custom dataset, the model began converging to a local optimum around the 3rd epoch, and training progress almost stopped.

@LemonWei111 (Author)

A specific parameter (decoder.denoising_class_embed.weight) does not receive gradients.
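
A minimal way to confirm this (with `model` and `loss` as placeholders for the corresponding objects in the training loop) is to inspect `.grad` after `backward()` on a single training step, before the optimizer update:

loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        # Parameters listed here did not participate in the loss this step,
        # e.g. decoder.denoising_class_embed.weight above.
        print("no grad:", name)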

@mirza298

@LemonWei111 Have you solved the issue? I have the same problem now.

@ShihuaHuang95 (Owner) commented Mar 4, 2025

Hi, thank you so much for your interest in our work!
In my experience, this kind of issue typically arises because some parameters are not involved in gradient computation.
Please confirm two things:

  1. Have you made any modifications to the DEIM model, including adding new modules?
  2. Is the number of classes aligned? For example, you may be fine-tuning on a custom dataset with only 10 categories while still using the COCO default of 80 classes from the COCO config (a quick way to check this is sketched after this comment).

If you've confirmed that no changes have been made in either area, could you try training on COCO and see whether the issue persists?
We've never encountered this problem with this code. If it still occurs, please provide more details so we can work together to resolve it.
