I'm encountering an issue while attempting to finetune a model on my dataset. I run the following command:
torchrun --master_port=7777 --nproc_per_node=1 train.py -c configs/deim_dfine/deim_hgnetv2_l_coco.yml --use-amp --seed 42 -t deim_dfine_hgnetv2_l_coco_50e.pth
The error message I receive is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/env1/DEIM/train.py", line 95, in <module>
[rank0]: main(args)
[rank0]: File "/home/env1/DEIM/train.py", line 65, in main
[rank0]: solver.fit()
[rank0]: File "/home/env1/DEIM/engine/solver/det_solver.py", line 76, in fit
[rank0]: train_stats = train_one_epoch(
[rank0]: File "/home/env1/DEIM/engine/solver/det_engine.py", line 58, in train_one_epoch
[rank0]: outputs = model(samples, targets=targets)
[rank0]: File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]: File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]: if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 384
[rank0]: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
E0214 10:18:44.597429 2431079 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2431105) of binary: /home/anaconda3/envs/env1/bin/python
Traceback (most recent call last):
File "/home/anaconda3/envs/env1/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/anaconda3/envs/env1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-14_10:18:44
host : llmserver
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2431105)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I've set the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL to get more information, but it hasn't pointed me to a clear solution.
I suspect this might be due to the complexity of the DEIM model's forward pass return values, which may prevent DDP from correctly tracking gradient updates for all parameters. If anyone has encountered a similar problem or has any suggestions, your help would be greatly appreciated!
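For reference, the workaround suggested by the error message itself would look roughly like this. This is only a generic sketch of wrapping a model in DistributedDataParallel with unused-parameter detection enabled, not DEIM's actual solver code; in this repo the DDP wrap happens inside the engine code, so the flag would need to be added there.

```python
# Generic sketch: enable unused-parameter detection when wrapping a model in
# DDP, as the RuntimeError recommends. Assumes torchrun has already
# initialized the process group; this is NOT DEIM's actual solver code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    return DDP(
        model.to(local_rank),
        device_ids=[local_rank],
        find_unused_parameters=True,  # tolerate parameters that receive no gradient
    )
```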
Hi, thank you so much for your interest in our work!
In my experience, this kind of issue typically arises because some parameters are not involved in gradient computation.
Please confirm two things:
1. Have you made any modifications to the DEIM model, such as adding new modules?
2. Does the number of classes match your dataset? For example, if you're finetuning on a custom dataset with only 10 categories but kept the default COCO setting of 80 classes, that mismatch can leave parameters unused.
If neither of these applies, could you try training on COCO and check whether the issue persists?
We’ve never encountered this problem with this code. If it still occurs, please provide more details so we can work together to resolve it.
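In the meantime, a small diagnostic sketch like the one below can help map "parameter index 384" from the error back to a parameter name. It assumes the model can be instantiated outside of DDP (e.g., with the same config loader train.py uses) and that DDP's parameter indices follow the order of named_parameters() restricted to requires_grad=True, which is the usual case but not a documented guarantee.

```python
# Diagnostic sketch: map DDP's reported "parameter index 384" to a name.
# Assumptions: the model is built outside of DDP, and the index ordering
# matches named_parameters() filtered to requires_grad=True (typical, not
# guaranteed). TORCH_DISTRIBUTED_DEBUG=DETAIL should print the same name
# directly; this is just an alternative way to inspect it.
import torch

def describe_ddp_param(model: torch.nn.Module, index: int = 384) -> None:
    trainable = [(name, p) for name, p in model.named_parameters() if p.requires_grad]
    name, param = trainable[index]
    print(f"parameter #{index}: {name}, shape={tuple(param.shape)}")
```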