I am trying to continue training the pre-trained FullSubNet model provided by this repo (fullsubnet_best_model_58epochs.tar). I can confirm the checkpoint works for inference. However, when I resume training, loading the state dictionary fails because of how the checkpoint was saved.
Here is the error in full:
```
(FullSubNet) $ torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py -C fullsubnet/train.toml -R
1 process initialized.
Traceback (most recent call last):
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/train.py", line 99, in <module>
    entry(local_rank, configuration, args.resume, args.only_validation)
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/train.py", line 59, in entry
    trainer = trainer_class(
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/fullsubnet/trainer.py", line 17, in __init__
    super().__init__(dist, rank, config, resume, only_validation, model, loss_function, optimizer)
  File "/home/github/FullSubNet/audio_zen/trainer/base_trainer.py", line 84, in __init__
    self._resume_checkpoint()
  File "/home/github/FullSubNet/audio_zen/trainer/base_trainer.py", line 153, in _resume_checkpoint
    self.scaler.load_state_dict(checkpoint["scaler"])
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 502, in load_state_dict
    raise RuntimeError("The source state dict is empty, possibly because it was saved "
RuntimeError: The source state dict is empty, possibly because it was saved from a disabled instance of GradScaler.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1822537) of binary: /home/anaconda3/envs/FullSubNet/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/FullSubNet/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-19_21:25:20
  host      : host-server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1822537)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
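From the traceback, the resume fails in `_resume_checkpoint` (audio_zen/trainer/base_trainer.py, line 153) when it tries to restore the AMP `GradScaler` state: the checkpoint's scaler entry appears to be an empty dict, which is what a disabled `GradScaler` serializes to. As a rough sketch of the kind of local workaround I had in mind (my own guess, not something this repo provides), the scaler entry could simply be skipped when it is empty:

```python
from torch.cuda.amp import GradScaler

# A disabled GradScaler serializes to an empty dict, which is exactly what
# load_state_dict() rejects with the RuntimeError shown above.
checkpoint = {"scaler": GradScaler(enabled=False).state_dict()}  # -> {}

scaler = GradScaler()
if checkpoint.get("scaler"):
    # Restore the scaler only when the checkpoint actually carries AMP state.
    scaler.load_state_dict(checkpoint["scaler"])
else:
    # Otherwise keep the freshly initialized scaler and continue training.
    print("No scaler state in checkpoint; starting GradScaler from scratch.")
```

I would rather not patch the trainer blindly, though, so: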
Are there specific modifications that need to be made to continue training?
Thank you for your help.
Just following up on this. Without a way to resume from the released checkpoint, we may have to train from scratch using your workflow to obtain a comparable baseline. That is workable, but when comparing our results against your paper, a model trained from scratch may not match the quality of fine-tuning your pre-trained model.
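In case it helps narrow this down, a quick way to confirm whether the scaler state in the released checkpoint really is empty would be to inspect it directly (a minimal sketch, assuming the .tar is a plain torch.save archive as the trainer expects):

```python
import torch

# Load the released checkpoint on CPU and inspect what was saved alongside
# the model weights.
ckpt = torch.load("fullsubnet_best_model_58epochs.tar", map_location="cpu")

print(list(ckpt.keys()))
# If the scaler was disabled when the checkpoint was written, this should
# print an empty dict, matching the RuntimeError raised on resume.
print(ckpt.get("scaler"))
```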