I am trying to continue training the pre-trained FullSubNet model provided by this repo (fullsubnet_best_model_58epochs.tar). I can confirm the checkpoint works for inference. However, when I resume training, loading the state dictionary fails because of how the checkpoint was saved.
Here is the error in full:
```
(FullSubNet) $ torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py -C fullsubnet/train.toml -R
1 process initialized.
Traceback (most recent call last):
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/train.py", line 99, in <module>
    entry(local_rank, configuration, args.resume, args.only_validation)
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/train.py", line 59, in entry
    trainer = trainer_class(
  File "/home/github/FullSubNet/recipes/dns_interspeech_2020/fullsubnet/trainer.py", line 17, in __init__
    super().__init__(dist, rank, config, resume, only_validation, model, loss_function, optimizer)
  File "/home/github/FullSubNet/audio_zen/trainer/base_trainer.py", line 84, in __init__
    self._resume_checkpoint()
  File "/home/github/FullSubNet/audio_zen/trainer/base_trainer.py", line 153, in _resume_checkpoint
    self.scaler.load_state_dict(checkpoint["scaler"])
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 502, in load_state_dict
    raise RuntimeError("The source state dict is empty, possibly because it was saved "
RuntimeError: The source state dict is empty, possibly because it was saved from a disabled instance of GradScaler.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1822537) of binary: /home/anaconda3/envs/FullSubNet/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/FullSubNet/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/FullSubNet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-19_21:25:20
  host      : host-server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1822537)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
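From the traceback, the resume fails in `_resume_checkpoint` (audio_zen/trainer/base_trainer.py, line 153) when it tries to restore the AMP `GradScaler` state: the checkpoint's scaler entry appears to be an empty dict, which is what a disabled `GradScaler` serializes to. As a rough sketch of the kind of local workaround I had in mind (my own guess, not something this repo provides), the scaler entry could simply be skipped when it is empty:

```python
from torch.cuda.amp import GradScaler

# A disabled GradScaler serializes to an empty dict, which is exactly what
# load_state_dict() rejects with the RuntimeError shown above.
checkpoint = {"scaler": GradScaler(enabled=False).state_dict()}  # -> {}

scaler = GradScaler()
if checkpoint.get("scaler"):
    # Restore the scaler only when the checkpoint actually carries AMP state.
    scaler.load_state_dict(checkpoint["scaler"])
else:
    # Otherwise keep the freshly initialized scaler and continue training.
    print("No scaler state in checkpoint; starting GradScaler from scratch.")
```

I would rather not patch the trainer blindly, though, so: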
Are there specific modifications that need to be made to continue training?
Thank you for your help.
Just following up on this. Without a way to resume from the released checkpoint, we may have to train from scratch using your workflow to obtain a comparable baseline. That is workable, but when comparing our results against your paper, a model trained from scratch may not match the quality of fine-tuning your pre-trained model.
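In case it helps narrow this down, a quick way to confirm whether the scaler state in the released checkpoint really is empty would be to inspect it directly (a minimal sketch, assuming the .tar is a plain torch.save archive as the trainer expects):

```python
import torch

# Load the released checkpoint on CPU and inspect what was saved alongside
# the model weights.
ckpt = torch.load("fullsubnet_best_model_58epochs.tar", map_location="cpu")

print(list(ckpt.keys()))
# If the scaler was disabled when the checkpoint was written, this should
# print an empty dict, matching the RuntimeError raised on resume.
print(ckpt.get("scaler"))
```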