AssertionError: Default process group is not initialized #510

Open
drfinkus opened this issue Nov 8, 2020 · 1 comment

drfinkus commented Nov 8, 2020

I am trying to integrate DeepSpeed with pytorch-lightning and have reached a point where I can't see the next step. The error trace is below.

AssertionError                            Traceback (most recent call last)
<ipython-input-3-a0ebb7b60e39> in <module>
----> 1 ai.train("data.txt", num_steps=5000, save_every=1000,save_gdrive=False,learning_rate=1e-4,batch_size=1)

~/miniconda3/envs/ait2/lib/python3.7/site-packages/aitextgen-0.2.3-py3.7.egg/aitextgen/aitextgen.py in train(self, train_data, output_dir, fp16, fp16_opt_level, n_gpu, n_tpu_cores, max_grad_norm, gradient_accumulation_steps, seed, learning_rate, weight_decay, adam_epsilon, warmup_steps, num_steps, save_every, generate_every, n_generate, loggers, batch_size, num_workers, benchmark, avg_loss_smoothing, save_gdrive, run_id, progress_bar_refresh_rate, **kwargs)
    570 
    571         trainer = pl.Trainer(**train_params)
--> 572         trainer.fit(train_model)
    573 
    574         logger.info(f"Saving trained model pytorch_model.bin to /{output_dir}")

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    977 
    978         elif self.single_gpu:
--> 979             self.single_gpu_train(model)
    980 
    981         elif self.use_tpu:  # pragma: no-cover

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    174         # CHOOSE OPTIMIZER
    175         # allow for lr schedulers as well
--> 176         self.optimizers, self.lr_schedulers, self.optimizer_frequencies = self.init_optimizers(model)
    177 
    178         # TODO: remove with dropping NVIDIA AMP support

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py in init_optimizers(self, model)
     16             model: LightningModule
     17     ) -> Tuple[List, List, List]:
---> 18         optim_conf = model.configure_optimizers()
     19 
     20         if optim_conf is None:

~/miniconda3/envs/ait2/lib/python3.7/site-packages/aitextgen-0.2.3-py3.7.egg/aitextgen/train.py in configure_optimizers(self)
    117                 optimizer=optimizer,
    118                 lr_scheduler=scheduler,
--> 119                 dist_init_required=False
    120             )
    121 

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    119                                  dist_init_required=dist_init_required,
    120                                  collate_fn=collate_fn,
--> 121                                  config_params=config_params)
    122     else:
    123         assert mpu is None, "mpu must be None with pipeline parallelism"

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    155 
    156         # Configure distributed model
--> 157         self._configure_distributed_model(model)
    158 
    159         # Configure wall clock timer

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_distributed_model(self, model)
    487 
    488         if self.mpu is None:
--> 489             self.data_parallel_group = _initialize_parameter_parallel_groups()
    490             self.dp_world_size = dist.get_world_size()
    491             self.mp_world_size = 1

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _initialize_parameter_parallel_groups(parameter_parallel_size)
     71 
     72 def _initialize_parameter_parallel_groups(parameter_parallel_size=None):
---> 73     data_parallel_size = int(dist.get_world_size())
     74     if parameter_parallel_size is None:
     75         parameter_parallel_size = int(data_parallel_size)

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in get_world_size(group)
    623         return -1
    624 
--> 625     return _get_group_size(group)
    626 
    627 

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_group_size(group)
    218     """
    219     if group is GroupMember.WORLD:
--> 220         _check_default_pg()
    221         return _default_pg.size()
    222     if group not in _pg_group_ranks:

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _check_default_pg()
    209     """
    210     assert _default_pg is not None, \
--> 211         "Default process group is not initialized"
    212 
    213 

AssertionError: Default process group is not initialized

It seems that a call to init_process_group is missing somewhere, but I can't be sure.
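
For context, here is a minimal sketch of what I mean (my own assumption, not aitextgen or DeepSpeed code): creating a single-process default group by hand before calling deepspeed.initialize with dist_init_required=False. The backend, address, and port are placeholders.

    import torch.distributed as dist

    if not dist.is_initialized():
        # Single-process "world": rank 0 of world size 1, rendezvous over localhost.
        # "nccl" assumes a GPU is available; "gloo" would work for CPU-only runs.
        dist.init_process_group(
            backend="nccl",
            init_method="tcp://127.0.0.1:29500",
            rank=0,
            world_size=1,
        )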

drfinkus commented Nov 8, 2020

Some more information: I am initializing the model as follows:

    self.model, optimizer, _, scheduler = deepspeed.initialize(
        args=args,
        model=self.model,
        optimizer=optimizer,
        lr_scheduler=scheduler,
        dist_init_required=False
    )

If I comment out the dist_init_required flag, I get the following error:

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

It seems to be related to how the model is initialized, but I'm not sure what I'm doing wrong.
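
One workaround I'm considering (again, an assumption on my part rather than documented aitextgen behaviour) is to leave dist_init_required at its default and instead set the environment variables that the env:// rendezvous reads before calling deepspeed.initialize, e.g.:

    import os

    # Values for a single local process; the port just needs to be free.
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("LOCAL_RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")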
