I am trying to integrate DeepSpeed with pytorch-lightning and have reached a point where I can't see the next step. The error trace is below.
AssertionError                            Traceback (most recent call last)
<ipython-input-3-a0ebb7b60e39> in <module>
----> 1 ai.train("data.txt", num_steps=5000, save_every=1000, save_gdrive=False, learning_rate=1e-4, batch_size=1)

~/miniconda3/envs/ait2/lib/python3.7/site-packages/aitextgen-0.2.3-py3.7.egg/aitextgen/aitextgen.py in train(self, train_data, output_dir, fp16, fp16_opt_level, n_gpu, n_tpu_cores, max_grad_norm, gradient_accumulation_steps, seed, learning_rate, weight_decay, adam_epsilon, warmup_steps, num_steps, save_every, generate_every, n_generate, loggers, batch_size, num_workers, benchmark, avg_loss_smoothing, save_gdrive, run_id, progress_bar_refresh_rate, **kwargs)
    570
    571         trainer = pl.Trainer(**train_params)
--> 572         trainer.fit(train_model)
    573
    574         logger.info(f"Saving trained model pytorch_model.bin to /{output_dir}")

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    977
    978         elif self.single_gpu:
--> 979             self.single_gpu_train(model)
    980
    981         elif self.use_tpu:  # pragma: no-cover

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py in single_gpu_train(self, model)
    174         # CHOOSE OPTIMIZER
    175         # allow for lr schedulers as well
--> 176         self.optimizers, self.lr_schedulers, self.optimizer_frequencies = self.init_optimizers(model)
    177
    178         # TODO: remove with dropping NVIDIA AMP support

~/miniconda3/envs/ait2/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py in init_optimizers(self, model)
     16             model: LightningModule
     17     ) -> Tuple[List, List, List]:
---> 18         optim_conf = model.configure_optimizers()
     19
     20         if optim_conf is None:

~/miniconda3/envs/ait2/lib/python3.7/site-packages/aitextgen-0.2.3-py3.7.egg/aitextgen/train.py in configure_optimizers(self)
    117             optimizer=optimizer,
    118             lr_scheduler=scheduler,
--> 119             dist_init_required=False
    120         )
    121

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    119                                  dist_init_required=dist_init_required,
    120                                  collate_fn=collate_fn,
--> 121                                  config_params=config_params)
    122     else:
    123         assert mpu is None, "mpu must be None with pipeline parallelism"

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    155
    156         # Configure distributed model
--> 157         self._configure_distributed_model(model)
    158
    159         # Configure wall clock timer

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_distributed_model(self, model)
    487
    488         if self.mpu is None:
--> 489             self.data_parallel_group = _initialize_parameter_parallel_groups()
    490             self.dp_world_size = dist.get_world_size()
    491             self.mp_world_size = 1

~/miniconda3/envs/ait2/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _initialize_parameter_parallel_groups(parameter_parallel_size)
     71
     72 def _initialize_parameter_parallel_groups(parameter_parallel_size=None):
---> 73     data_parallel_size = int(dist.get_world_size())
     74     if parameter_parallel_size is None:
     75         parameter_parallel_size = int(data_parallel_size)

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in get_world_size(group)
    623         return -1
    624
--> 625     return _get_group_size(group)
    626
    627

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_group_size(group)
    218     """
    219     if group is GroupMember.WORLD:
--> 220         _check_default_pg()
    221         return _default_pg.size()
    222     if group not in _pg_group_ranks:

~/miniconda3/envs/ait2/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _check_default_pg()
    209     """
    210     assert _default_pg is not None, \
--> 211         "Default process group is not initialized"
    212

AssertionError: Default process group is not initialized
It seems that a call to torch.distributed.init_process_group is missing somewhere, but I can't be sure.
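For what it's worth, here is a minimal sketch of what I think is missing for my single-GPU, single-process setup. This is only my guess; the backend, address, and port are placeholders I chose, not values from any documentation.

import torch.distributed as dist

# Sketch only: create the default process group up front so that
# deepspeed.initialize(..., dist_init_required=False) finds it already
# initialized. rank=0 / world_size=1 matches a single-process run; the
# tcp:// address and port are arbitrary placeholders.
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=0,
        world_size=1,
    )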
More information: I am initializing the model as follows:
self.model, optimizer, _, scheduler = deepspeed.initialize(
    args=args,
    model=self.model,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    dist_init_required=False
)
If I comment out the dist_init_required flag, I get the following error:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
It seems to be related to how the model is initialized, but I'm not sure what I am doing wrong.
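If I read that ValueError correctly, leaving dist_init_required at its default makes DeepSpeed call torch.distributed.init_process_group with the env:// rendezvous itself, so variables like RANK would have to be set before deepspeed.initialize() runs. A rough sketch of what I would try for a single-process run; the values below are my assumption, since normally a distributed launcher exports them for each worker:

import os

# Assumed values for a single-process, single-GPU run; the deepspeed CLI or
# torch.distributed.launch would normally export these per worker process.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"  # any free port
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "0"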