ValueError: Default process group has not been initialized, please make sure to call init_process_group #55

Open
MKaczkow opened this issue Apr 18, 2024 · 4 comments

Comments

@MKaczkow

MKaczkow commented Apr 18, 2024

First of all, thanks for providing this code 😄

tl;dr

I am getting a ValueError when trying to run eval on the iNat21 dataset with python -m evals.main --fname configs/evals/vitl16_inat.yaml --devices cuda:0, and I am running out of ideas on how to fix it.

Config values

  • I am trying to run eval on a single GPU on a local machine
  • the dataset I use is iNaturalist-2021
  • configs\evals\vitl16_inat.yaml looks like this:
nodes: 8
tasks_per_node: 8
tag: inat-16f
eval_name: image_classification_frozen
resume_checkpoint: false
data:
  root_path: D:\__repos\jepa\data
  image_folder: inat
  num_classes: 10000
  resolution: 224
  dataset_name: iNat21
optimization:
  num_epochs: 20
  batch_size: 16
  weight_decay: 0.001
  lr: 0.001
  start_lr: 0.001
  final_lr: 0.0
  warmup: 0.
  use_bfloat16: true
pretrain:
  model_name: vit_large
  checkpoint_key: target_encoder
  clip_duration: null
  frames_per_clip: 16
  tubelet_size: 2
  uniform_power: true
  use_sdpa: true
  use_silu: false
  tight_silu: false
  patch_size: 16
  folder: D:\__repos\jepa\models
  checkpoint: vitl16.pth.tar  # name of pretrained model file inside folder
  write_tag: jepa
  • packages' versions:
Package            Version
------------------ ------------
certifi            2024.2.2
charset-normalizer 3.3.2
colorama           0.4.6
filelock           3.9.0
fsspec             2024.3.1
huggingface-hub    0.22.2
idna               3.7
Jinja2             3.1.2
MarkupSafe         2.1.3
mpmath             1.3.0
networkx           3.2.1
numpy              1.26.3
packaging          24.0
pillow             10.2.0
pip                22.0.4
PyYAML             6.0.1
requests           2.31.0
safetensors        0.4.3
setuptools         58.1.0
sympy              1.12
timm               0.9.16
torch              2.2.2+cu118
torchvision        0.17.2+cu118
tqdm               4.66.2
typing_extensions  4.8.0
urllib3            2.2.1

I have tried

  • as far as I understand, the problem is caused by torch.distributed being available but not initialized; however, I haven't been able to pinpoint where this happens
  • I've run this little 'checklist' from SO and got
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3060'
  • this PyTorch forum post suggests that incorrect usage of DistributedDataParallel is the root cause, but I haven't found such usage in the repo
  • this GitHub issue suggested SyncBatchNorm behaving in an unexpected way when running on a single GPU, but that has already been fixed in this PR
  • this problem also seems similar to this and this issue
  • I've also tried commenting out these lines:
    world_size, rank = init_distributed(rank_and_world_size=(rank, world_size))
    logger.info(f'Running... (rank: {rank}/{world_size})')

in evals/main.py, to avoid using the init_distributed function (see the sketch below)
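For reference, a minimal sketch of explicitly creating the default process group for a single-GPU run; this is not the repo's init_distributed, and the gloo backend and port are assumptions on my part (NCCL is not compiled into the Windows PyTorch build, as the log below shows):

import os
import torch.distributed as dist

# Single-process "distributed" setup so that DistributedDataParallel
# can find a default process group. The gloo backend is assumed here
# because NCCL is unavailable in the Windows PyTorch build.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '12355')  # any free port
dist.init_process_group(backend='gloo', rank=0, world_size=1)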

Full stacktrace

(venv) PS D:\__repos\jepa> python -m evals.main --fname configs/evals/vitl16_inat.yaml
INFO:root:called-params configs/evals/vitl16_inat.yaml
INFO:root:loaded params...
{   'data': {   'dataset_name': 'iNat21',
                'image_folder': 'inat',
                'num_classes': 10000,
                'resolution': 224,
                'root_path': 'D:\\__repos\\jepa\\data'},
    'eval_name': 'image_classification_frozen',
    'nodes': 8,
    'optimization': {   'batch_size': 16,
                        'final_lr': 0.0,
                        'lr': 0.001,
                        'num_epochs': 20,
                        'start_lr': 0.001,
                        'use_bfloat16': True,
                        'warmup': 0.0,
                        'weight_decay': 0.001},
    'pretrain': {   'checkpoint': 'vitl16.pth.tar',
                    'checkpoint_key': 'target_encoder',
                    'clip_duration': None,
                    'folder': 'D:\\__repos\\jepa\\models',
                    'frames_per_clip': 16,
                    'model_name': 'vit_large',
                    'patch_size': 16,
                    'tight_silu': False,
                    'tubelet_size': 2,
                    'uniform_power': True,
                    'use_sdpa': True,
                    'use_silu': False,
                    'write_tag': 'jepa'},
    'resume_checkpoint': False,
    'tag': 'inat-16f',
    'tasks_per_node': 8}
D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
  warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
INFO:root:Rank: 0. Distributed training not available Distributed package doesn't have NCCL built in
INFO:root:Running... (rank: 0/1)
INFO:root:Running evaluation: image_classification_frozen
INFO:root:SLURM vars not set (distributed training not available)
INFO:root:Initialized (rank/world-size) 0/1
INFO:root:Loading pretrained model from D:\__repos\jepa\models\vitl16.pth.tar
VisionTransformer(
  (patch_embed): PatchEmbed3D(
    (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
  )
  (blocks): ModuleList(
    (0-23): 24 x Block(
      (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=1024, out_features=3072, bias=True)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1024, out_features=1024, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (mlp): MLP(
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (act): GELU(approximate='none')
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (drop): Dropout(p=0.0, inplace=False)
      )
    )
  )
  (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
INFO:root:loaded pretrained model with msg: <All keys matched successfully>
INFO:root:loaded pretrained encoder from epoch: 300
 path: D:\__repos\jepa\models\vitl16.pth.tar
INFO:root:implementing auto-agument strategy
INFO:root:data-path D:\__repos\jepa\data\inat\train/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:data-path D:\__repos\jepa\data\inat\val/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:Dataloader created... iterations per epoch: 31250
INFO:root:Using AdamW
Process Process-1:
Traceback (most recent call last):
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 108, in run    self._target(*self._args, **self._kwargs)
  File "D:\__repos\jepa\evals\main.py", line 57, in process_main
    eval_main(params['eval_name'], args_eval=params)
  File "D:\__repos\jepa\evals\scaffold.py", line 22, in main
    return importlib.import_module(f'evals.{eval_name}.eval').main(
  File "D:\__repos\jepa\evals\image_classification_frozen\eval.py", line 201, in main
    classifier = DistributedDataParallel(classifier, static_graph=True)
  File "D:\__repos\jepa\venv\lib\site-packages\torch\nn\parallel\distributed.py", line 731, in __init__
    self.process_group = _get_default_group()
  File "D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
@LangDaniel

You could just remove the lines initializing the DistributedDataParallel in app/vjepa/train.py, i.e. lines 295-297, as a quick fix.
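A less invasive variant of the same idea, as a sketch only (not the repo's actual code; the Linear module below is just a stand-in for whatever those lines wrap), is to guard the wrapping so it only happens when a process group exists:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

model = torch.nn.Linear(8, 2)  # stand-in for the module wrapped in train.py

# Wrap in DDP only when a default process group has been initialized;
# on a plain single-GPU run without init_process_group, keep the bare module.
if dist.is_available() and dist.is_initialized():
    model = DistributedDataParallel(model, static_graph=True)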

@MKaczkow
Author

MKaczkow commented May 3, 2024

It didn't help, I'm afraid; I'm still getting:

ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@LangDaniel

Did you also try to remove it from the eval scripts, i.e. line 201 in evals/image_classification_frozen/eval.py?

@krstevskipetar

I faced the same issue using a single GPU on one machine. I got it working by changing the port and explicitly defining the rank and world size. For evaluation, you can edit line 131 in evals/video_classification_frozen/eval.py to be:
world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))
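Presumably the analogous change applies to the image-classification eval that fails in the traceback above; this is a guess that assumes evals/image_classification_frozen/eval.py calls the same init_distributed helper with the same signature:

# evals/image_classification_frozen/eval.py (sketch, assuming init_distributed
# is already imported there): force a single-process world so that
# DistributedDataParallel finds a default process group.
world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))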

saten-private added a commit to saten-private/jepa that referenced this issue Jun 12, 2024
ValueError: Default process group has not been initialized, please make sure to call init_process_group
facebookresearch#55
saten-private added a commit to saten-private/jepa that referenced this issue Jun 13, 2024
ValueError: Default process group has not been initialized, please make sure to call init_process_group
facebookresearch#55
saten-private added a commit to saten-private/jepa that referenced this issue Jun 13, 2024
IndexError: index 1 is out of bounds for axis 1 with size 1
facebookresearch#55 (comment)