ValueError: Default process group has not been initialized, please make sure to call init_process_group #55

Open
MKaczkow opened this issue Apr 18, 2024 · 4 comments

Comments

@MKaczkow

MKaczkow commented Apr 18, 2024

First of all, thanks for providing this code 😄

tl;dr

I am getting a ValueError when trying to run eval on the iNat21 dataset with python -m evals.main --fname configs/evals/vitl16_inat.yaml --devices cuda:0, and I am running out of ideas on how to fix it.

Config values

  • I am trying to run eval on a single GPU on a local machine
  • the dataset I use is iNaturalist-2021
  • configs\evals\vitl16_inat.yaml looks like this:
nodes: 8
tasks_per_node: 8
tag: inat-16f
eval_name: image_classification_frozen
resume_checkpoint: false
data:
  root_path: D:\__repos\jepa\data
  image_folder: inat
  num_classes: 10000
  resolution: 224
  dataset_name: iNat21
optimization:
  num_epochs: 20
  batch_size: 16
  weight_decay: 0.001
  lr: 0.001
  start_lr: 0.001
  final_lr: 0.0
  warmup: 0.
  use_bfloat16: true
pretrain:
  model_name: vit_large
  checkpoint_key: target_encoder
  clip_duration: null
  frames_per_clip: 16
  tubelet_size: 2
  uniform_power: true
  use_sdpa: true
  use_silu: false
  tight_silu: false
  patch_size: 16
  folder: D:\__repos\jepa\models
  checkpoint: vitl16.pth.tar  # name of pretrained model file inside folder
  write_tag: jepa
  • packages' versions:
Package            Version
------------------ ------------
certifi            2024.2.2
charset-normalizer 3.3.2
colorama           0.4.6
filelock           3.9.0
fsspec             2024.3.1
huggingface-hub    0.22.2
idna               3.7
Jinja2             3.1.2
MarkupSafe         2.1.3
mpmath             1.3.0
networkx           3.2.1
numpy              1.26.3
packaging          24.0
pillow             10.2.0
pip                22.0.4
PyYAML             6.0.1
requests           2.31.0
safetensors        0.4.3
setuptools         58.1.0
sympy              1.12
timm               0.9.16
torch              2.2.2+cu118
torchvision        0.17.2+cu118
tqdm               4.66.2
typing_extensions  4.8.0
urllib3            2.2.1

I have tried

  • as far as I understand, the problem is caused by torch.distributed being available but not initialized; however, I haven't been able to pinpoint where this happens
  • I've run this little 'checklist' from SO and got
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 3060'
  • this PyTorch forum post suggests that incorrect usage of DistributedDataParallel is the root cause, but I haven't found such usage in the repo
  • this GitHub issue suggested SyncBatchNorm behaving in an unexpected way when running on a single GPU, but that has already been fixed in this PR
  • this problem also seems similar to this and this issue
  • I've also tried commenting out these lines:
    world_size, rank = init_distributed(rank_and_world_size=(rank, world_size))
    logger.info(f'Running... (rank: {rank}/{world_size})')

in evals/main.py, to avoid using the init_distributed function (see the sketch below)
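For reference, a minimal sketch of explicitly creating the default process group for a single-GPU run; this is not the repo's init_distributed, and the gloo backend and port are assumptions on my part (NCCL is not compiled into the Windows PyTorch build, as the log below shows):

import os
import torch.distributed as dist

# Single-process "distributed" setup so that DistributedDataParallel
# can find a default process group. The gloo backend is assumed here
# because NCCL is unavailable in the Windows PyTorch build.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '12355')  # any free port
dist.init_process_group(backend='gloo', rank=0, world_size=1)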

Full stacktrace

(venv) PS D:\__repos\jepa> python -m evals.main --fname configs/evals/vitl16_inat.yaml
INFO:root:called-params configs/evals/vitl16_inat.yaml
INFO:root:loaded params...
{   'data': {   'dataset_name': 'iNat21',
                'image_folder': 'inat',
                'num_classes': 10000,
                'resolution': 224,
                'root_path': 'D:\\__repos\\jepa\\data'},
    'eval_name': 'image_classification_frozen',
    'nodes': 8,
    'optimization': {   'batch_size': 16,
                        'final_lr': 0.0,
                        'lr': 0.001,
                        'num_epochs': 20,
                        'start_lr': 0.001,
                        'use_bfloat16': True,
                        'warmup': 0.0,
                        'weight_decay': 0.001},
    'pretrain': {   'checkpoint': 'vitl16.pth.tar',
                    'checkpoint_key': 'target_encoder',
                    'clip_duration': None,
                    'folder': 'D:\\__repos\\jepa\\models',
                    'frames_per_clip': 16,
                    'model_name': 'vit_large',
                    'patch_size': 16,
                    'tight_silu': False,
                    'tubelet_size': 2,
                    'uniform_power': True,
                    'use_sdpa': True,
                    'use_silu': False,
                    'write_tag': 'jepa'},
    'resume_checkpoint': False,
    'tag': 'inat-16f',
    'tasks_per_node': 8}
D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
  warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
INFO:root:Rank: 0. Distributed training not available Distributed package doesn't have NCCL built in
INFO:root:Running... (rank: 0/1)
INFO:root:Running evaluation: image_classification_frozen
INFO:root:SLURM vars not set (distributed training not available)
INFO:root:Initialized (rank/world-size) 0/1
INFO:root:Loading pretrained model from D:\__repos\jepa\models\vitl16.pth.tar
VisionTransformer(
  (patch_embed): PatchEmbed3D(
    (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
  )
  (blocks): ModuleList(
    (0-23): 24 x Block(
      (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=1024, out_features=3072, bias=True)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1024, out_features=1024, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (mlp): MLP(
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (act): GELU(approximate='none')
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (drop): Dropout(p=0.0, inplace=False)
      )
    )
  )
  (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
INFO:root:loaded pretrained model with msg: <All keys matched successfully>
INFO:root:loaded pretrained encoder from epoch: 300
 path: D:\__repos\jepa\models\vitl16.pth.tar
INFO:root:implementing auto-agument strategy
INFO:root:data-path D:\__repos\jepa\data\inat\train/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:data-path D:\__repos\jepa\data\inat\val/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:Dataloader created... iterations per epoch: 31250
INFO:root:Using AdamW
Process Process-1:
Traceback (most recent call last):
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 108, in run    self._target(*self._args, **self._kwargs)
  File "D:\__repos\jepa\evals\main.py", line 57, in process_main
    eval_main(params['eval_name'], args_eval=params)
  File "D:\__repos\jepa\evals\scaffold.py", line 22, in main
    return importlib.import_module(f'evals.{eval_name}.eval').main(
  File "D:\__repos\jepa\evals\image_classification_frozen\eval.py", line 201, in main
    classifier = DistributedDataParallel(classifier, static_graph=True)
  File "D:\__repos\jepa\venv\lib\site-packages\torch\nn\parallel\distributed.py", line 731, in __init__
    self.process_group = _get_default_group()
  File "D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
@LangDaniel

You could just remove the lines initializing the DistributedDataParallel in app/vjepa/train.py, i.e. lines 295-297, as a quick fix.
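A less invasive variant of the same idea, as a sketch only (not the repo's actual code; the Linear module below is just a stand-in for whatever those lines wrap), is to guard the wrapping so it only happens when a process group exists:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

model = torch.nn.Linear(8, 2)  # stand-in for the module wrapped in train.py

# Wrap in DDP only when a default process group has been initialized;
# on a plain single-GPU run without init_process_group, keep the bare module.
if dist.is_available() and dist.is_initialized():
    model = DistributedDataParallel(model, static_graph=True)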

@MKaczkow
Author

MKaczkow commented May 3, 2024

It didn't help, I'm afraid; I'm still getting:

ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@LangDaniel

Did you also try to remove it from the eval scripts, i.e. line 201 in evals/image_classification_frozen/eval.py?

@krstevskipetar

I faced the same issue using a single GPU on one machine. I got it working by changing the port and explicitly defining the rank and world size. For evaluation, you can edit line 131 in evals/video_classification_frozen/eval.py to be:
world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))
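Presumably the analogous change applies to the image-classification eval that fails in the traceback above; this is a guess that assumes evals/image_classification_frozen/eval.py calls the same init_distributed helper with the same signature:

# evals/image_classification_frozen/eval.py (sketch, assuming init_distributed
# is already imported there): force a single-process world so that
# DistributedDataParallel finds a default process group.
world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))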

saten-private added a commit to saten-private/jepa that referenced this issue Jun 12, 2024
ValueError: Default process group has not been initialized, please make sure to call init_process_group
facebookresearch#55
saten-private added a commit to saten-private/jepa that referenced this issue Jun 13, 2024
ValueError: Default process group has not been initialized, please make sure to call init_process_group
facebookresearch#55
saten-private added a commit to saten-private/jepa that referenced this issue Jun 13, 2024
IndexError: index 1 is out of bounds for axis 1 with size 1
facebookresearch#55 (comment)