
HOW TO: Training in Google Colab (Single T4) and "NotImplementedError" #39

LeMosquitar opened this issue Feb 28, 2025 · 6 comments



LeMosquitar commented Feb 28, 2025

Hello, I am trying out this project. Thank you for your efforts, by the way!

  1. I ran the project in Google Colab: cloned the repo, installed the requirements, and ran inference.
  2. Inference produced output, which tells me everything is installed properly.
  3. I then prepared for training:
    -> Followed the required folder structure and dataset format
    -> In custom_detection.yml, set remap_mscoco_category to False
    -> Also changed the other parameters in custom_detection.yml as shown below:
task: detection

evaluator:
  type: CocoEvaluator
  iou_types: ['bbox', ]

num_classes: 3 # your dataset classes
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/train
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_train.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  num_workers: 4
  drop_last: True
  collate_fn:
    type: BatchImageCollateFunction

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/val
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_val.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: False
  num_workers: 4
  drop_last: False
  collate_fn:
    type: BatchImageCollateFunction
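
With `remap_mscoco_category: False` and `num_classes: 3`, it may be worth checking the annotation file before training. The sketch below assumes that, with remapping off, each `category_id` is used directly as the class label and so should fall in `[0, num_classes)`. That is an assumption about the dataset code, not something confirmed in this thread. `check_categories` and `check_annotation_file` are hypothetical helpers and not part of DEIM.

```python
import json


def check_categories(coco: dict, num_classes: int) -> list[str]:
    """Return a list of problems found in a COCO-style annotation dict."""
    problems = []
    declared = {c["id"] for c in coco.get("categories", [])}
    used = {a["category_id"] for a in coco.get("annotations", [])}

    # The number of declared categories should match num_classes in the config.
    if len(declared) != num_classes:
        problems.append(
            f"{len(declared)} categories declared, but num_classes is {num_classes}"
        )

    # Every id used by an annotation should be declared under "categories".
    undeclared = sorted(used - declared)
    if undeclared:
        problems.append(f"annotations use undeclared category ids: {undeclared}")

    # Assumption: without remapping, ids are used as labels and must be 0-based.
    out_of_range = sorted(i for i in declared | used if not 0 <= i < num_classes)
    if out_of_range:
        problems.append(f"category ids outside [0, {num_classes}): {out_of_range}")

    return problems


def check_annotation_file(path: str, num_classes: int) -> list[str]:
    """Load a COCO annotation JSON file and run the checks above."""
    with open(path, encoding="utf-8") as f:
        return check_categories(json.load(f), num_classes)
```

For the config above, this would be called as `check_annotation_file(".../annotations/instances_train.json", 3)` using the `ann_file` path from the YAML. An empty list means none of these checks failed.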

I also edited dataloader.yml (to reduce the batch size):

train_dataloader:
  dataset:
    transforms:
      ops:
        - {type: RandomPhotometricDistort, p: 0.5}
        - {type: RandomZoomOut, fill: 0}
        - {type: RandomIoUCrop, p: 0.8}
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: RandomHorizontalFlip}
        - {type: Resize, size: [640, 640], }
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
        - {type: ConvertBoxes, fmt: 'cxcywh', normalize: True}
      policy:
        name: stop_epoch
        epoch: 72 # epoch in [71, ~) stop `ops`
        ops: ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']

  collate_fn:
    type: BatchImageCollateFunction
    base_size: 640
    base_size_repeat: 3
    stop_epoch: 72 # epoch in [72, ~) stop `multiscales`

  shuffle: True
  total_batch_size: 8 # total batch size equals to 32 (4 * 8)
  num_workers: 4

val_dataloader:
  dataset:
    transforms:
      ops:
        - {type: Resize, size: [640, 640], }
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
  shuffle: False
  total_batch_size: 8
  num_workers: 4
  4. Without modifying anything else, I started training with this command:
!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 -c "/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml" --use-amp --seed=0 -t "/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth"

I then got the following output:

2025-02-28 09:13:07.162205: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740733987.183540   13770] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740733987.190107   13770] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-28 09:13:07.211146: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Initialized distributed mode...
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'epoches': 120, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 3, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'CocoEvaluator', 'iou_types': ['bbox']}, 'num_classes': 80, 'remap_mscoco_category': False, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/train2017/', 'ann_file': '/datassd/COCO/annotations/instances_train2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probability': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 
'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 117], 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'shuffle': True, 'num_workers': 4, 'drop_last': True, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 3, 'stop_epoch': 117, 'scales': None, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'total_batch_size': 16}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/val2017/', 'ann_file': '/datassd/COCO/annotations/instances_val2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 4, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFunction'}, 'total_batch_size': 8}, 'print_freq': 100, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 2000, 'start': 0}, 'epoches': 120, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0002, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [1000], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 2000}, 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'PResNet', 'encoder': 'HybridEncoder', 'decoder': 'RTDETRTransformerv2'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 3, 'PResNet': {'depth': 18, 'variant': 'd', 
'freeze_at': -1, 'return_idx': [1, 2, 3], 'num_stages': 4, 'freeze_norm': False, 'pretrained': True, 'local_model_dir': '../RT-DETR-main/rtdetrv2_pytorch/INK1k/'}, 'HybridEncoder': {'in_channels': [128, 256, 512], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 1, 'act': 'silu', 'version': 'rt_detrv2'}, 'RTDETRTransformerv2': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'eval_idx': -1, 'num_points': [4, 4, 4], 'cross_attn_method': 'default', 'query_select_method': 'default', 'query_pos_method': 'as_reg', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_mal': 1}, 'losses': ['mal', 'boxes'], 'alpha': 0.75, 'gamma': 1.5, 'use_uni_set': False, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./rtdetrv2_r18vd_120e_coco.yml', '../base/rt_deim.yml'], 'config': '/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml', 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
/content/DEIM/engine/backbone/ FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(model_path, map_location='cpu')
Loaded PResNet18 from local file@../RT-DETR-main/rtdetrv2_pytorch/INK1k/ResNet18_vd_pretrained_from_paddle.pth.
Load PResNet18 state_dict
     ### Query Position Embedding@as_reg ###     
Tuning checkpoint from /content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth
/content/DEIM/engine/solver/ FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(path, map_location='cpu')
Load model.state_dict, {'missed': [], 'unmatched': []}
/content/DEIM/engine/core/ FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  return module(**module_kwargs)
Initial lr: [0.0002, 0.0002]
building train_dataloader with batch_size=16...
     ### Transform @Mosaic ###    
     ### Transform @RandomPhotometricDistort ###    
     ### Transform @RandomZoomOut ###    
     ### Transform @RandomIoUCrop ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @RandomHorizontalFlip ###    
     ### Transform @Resize ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @ConvertPILImage ###    
     ### Transform @ConvertBoxes ###    
     ### Mosaic with Prob.@0.5 and ZoomOut/IoUCrop existed ### 
     ### ImgTransforms Epochs: [4, 64, 117] ### 
     ### Policy_ops@['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop'] ###
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/DEIM/", line 84, in <module>
[rank0]:     main(args)
[rank0]:   File "/content/DEIM/", line 54, in main
[rank0]:   File "/content/DEIM/engine/solver/", line 25, in fit
[rank0]:     self.train()
[rank0]:   File "/content/DEIM/engine/solver/", line 87, in train
[rank0]:     self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/", line 76, in train_dataloader
[rank0]:     self._train_dataloader = self.build_dataloader('train_dataloader')
[rank0]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/", line 172, in build_dataloader
[rank0]:     loader = create(name, global_cfg, batch_size=bs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/", line 119, in create
[rank0]:     return create(name, global_cfg)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/", line 167, in create
[rank0]:     module_kwargs[k] = create(name, global_cfg)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/", line 180, in create
[rank0]:     return module(**module_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/data/dataset/", line 33, in __init__
[rank0]:     super(CocoDetection, self).__init__(img_folder, ann_file)
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torchvision/datasets/", line 37, in __init__
[rank0]:     self.coco = COCO(annFile)
[rank0]:                 ^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/", line 57, in __init__
[rank0]:     self.dataset = self.load_json(annotation_file, self.use_deepcopy)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/", line 302, in load_json
[rank0]:     with open(json_file) as io:
[rank0]:          ^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/datassd/COCO/annotations/instances_train2017.json'
E0228 09:13:17.895000 13755 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 13770) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 919, in main
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 910, in run
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 269, in launch_agent
    raise ChildFailedError(
============================================================ FAILED
Root Cause (first observed failure):
  time      : 2025-02-28_09:13:17
  host      : 2c5ae9ce8b33
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13770)
  error_file: <N/A>
  traceback : To enable traceback see:

Is my training procedure correct? I followed the steps, but I seem to be missing something. Also, why does training look for '/datassd/COCO/annotations/instances_train2017.json' when I intend to use a custom dataset?
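
The `cfg:` dump in the log is the configuration that was actually loaded. It still shows `num_classes: 80` and the `/datassd/COCO/...` paths, so the values edited in custom_detection.yml do not appear to have reached this run. A minimal sketch for catching this before the dataloader is built: walk the loaded config dict and list dataset paths that do not exist on disk. `missing_dataset_paths` is a hypothetical helper, not part of DEIM. The key names `img_folder` and `ann_file` are taken from the dump, and the `cfg` dict here is a hand-copied fragment of it.

```python
import os


def missing_dataset_paths(cfg, keys=("img_folder", "ann_file"), prefix=""):
    """Recursively collect (location, path) pairs whose path does not exist."""
    missing = []
    if isinstance(cfg, dict):
        for k, v in cfg.items():
            where = f"{prefix}.{k}" if prefix else k
            if k in keys and isinstance(v, str):
                if not os.path.exists(v):
                    missing.append((where, v))
            else:
                missing.extend(missing_dataset_paths(v, keys, where))
    elif isinstance(cfg, list):
        for i, item in enumerate(cfg):
            missing.extend(missing_dataset_paths(item, keys, f"{prefix}[{i}]"))
    return missing


# Fragment of the dump printed in the log above (hand-copied, not loaded from disk).
cfg = {
    "num_classes": 80,
    "train_dataloader": {
        "dataset": {
            "img_folder": "/datassd/COCO/train2017/",
            "ann_file": "/datassd/COCO/annotations/instances_train2017.json",
        }
    },
    "val_dataloader": {
        "dataset": {
            "img_folder": "/datassd/COCO/val2017/",
            "ann_file": "/datassd/COCO/annotations/instances_val2017.json",
        }
    },
}

for where, path in missing_dataset_paths(cfg):
    print(f"{where}: {path} does not exist")
```

If the paths reported are not the ones written in the custom dataset file, the config passed with `-c` is probably not including that file.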

LeMosquitar changed the title from "Training in Google Colab (Single T4)" to "HOW TO: Training in Google Colab (Single T4) Issue" on Feb 28, 2025

I modified my command to:

!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 -c "/content/DEIM/configs/deim_rtdetrv2/rtdetrv2_r18vd_120e_coco.yml" --use-amp --seed=0 -t "/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth"

The output is about the same:

2025-02-28 11:01:51.726318: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740740511.747945    8193] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740740511.754415    8193] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-28 11:01:51.776005: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Initialized distributed mode...
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'epoches': 120, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 1, 'no_aug_epoch': 0, 'warmup_iter': 2000, 'flat_epoch': 4000000, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './output/rtdetrv2_r18vd_120e_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'CocoEvaluator', 'iou_types': ['bbox']}, 'num_classes': 80, 'remap_mscoco_category': True, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/train2017/', 'ann_file': '/datassd/COCO/annotations/instances_train2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': 117, 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}}}, 'shuffle': True, 'num_workers': 4, 'drop_last': True, 
'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 3, 'stop_epoch': 72, 'scales': None}, 'total_batch_size': 16}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/val2017/', 'ann_file': '/datassd/COCO/annotations/instances_val2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 4, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFunction'}, 'total_batch_size': 8}, 'print_freq': 100, 'output_dir': './output/rtdetrv2_r18vd_120e_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 2000, 'start': 0}, 'epoches': 120, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0001, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [1000], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 2000}, 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'PResNet', 'encoder': 'HybridEncoder', 'decoder': 'RTDETRTransformerv2'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 1, 'warmup_iter': 2000, 'flat_epoch': 4000000, 'no_aug_epoch': 0, 'PResNet': {'depth': 18, 'variant': 'd', 'freeze_at': -1, 'return_idx': [1, 2, 3], 'num_stages': 4, 'freeze_norm': False, 'pretrained': True, 'local_model_dir': '../RT-DETR-main/rtdetrv2_pytorch/INK1k/'}, 'HybridEncoder': {'in_channels': [128, 256, 512], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 
'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 1, 'act': 'silu', 'version': 'rt_detrv2'}, 'RTDETRTransformerv2': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'eval_idx': -1, 'num_points': [4, 4, 4], 'cross_attn_method': 'default', 'query_select_method': 'default'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2}, 'losses': ['vfl', 'boxes'], 'alpha': 0.75, 'gamma': 2.0, 'use_uni_set': False, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['../dataset/coco_detection.yml', '../runtime.yml', '../base/dataloader.yml', '../base/rt_optimizer.yml', '../base/rtdetrv2_r50vd.yml'], 'config': '/content/DEIM/configs/deim_rtdetrv2/rtdetrv2_r18vd_120e_coco.yml', 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
/content/DEIM/DEIM/../engine/backbone/ FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(model_path, map_location='cpu')
Loaded PResNet18 from local file@../RT-DETR-main/rtdetrv2_pytorch/INK1k/ResNet18_vd_pretrained_from_paddle.pth.
Load PResNet18 state_dict
Tuning checkpoint from /content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth
/content/DEIM/DEIM/../engine/solver/ FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(path, map_location='cpu')
Load model.state_dict, {'missed': [], 'unmatched': ['decoder.query_pos_head.layers.0.weight', 'decoder.query_pos_head.layers.0.bias', 'decoder.query_pos_head.layers.1.weight']}
/content/DEIM/DEIM/../engine/core/ FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  return module(**module_kwargs)
Initial lr: [0.0001, 0.0001]
building train_dataloader with batch_size=16...
     ### Transform @RandomPhotometricDistort ###    
     ### Transform @RandomZoomOut ###    
     ### Transform @RandomIoUCrop ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @RandomHorizontalFlip ###    
     ### Transform @Resize ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @ConvertPILImage ###    
     ### Transform @ConvertBoxes ###    
     ### ImgTransforms Epochs: 117 ### 
     ### Policy_ops@['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop'] ###
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/DEIM/DEIM/", line 84, in <module>
[rank0]:     main(args)
[rank0]:   File "/content/DEIM/DEIM/", line 54, in main
[rank0]:   File "/content/DEIM/DEIM/../engine/solver/", line 25, in fit
[rank0]:     self.train()
[rank0]:   File "/content/DEIM/DEIM/../engine/solver/", line 87, in train
[rank0]:     self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/core/", line 76, in train_dataloader
[rank0]:     self._train_dataloader = self.build_dataloader('train_dataloader')
[rank0]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/core/", line 172, in build_dataloader
[rank0]:     loader = create(name, global_cfg, batch_size=bs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/core/", line 119, in create
[rank0]:     return create(name, global_cfg)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/core/", line 167, in create
[rank0]:     module_kwargs[k] = create(name, global_cfg)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/core/", line 180, in create
[rank0]:     return module(**module_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/DEIM/../engine/data/dataset/", line 33, in __init__
[rank0]:     super(CocoDetection, self).__init__(img_folder, ann_file)
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torchvision/datasets/", line 37, in __init__
[rank0]:     self.coco = COCO(annFile)
[rank0]:                 ^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/", line 57, in __init__
[rank0]:     self.dataset = self.load_json(annotation_file, self.use_deepcopy)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/", line 302, in load_json
[rank0]:     with open(json_file) as io:
[rank0]:          ^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/datassd/COCO/annotations/instances_train2017.json'
E0228 11:02:02.741000 8178 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 8193) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 919, in main
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 910, in run
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 269, in launch_agent
    raise ChildFailedError(
============================================================ FAILED
Root Cause (first observed failure):
  time      : 2025-02-28_11:02:02
  host      : 059ca17322f9
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8193)
  error_file: <N/A>
  traceback : To enable traceback see:
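
Both dumps end with an `__include__` list, and neither list names custom_detection.yml. The second one pulls in `../dataset/coco_detection.yml`, which would explain the `/datassd/COCO/...` paths. The sketch below walks such a chain so the dataset file a config ends up using can be seen at a glance. It assumes each include is resolved relative to the directory of the including file, which the relative paths in the dump suggest but do not prove. `include_chain` is a hypothetical helper, and the `files` mapping is an in-memory stand-in built only from the two `__include__` lists shown in the logs.

```python
import os
from typing import Callable


def include_chain(path: str, load: Callable[[str], dict], seen=None) -> list[str]:
    """Return `path` plus every file reachable through `__include__`, depth-first."""
    seen = [] if seen is None else seen
    path = os.path.normpath(path)
    if path in seen:  # guard against include cycles and duplicates
        return seen
    seen.append(path)
    base = os.path.dirname(path)
    for rel in load(path).get("__include__", []):
        # Assumption: includes are relative to the including file's directory.
        include_chain(os.path.join(base, rel), load, seen)
    return seen


# In-memory stand-in for the YAML files, copied from the two dumps above.
files = {
    "configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml": {
        "__include__": ["./rtdetrv2_r18vd_120e_coco.yml", "../base/rt_deim.yml"],
    },
    "configs/deim_rtdetrv2/rtdetrv2_r18vd_120e_coco.yml": {
        "__include__": [
            "../dataset/coco_detection.yml",
            "../runtime.yml",
            "../base/dataloader.yml",
            "../base/rt_optimizer.yml",
            "../base/rtdetrv2_r50vd.yml",
        ],
    },
}


def load(path: str) -> dict:
    # Files not listed above are treated as having no further includes.
    return files.get(path.replace(os.sep, "/"), {})


chain = include_chain("configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml", load)
for p in chain:
    print(p)
```

Against the real files, `load` could instead read each path with PyYAML's `yaml.safe_load`. If no entry in the printed chain is the custom dataset file, edits made there will not affect the run.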


The issue has changed. I switched to my own machine to test training. I had to download the COCO dataset for this, which was a real hassle since it is 16GB+, and I think that was only the "train" images. After downloading, I can get past the earlier error, but now I am getting "NotImplementedError". It seems to be the same issue as this one:

Training command:
python -c "deim_dfine\deim_hgnetv2_s_coco.yml" --use-amp --seed=0 -d cpu -t "deim_dfine_hgnetv2_s_coco_120e.pth"

The output is below:

Not init distributed mode.
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': 'C:\\DEIM-D-FINE_models_config\\S_DEIM-DEFINE\\deim_dfine_hgnetv2_s_coco_120e.pth', 'epoches': 132, 'last_
epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 12, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema
_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_fr
eq': 4, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'summary_dir': None, 'device': 'cpu', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'Coco
Evaluator', 'iou_types': ['bbox']}, 'num_classes': 80, 'remap_mscoco_category': False, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'Coc
oDetection', 'img_folder': 'C:/COCO/train2017/', 'ann_file': 'C:/COCO/annotations/instances_train2017.json', 'return_masks': False, 'transforms': {'type':
 'Compose', 'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probabilit
y': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'Rand
omZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 
'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 
'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 120], 'ops': ['Mosaic', 'RandomPhotometricDistort'
, 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'shuffle': True, 'num_workers': 4, 'drop_last': True, 'collate_fn': {'type': 'BatchImageCollat
eFunction', 'base_size': 640, 'base_size_repeat': 20, 'stop_epoch': 120, 'ema_restart_decay': 0.9999, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'total_
batch_size': 1}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': 'C:/COCO/val2017/', 'ann_file': 'C:/COCO/anno
tations/instances_val2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'Conver
tPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 4, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFuncti
on'}, 'total_batch_size': 1}, 'print_freq': 100, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parame
ters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups':
 1000, 'start': 0}, 'epoches': 132, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*backbone)(?!.*bn).*$', 'lr': 0.0002}
, {'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0004, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'Multi
StepLR', 'milestones': [500], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 500}, 'model': 'DEIM', 'criterion': 'DEIMC
riterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'HGNetv2', 'encoder': 'HybridEn
coder', 'decoder': 'DFINETransformer'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 12, 'HGNetv2':
 {'pretrained': False, 'local_model_dir': '../RT-DETR-main/D-FINE/weight/hgnetv2/', 'name': 'B0', 'return_idx': [1, 2, 3], 'freeze_at': -1, 'freeze_norm':
 False, 'use_lab': True}, 'HybridEncoder': {'in_channels': [256, 512, 1024], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_
encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 0.34, 'act': 'silu'}, 'DFINETr
ansformer': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'eval_idx': -1, 'num_quer
ies': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'reg_max': 32, 'reg_scale': 4, 'layer_scale': 1, 'num_points': [3, 6, 3
], 'cross_attn_method': 'default', 'query_select_method': 'default', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 
'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_fgl': 0.15, 'loss_ddf': 1.5, 'loss_mal': 1}, 'losses': ['mal', 'box
es', 'local'], 'alpha': 0.75, 'gamma': 1.5, 'reg_max': 32, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_
giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./dfine_hgnetv2_s_coco.yml', '../base/deim.yml'], 'config': 'C:\\Users\\griff\\PycharmProjects\
\DEIM\\DEIM\\configs\\deim_dfine\\deim_hgnetv2_s_coco.yml', 'tuning': 'C:\\DEIM-D-FINE_models_config\\S_DEIM-DEFINE\\deim_dfine_hgnetv2_s_coco_120e.pth', 'device': 'cpu', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
Tuning checkpoint from C:\DEIM-D-FINE_models_config\S_DEIM-DEFINE\deim_dfine_hgnetv2_s_coco_120e.pth
Load model.state_dict, {'missed': [], 'unmatched': []}
DEIM\engine\core\ FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  return module(**module_kwargs)
.venv\Lib\site-packages\torch\amp\ UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
Initial lr: [0.0002, 0.0004, 0.0004]
building train_dataloader with batch_size=1...
     ### Transform @Mosaic ###
     ### Transform @RandomPhotometricDistort ###
     ### Transform @RandomZoomOut ###
     ### Transform @RandomIoUCrop ###
     ### Transform @SanitizeBoundingBoxes ###
     ### Transform @RandomHorizontalFlip ###
     ### Transform @Resize ###
     ### Transform @SanitizeBoundingBoxes ###
     ### Transform @ConvertPILImage ###
     ### Transform @ConvertBoxes ###
     ### Mosaic with Prob.@0.5 and ZoomOut/IoUCrop existed ###
     ### ImgTransforms Epochs: [4, 64, 120] ###
     ### Policy_ops@['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop'] ###
     ### Using MixUp with Prob.@0.5 in [4, 64] epochs ### 
     ### Multi-scale Training until 120 epochs ###
     ### Multi-scales@ [480, 512, 544, 576, 608, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 640, 800, 768, 736, 704, 672] ###
building val_dataloader with batch_size=1...
     ### Transform @Resize ###
     ### Transform @ConvertPILImage ###

------------------------------------- Calculate Flops Results -------------------------------------
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  10.24 M
fwd MACs:                                                               12.5323 GMACs
fwd FLOPs:                                                              25.1714 GFLOPS
fwd+bwd MACs:                                                           37.5969 GMACs
fwd+bwd FLOPs:                                                          75.5141 GFLOPS
{'Model FLOPs:25.1714 GFLOPS   MACs:12.5323 GMACs   Params:10237491'}
------------------------------------------Start training-------------------------------------------
     ## Using Self-defined Scheduler-flatcosine ##
[0.0002, 0.0004, 0.0004] [0.0001, 0.0002, 0.0002] 15613884 2000 7570368 1419444
number of trainable parameters: 10321875
Traceback (most recent call last):
  File "DEIM\", line 84, in <module>
  File "DEIM\", line 54, in main
  File "DEIM\engine\solver\", line 76, in fit
    train_stats = train_one_epoch(
  File "DEIM\engine\solver\", line 42, in train_one_epoch
    for i, (samples, targets) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "DEIM\engine\misc\", line 215, in log_every
    for obj in iterable:
  File ".venv\Lib\site-packages\torch\utils\data\", line 708, in __next__
    data = self._next_data()
  File ".venv\Lib\site-packages\torch\utils\data\", line 1480, in _next_data
    return self._process_data(data)
  File ".venv\Lib\site-packages\torch\utils\data\", line 1505, in _process_data
  File ".venv\Lib\site-packages\torch\", line 733, in reraise
    raise exception
NotImplementedError: Caught NotImplementedError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".venv\Lib\site-packages\torch\utils\data\_utils\", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File ".venv\Lib\site-packages\torch\utils\data\_utils\", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File ".venv\Lib\site-packages\torch\utils\data\_utils\", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "DEIM\engine\data\dataset\", line 44, in __getitem__
    img, target, _ = self._transforms(img, target, self)
  File ".venv\Lib\site-packages\torch\nn\modules\", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ".venv\Lib\site-packages\torch\nn\modules\", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "DEIM\engine\data\transforms\", line 58, in forward
    return self.get_forward(self.policy['name'])(*inputs)
  File "DEIM\engine\data\transforms\", line 100, in stop_epoch_forward
    sample = transform(sample)
  File ".venv\Lib\site-packages\torch\nn\modules\", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ".venv\Lib\site-packages\torch\nn\modules\", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File ".venv\Lib\site-packages\torchvision\transforms\v2\", line 68, in forward
    flat_outputs = [
  File ".venv\Lib\site-packages\torchvision\transforms\v2\", line 69, in <listcomp>
    self.transform(inpt, params) if needs_transform else inpt
  File ".venv\Lib\site-packages\torchvision\transforms\v2\", line 55, in transform
    raise NotImplementedError

@LeMosquitar LeMosquitar changed the title HOW TO: Training in Google Colab (Single T4) Issue HOW TO: Training in Google Colab (Single T4) and "NotImplementedError" Mar 1, 2025

I installed a lower torchvision version, which should fix the last problem (the NotImplementedError).
The problem that the COCO dataset is required can be fixed by removing '../dataset/coco_detection.yml' from the __include__ list in the dfine_hgnetv2_s_coco.yml config:
__include__: [ '../dataset/coco_detection.yml', '../runtime.yml', '../base/dataloader.yml', '../base/optimizer.yml', '../base/dfine_hgnetv2.yml', ]

LeMosquitar commented Mar 4, 2025

Thank you. I followed your recommendations. I first downgraded torchvision to 0.15.0, but got an error saying the version must be >= 0.15.2, so I installed 0.15.2. After that, I restarted my runtime to make sure everything was reloaded.
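
The version constraint mentioned here (torchvision >= 0.15.2) can be checked before launching training. A minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of DEIM or torchvision:

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a string such as '0.15.2+cu118' into (0, 15, 2).

    Anything after '+' (the local build tag) is ignored, and parsing
    stops at the first component that does not start with a digit.
    """
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.15.2") -> bool:
    """Return True when `installed` is at least `minimum` (hypothetical helper)."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for candidate in ("0.15.0", "0.15.2", "0.17.0+cu121"):
        print(candidate, meets_minimum(candidate))
```

In a Colab cell, calling `meets_minimum(torchvision.__version__)` after `import torchvision` would apply the same check to the installed build.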

I first ran inference with the S model to make sure everything was installed and working. I got a torch_results.jpg with detections overlaid, so everything seems fine. I then proceeded to the configuration:

  1. In my custom_detection.yml, I only changed the paths to the image folders and to my .json annotation files. The dataset is in MS COCO format:
├── images/
│   ├── train/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
│   ├── val/
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
└── annotations/
    ├── instances_train.json
    ├── instances_val.json
    └── ...
  2. I want to train the S version of DEIM, so I went to dfine_hgnetv2_s_coco.yml and removed '../dataset/coco_detection.yml', as suggested above.

  3. In dataloader.yml, I set total_batch_size: 2 for both train and val.

  4. Those are all my changes; I left everything else untouched.
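
Since the steps above depend on the folder layout matching the tree shown, a quick check can rule out path problems before training. This is only a sketch: `missing_entries` is a hypothetical helper, and the relative paths are copied from the tree in this comment rather than from anything DEIM itself enforces.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the dataset tree shown in this comment.
EXPECTED = (
    "images/train",
    "images/val",
    "annotations/instances_train.json",
    "annotations/instances_val.json",
)


def missing_entries(dataset_root):
    """Return the expected relative paths that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Demo on a throwaway directory that only contains images/train.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "images" / "train").mkdir(parents=True)
        print(missing_entries(tmp))
```

Pointing it at the real dataset root (for example the Drive folder used in custom_detection.yml) should return an empty list when the layout is complete.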

I then ran the training command:

!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 "/content/DEIM/" -c "/content/DEIM/configs/deim_dfine/deim_hgnetv2_s_coco.yml" --use-amp --seed=0 -t "/content/deim_dfine_hgnetv2_s_coco_120e.pth"

The result:

2025-03-04 03:16:43.241175: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741058203.264250    8736] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741058203.270997    8736] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-04 03:16:43.294117: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Initialized distributed mode...
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/deim_dfine_hgnetv2_s_coco_120e.pth', 'epoches': 132, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 12, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'print_freq': 100, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 1000, 'start': 0}, 'train_dataloader': {'dataset': {'transforms': {'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probability': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 
'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 120], 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 20, 'stop_epoch': 120, 'ema_restart_decay': 0.9999, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'shuffle': True, 'total_batch_size': 2, 'num_workers': 4}, 'val_dataloader': {'dataset': {'transforms': {'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'total_batch_size': 2, 'num_workers': 4}, 'epoches': 132, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*backbone)(?!.*bn).*$', 'lr': 0.0002}, {'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0004, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [500], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 500}, 'task': 'detection', 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'HGNetv2', 'encoder': 'HybridEncoder', 'decoder': 'DFINETransformer'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 12, 'HGNetv2': {'pretrained': False, 'local_model_dir': '../RT-DETR-main/D-FINE/weight/hgnetv2/', 'name': 'B0', 'return_idx': [1, 2, 3], 'freeze_at': -1, 'freeze_norm': False, 'use_lab': True}, 'HybridEncoder': {'in_channels': [256, 512, 1024], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 0.34, 'act': 'silu'}, 'DFINETransformer': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 
'eval_idx': -1, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'reg_max': 32, 'reg_scale': 4, 'layer_scale': 1, 'num_points': [3, 6, 3], 'cross_attn_method': 'default', 'query_select_method': 'default', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_fgl': 0.15, 'loss_ddf': 1.5, 'loss_mal': 1}, 'losses': ['mal', 'boxes', 'local'], 'alpha': 0.75, 'gamma': 1.5, 'reg_max': 32, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./dfine_hgnetv2_s_coco.yml', '../base/deim.yml'], 'config': '/content/DEIM/configs/deim_dfine/deim_hgnetv2_s_coco.yml', 'tuning': '/content/deim_dfine_hgnetv2_s_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
Tuning checkpoint from /content/deim_dfine_hgnetv2_s_coco_120e.pth
Load model.state_dict, {'missed': [], 'unmatched': []}
Initial lr: [0.0002, 0.0004, 0.0004]
building train_dataloader with batch_size=2...
Traceback (most recent call last):
  File "/content/DEIM/", line 84, in <module>
  File "/content/DEIM/", line 54, in main
  File "/content/DEIM/engine/solver/", line 25, in fit
  File "/content/DEIM/engine/solver/", line 87, in train
    self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
  File "/content/DEIM/engine/core/", line 76, in train_dataloader
    self._train_dataloader = self.build_dataloader('train_dataloader')
  File "/content/DEIM/engine/core/", line 172, in build_dataloader
    loader = create(name, global_cfg, batch_size=bs)
  File "/content/DEIM/engine/core/", line 121, in create
    module = getattr(cfg['_pymodule'], name)
KeyError: '_pymodule'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8736) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 794, in main
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/", line 785, in run
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/", line 250, in launch_agent
    raise ChildFailedError(
/content/DEIM/ FAILED
Root Cause (first observed failure):
  time      : 2025-03-04_03:16:56
  host      : 81fa70c56262
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8736)
  error_file: <N/A>
  traceback : To enable traceback see:

I searched for the keyword: KeyError: '_pymodule' in this repo and got the issue ->

The answer there was to check the paths and configuration, which I believe I have already done. Any recommendations or ideas?
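
One difference between the two cfg dumps in this thread may be worth checking. The run that reached training shows 'type': 'DataLoader' and 'type': 'CocoDetection' under train_dataloader, while the failing dump shows neither. Whether this is what triggers KeyError: '_pymodule' is an assumption, not something confirmed in the thread. The sketch below only reports which `type` keys are absent; `find_missing_types` is a hypothetical helper, and the two dictionaries are heavily trimmed copies of the dumps.

```python
def find_missing_types(cfg, sections=("train_dataloader", "val_dataloader")):
    """Return dotted paths of dataloader sub-configs that lack a 'type' key."""
    missing = []
    for name in sections:
        section = cfg.get(name)
        if not isinstance(section, dict):
            missing.append(name)
            continue
        if "type" not in section:
            missing.append(f"{name}.type")
        dataset = section.get("dataset")
        if not isinstance(dataset, dict) or "type" not in dataset:
            missing.append(f"{name}.dataset.type")
    return missing


# Trimmed from the dump of the run that got as far as "Start training".
working_cfg = {
    "train_dataloader": {"type": "DataLoader", "dataset": {"type": "CocoDetection"}},
    "val_dataloader": {"type": "DataLoader", "dataset": {"type": "CocoDetection"}},
}

# Trimmed from the dump of the Colab run that failed.
failing_cfg = {
    "train_dataloader": {"dataset": {"transforms": {"ops": []}}, "shuffle": True},
    "val_dataloader": {"dataset": {"transforms": {"ops": []}}, "shuffle": False},
}

if __name__ == "__main__":
    print(find_missing_types(working_cfg))
    print(find_missing_types(failing_cfg))
```

If the second call lists missing entries for the real merged config, that would suggest no dataset config is being included after '../dataset/coco_detection.yml' was removed.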

Maybe try running it with python instead of torchrun first:
python -c configs/deim_dfine/deim_hgnetv2_s_coco.yml --use-amp --seed=0 -t deim_dfine_hgnetv2_s_coco_120e.pth

LeMosquitar commented Mar 5, 2025

I ran it with python instead of torchrun -> !python -c configs/deim_dfine/deim_hgnetv2_s_coco.yml --use-amp --seed=0 -t /content/deim_dfine_hgnetv2_s_coco_120e.pth

I got the same result: KeyError: '_pymodule'
I tried:

  1. pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url
  2. pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url
  3. and finally -> pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url

Inference works on all versions/installations above.

My JSON files also seem compliant (section of the file in instances_train.json):


I kept getting the same error:

2025-03-05 08:56:13.010289: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1741164973.031805    7435] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741164973.038648    7435] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-05 08:56:13.060338: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Not init distributed mode.
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/deim_dfine_hgnetv2_s_coco_120e.pth', 'epoches': 132, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 12, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'print_freq': 100, 'output_dir': './outputs/deim_hgnetv2_s_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 1000, 'start': 0}, 'train_dataloader': {'dataset': {'transforms': {'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probability': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 
'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 120], 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 20, 'stop_epoch': 120, 'ema_restart_decay': 0.9999, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'shuffle': True, 'total_batch_size': 1, 'num_workers': 4}, 'val_dataloader': {'dataset': {'transforms': {'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'total_batch_size': 1, 'num_workers': 4}, 'epoches': 132, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*backbone)(?!.*bn).*$', 'lr': 0.0002}, {'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0004, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [500], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 500}, 'task': 'detection', 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'HGNetv2', 'encoder': 'HybridEncoder', 'decoder': 'DFINETransformer'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 12, 'HGNetv2': {'pretrained': False, 'local_model_dir': '../RT-DETR-main/D-FINE/weight/hgnetv2/', 'name': 'B0', 'return_idx': [1, 2, 3], 'freeze_at': -1, 'freeze_norm': False, 'use_lab': True}, 'HybridEncoder': {'in_channels': [256, 512, 1024], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 0.34, 'act': 'silu'}, 'DFINETransformer': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 
'eval_idx': -1, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'reg_max': 32, 'reg_scale': 4, 'layer_scale': 1, 'num_points': [3, 6, 3], 'cross_attn_method': 'default', 'query_select_method': 'default', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_fgl': 0.15, 'loss_ddf': 1.5, 'loss_mal': 1}, 'losses': ['mal', 'boxes', 'local'], 'alpha': 0.75, 'gamma': 1.5, 'reg_max': 32, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./dfine_hgnetv2_s_coco.yml', '../base/deim.yml'], 'config': 'configs/deim_dfine/deim_hgnetv2_s_coco.yml', 'tuning': '/content/deim_dfine_hgnetv2_s_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
Tuning checkpoint from /content/deim_dfine_hgnetv2_s_coco_120e.pth
Load model.state_dict, {'missed': [], 'unmatched': []}
Initial lr: [0.0002, 0.0004, 0.0004]
building train_dataloader with batch_size=1...
Traceback (most recent call last):
  File "/content/DEIM/", line 84, in <module>
  File "/content/DEIM/", line 54, in main
  File "/content/DEIM/engine/solver/", line 25, in fit
  File "/content/DEIM/engine/solver/", line 87, in train
    self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
  File "/content/DEIM/engine/core/", line 76, in train_dataloader
    self._train_dataloader = self.build_dataloader('train_dataloader')
  File "/content/DEIM/engine/core/", line 172, in build_dataloader
    loader = create(name, global_cfg, batch_size=bs)
  File "/content/DEIM/engine/core/", line 121, in create
    module = getattr(cfg['_pymodule'], name)
KeyError: '_pymodule'

My custom dataset is from Roboflow, exported in COCO format. I restructured and renamed the files to comply with the file structure stated in the README, and my annotation JSON files follow the MSCOCO2017 format, as in the section above. I really want to use this model for a project, but I have to fine-tune it first.
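
For an export that has been restructured by hand, a structural check of the annotation file can confirm that it still hangs together. A minimal sketch, assuming only the standard COCO detection layout (top-level `images`, `annotations` and `categories`, with `bbox` as `[x, y, width, height]`); `coco_problems` and `load_and_check` are hypothetical helpers and say nothing about what DEIM itself requires.

```python
import json

REQUIRED_KEYS = ("images", "annotations", "categories")


def coco_problems(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in coco]
    if problems:
        return problems
    image_ids = {image.get("id") for image in coco["images"]}
    category_ids = {category.get("id") for category in coco["categories"]}
    for ann in coco["annotations"]:
        label = f"annotation {ann.get('id')}"
        if ann.get("image_id") not in image_ids:
            problems.append(f"{label}: unknown image_id")
        if ann.get("category_id") not in category_ids:
            problems.append(f"{label}: unknown category_id")
        bbox = ann.get("bbox")
        if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4):
            problems.append(f"{label}: bbox is not [x, y, width, height]")
    return problems


def load_and_check(path):
    """Read a JSON annotation file from disk and run coco_problems on it."""
    with open(path, encoding="utf-8") as handle:
        return coco_problems(json.load(handle))
```

Running `load_and_check` on instances_train.json and instances_val.json should return empty lists if the renamed files still reference valid image and category ids.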
