OTX v2.1 detection train: "RuntimeError: Cannot obtain best_confidence_threshold updated previously. Please execute self.update(best_confidence_threshold=None) first." #3311

Closed
goodsong81 opened this issue Apr 12, 2024 · 6 comments

@goodsong81
Contributor

Describe the bug

Traceback (most recent call last):
  File "/home/songkich/miniconda3/envs/otx-v2/bin/otx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/sdb/workarea/otx/src/otx/cli/__init__.py", line 17, in main
    OTXCLI()
  File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 59, in __init__
    self.run()
  File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 522, in run
    fn(**fn_kwargs)
  File "/mnt/sdb/workarea/otx/src/otx/engine/engine.py", line 280, in train
    self.trainer.fit(
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 137, in run
    self.on_advance_end(data_fetcher)
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 285, in on_advance_end
    self.val_loop.run()
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/songkich/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/mnt/sdb/workarea/otx/src/otx/core/model/base.py", line 240, in on_validation_epoch_end
    self._log_metrics(self.metric, "val")
  File "/mnt/sdb/workarea/otx/src/otx/core/model/detection.py", line 152, in _log_metrics
    if best_confidence_threshold := getattr(meter, "best_confidence_threshold", None):
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/sdb/workarea/otx/src/otx/core/metrics/fmeasure.py", line 764, in best_confidence_threshold
    raise RuntimeError(msg)
RuntimeError: Cannot obtain best_confidence_threshold updated previously. Please execute self.update(best_confidence_threshold=None) first.
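
The last two frames show why the `None` default in `getattr(meter, "best_confidence_threshold", None)` does not help: `getattr` only falls back to its default on `AttributeError`, so a `RuntimeError` raised inside a property propagates to the caller. A minimal sketch of that behavior follows. The `Meter` class and `read_threshold` helper are hypothetical stand-ins, not the OTX `fmeasure.py` code, and the guard shown is only one possible workaround, not necessarily how the project resolved it.

```python
class Meter:
    """Hypothetical stand-in for a metric object (not the OTX FMeasure class)."""

    def __init__(self):
        self._best_confidence_threshold = None

    @property
    def best_confidence_threshold(self):
        # Mirrors the shape of the failure in the traceback: the property
        # raises RuntimeError (not AttributeError) when no value is stored.
        if self._best_confidence_threshold is None:
            msg = "Cannot obtain best_confidence_threshold updated previously."
            raise RuntimeError(msg)
        return self._best_confidence_threshold


def read_threshold(meter):
    """Return the stored threshold, or None if it is not available yet."""
    try:
        return meter.best_confidence_threshold
    except RuntimeError:
        return None


meter = Meter()

# getattr's default covers AttributeError only, so the RuntimeError escapes.
try:
    getattr(meter, "best_confidence_threshold", None)
    propagated = False
except RuntimeError:
    propagated = True
```

With this sketch, `propagated` ends up `True`, while `read_threshold(meter)` returns `None` until a threshold has been stored.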

Steps to Reproduce

On the latest develop (3266f24)

otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0

Environment:

  • OS:
  • Framework version:
  • Python version:
  • OpenVINO version:
  • CUDA/cuDNN version:
  • GPU model and memory:
@goodsong81 added the BUG (Something isn't working) label on Apr 12, 2024
@goodsong81 added this to the 2.1.0 milestone on Apr 12, 2024
@jaegukhyun
Contributor

#3302 will fix this

@jaegukhyun
Contributor

#3302 is merged. @goodsong81 Could you check?

@goodsong81
Contributor Author

We've got a different error. Did I do something wrong?

$ otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0
...
Traceback (most recent call last):
  File "/home/songkich/miniconda3/envs/otx-v2/bin/otx", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/sdb/workarea/otx/src/otx/cli/__init__.py", line 17, in main
    OTXCLI()
  File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 59, in __init__
    self.run()
  File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 522, in run
    fn(**fn_kwargs)
  File "/mnt/sdb/workarea/otx/src/otx/engine/engine.py", line 286, in train
    self.model.load_state_dict_incrementally(ckpt)
  File "/mnt/sdb/workarea/otx/src/otx/core/model/base.py", line 369, in load_state_dict_incrementally
    raise ValueError(msg, ckpt_label_info)
ValueError: ('Checkpoint should have `label_info`.', None)
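
The tuple-like message above is what Python prints when an exception is constructed with two arguments, as in the `raise ValueError(msg, ckpt_label_info)` frame: the second element, `None`, is the label info that was looked up in the checkpoint. A minimal sketch that reproduces the same message shape is below. The `require_label_info` function and the top-level `"label_info"` key are assumptions for illustration; the actual checkpoint layout read by `load_state_dict_incrementally` may differ.

```python
def require_label_info(ckpt):
    """Return the checkpoint's label info, or raise in the two-argument style seen above.

    Assumption: label info is stored under a top-level "label_info" key.
    """
    label_info = ckpt.get("label_info")
    if label_info is None:
        msg = "Checkpoint should have `label_info`."
        # Passing two arguments makes str(err) render as a tuple.
        raise ValueError(msg, label_info)
    return label_info


# A checkpoint-like dict without label info, e.g. one written by older code.
try:
    require_label_info({"state_dict": {}})
except ValueError as err:
    rendered = str(err)
```

Here `rendered` has the same form as the final line of the traceback, which suggests the loaded checkpoint simply carried no label info.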

@jaegukhyun
Contributor

@vinnamkim Could you take a look at this?

@vinnamkim
Contributor

In my local environment (bec2046), it works:

(otx-v2) vinnamki@vinnamki:~/otx/training_extensions$ otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:274: UserWarning: Load default config from /tmp/yolox/.latest/train/configs.yaml.
  warn(f"Load default config from {self.cache_dir / 'configs.yaml'}.", stacklevel=0)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:267: UserWarning: Load default checkpoint from /tmp/yolox/.latest/train/checkpoints/epoch_000.ckpt.
  warn(f"Load default checkpoint from {latest_checkpoint}.", stacklevel=0)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:274: UserWarning: Load default config from /tmp/yolox/.latest/train/configs.yaml.
  warn(f"Load default config from {self.cache_dir / 'configs.yaml'}.", stacklevel=0)
                                                                                                                                                                                   
                                                                                                                                                                                   
                                                                            ██████╗  ████████╗ ██╗  ██╗                                                                            
                                                                           ██╔═══██╗ ╚══██╔══╝ ╚██╗██╔╝                                                                            
                                                                           ██║   ██║    ██║     ╚███╔╝                                                                             
                                                                           ██║   ██║    ██║     ██╔██╗                                                                             
                                                                           ╚██████╔╝    ██║    ██╔╝ ██╗                                                                            
                                                                            ╚═════╝     ╚═╝    ╚═╝  ╚═╝                                                                            
                                                                                                                                                                                   
                                                                                   ver.2.1.0rc0                                                                                    
Seed set to 0
/home/vinnamki/otx/training_extensions/src/otx/core/data/module.py:61: UserWarning: There are empty annotation items in train set, Of these, only 0.0% are used.
  dataset = pre_filtering(dataset, self.config.data_format, self.config.unannotated_items_ratio)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:381: UserWarning: The `num_classes` in dataset is 3 but, the `num_classes` of model is 80. So, Update `model.num_classes` to 3.
  warn(warning_msg, stacklevel=0)
Loads checkpoint by http backend from path: https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/models/object_detection/v2/yolox_tiny_8x8.pth
The model and loaded state dict do not match exactly

size mismatch for bbox_head.multi_level_conv_cls.0.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.0.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for bbox_head.multi_level_conv_cls.1.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.1.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for bbox_head.multi_level_conv_cls.2.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/fabric/connector.py:565: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
WARNING:root:Model label_info is not equal to the Datamodule label_info. It will be overriden: LabelInfo(label_names=['label_0', 'label_1', 'label_2'], label_groups=[['label_0', 'label_1', 'label_2']]) => LabelInfo(label_names=['car', 'tree', 'bug'], label_groups=[['car', 'tree', 'bug']])
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: /tmp/yolox/20240416_051915/csv/
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /tmp/yolox/20240416_051915 exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
┏━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃   ┃ Name  ┃ Type  ┃ Params ┃
┡━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ 0 │ model │ YOLOX │  5.0 M │
└───┴───────┴───────┴────────┘
Trainable params: 5.0 M                                                                                                                                                            
Non-trainable params: 0                                                                                                                                                            
Total params: 5.0 M                                                                                                                                                                
Total estimated model params size (MB): 20                                                                                                                                         
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/utils/data/sampler.py:64: UserWarning: `data_source` argument is not used and will be removed in 2.2.0.You may still have custom implementation that utilizes it.
  warnings.warn("`data_source` argument is not used and will be removed in 2.2.0."
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
WARNING:root:Trainer.log_every_n_steps is higher than the number of iterations in a training epoch. To ensure logging at the last batch, temporarily update Trainer.log_every_n_steps: 50 => 1
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1699449183005/work/aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0/0  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:01 • 0:00:00 0.00it/s v_num: 0 train/loss_cls: 2.606 train/loss_bbox: 2.909 train/loss_obj: 4.665 train/loss: 10.180 val/f1-score: 0.000
Elapsed time: 0:00:04.214605

Please clean up the /tmp/yolox directory and try again.

@goodsong81
Contributor Author

Cleaning up the work_dir worked. Thanks!
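
For reference, the log earlier in the thread shows the CLI loading a default config and checkpoint from `/tmp/yolox/.latest/train/`, so files left by an earlier run in the same `work_dir` can be picked up by a new one. A minimal sketch of clearing the directory before retraining is below. The `clean_work_dir` helper is hypothetical (it is not part of the `otx` CLI), and the demo runs against a throwaway temporary directory that only mimics the path from the log.

```python
import shutil
import tempfile
from pathlib import Path


def clean_work_dir(work_dir):
    """Delete work_dir and everything under it; return True if it existed."""
    path = Path(work_dir)
    if not path.exists():
        return False
    shutil.rmtree(path)
    return True


# Demo on a temporary directory that imitates the layout seen in the log.
demo_root = Path(tempfile.mkdtemp()) / "yolox"
stale_ckpt = demo_root / ".latest" / "train" / "checkpoints" / "epoch_000.ckpt"
stale_ckpt.parent.mkdir(parents=True)
stale_ckpt.write_bytes(b"stale")

removed = clean_work_dir(demo_root)
```

In the scenario from this issue, the equivalent call would be `clean_work_dir("/tmp/yolox")` before rerunning `otx train`. Pointing `--work_dir` at a fresh path is another way to avoid picking up old files.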
