OTX v2.1 detection train: "RuntimeError: Cannot obtain best_confidence_threshold updated previously. Please execute self.update(best_confidence_threshold=None) first." #3311
Comments
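(Context for the message in the issue title: it is the usual guard on a stateful metric whose best threshold is cached during `update()` and read back later. A minimal sketch of that pattern follows; this is illustrative only, and the class and helper names are assumptions, not the actual OTX `FMeasure` code.)

```python
from __future__ import annotations


class FMeasureSketch:
    """Illustrative stateful metric; NOT the OTX implementation."""

    def __init__(self) -> None:
        self._best_confidence_threshold: float | None = None

    def update(self, best_confidence_threshold: float | None = None) -> None:
        # Hypothetical semantics: passing None asks the metric to search for
        # the best threshold itself; passing a float pins it.
        if best_confidence_threshold is None:
            best_confidence_threshold = self._search_best_threshold()
        self._best_confidence_threshold = best_confidence_threshold

    def _search_best_threshold(self) -> float:
        # Placeholder for a sweep over candidate confidence thresholds.
        return 0.5

    @property
    def best_confidence_threshold(self) -> float:
        # Reading the cached value before any update() raises the error
        # quoted in the issue title.
        if self._best_confidence_threshold is None:
            msg = (
                "Cannot obtain best_confidence_threshold updated previously. "
                "Please execute self.update(best_confidence_threshold=None) first."
            )
            raise RuntimeError(msg)
        return self._best_confidence_threshold
```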
#3302 will fix this.
#3302 is merged. @goodsong81 Could you check?
We've got a different error. Did I do something wrong?
$ otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0
...
Traceback (most recent call last):
File "/home/songkich/miniconda3/envs/otx-v2/bin/otx", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/sdb/workarea/otx/src/otx/cli/__init__.py", line 17, in main
OTXCLI()
File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 59, in __init__
self.run()
File "/mnt/sdb/workarea/otx/src/otx/cli/cli.py", line 522, in run
fn(**fn_kwargs)
File "/mnt/sdb/workarea/otx/src/otx/engine/engine.py", line 286, in train
self.model.load_state_dict_incrementally(ckpt)
File "/mnt/sdb/workarea/otx/src/otx/core/model/base.py", line 369, in load_state_dict_incrementally
raise ValueError(msg, ckpt_label_info)
ValueError: ('Checkpoint should have `label_info`.', None)
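(The traceback points at an incremental state-dict load that rejects checkpoints without label metadata. A rough reconstruction of that guard is sketched below; the `label_info` key and the surrounding logic are assumptions inferred from the error text, not the actual OTX source.)

```python
from typing import Any


def check_ckpt_label_info(ckpt: dict[str, Any]) -> Any:
    """Hypothetical reconstruction of the guard seen in the traceback.

    A checkpoint written before `label_info` was saved yields None here,
    which matches the ('Checkpoint should have `label_info`.', None)
    pair in the ValueError above.
    """
    ckpt_label_info = ckpt.get("label_info")
    if ckpt_label_info is None:
        msg = "Checkpoint should have `label_info`."
        raise ValueError(msg, ckpt_label_info)
    return ckpt_label_info
```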
@vinnamkim Could you take a look at this?
It works on my local machine (bec2046):
(otx-v2) vinnamki@vinnamki:~/otx/training_extensions$ otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:274: UserWarning: Load default config from /tmp/yolox/.latest/train/configs.yaml.
warn(f"Load default config from {self.cache_dir / 'configs.yaml'}.", stacklevel=0)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:267: UserWarning: Load default checkpoint from /tmp/yolox/.latest/train/checkpoints/epoch_000.ckpt.
warn(f"Load default checkpoint from {latest_checkpoint}.", stacklevel=0)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:274: UserWarning: Load default config from /tmp/yolox/.latest/train/configs.yaml.
warn(f"Load default config from {self.cache_dir / 'configs.yaml'}.", stacklevel=0)
██████╗ ████████╗ ██╗ ██╗
██╔═══██╗ ╚══██╔══╝ ╚██╗██╔╝
██║ ██║ ██║ ╚███╔╝
██║ ██║ ██║ ██╔██╗
╚██████╔╝ ██║ ██╔╝ ██╗
╚═════╝ ╚═╝ ╚═╝ ╚═╝
ver.2.1.0rc0
Seed set to 0
/home/vinnamki/otx/training_extensions/src/otx/core/data/module.py:61: UserWarning: There are empty annotation items in train set, Of these, only 0.0% are used.
dataset = pre_filtering(dataset, self.config.data_format, self.config.unannotated_items_ratio)
/home/vinnamki/otx/training_extensions/src/otx/cli/cli.py:381: UserWarning: The `num_classes` in dataset is 3 but, the `num_classes` of model is 80. So, Update `model.num_classes` to 3.
warn(warning_msg, stacklevel=0)
Loads checkpoint by http backend from path: https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/models/object_detection/v2/yolox_tiny_8x8.pth
The model and loaded state dict do not match exactly
size mismatch for bbox_head.multi_level_conv_cls.0.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.0.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for bbox_head.multi_level_conv_cls.1.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.1.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
size mismatch for bbox_head.multi_level_conv_cls.2.weight: copying a param with shape torch.Size([80, 96, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 96, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([3]).
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/fabric/connector.py:565: `precision=16` is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
WARNING:root:Model label_info is not equal to the Datamodule label_info. It will be overriden: LabelInfo(label_names=['label_0', 'label_1', 'label_2'], label_groups=[['label_0', 'label_1', 'label_2']]) => LabelInfo(label_names=['car', 'tree', 'bug'], label_groups=[['car', 'tree', 'bug']])
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Missing logger folder: /tmp/yolox/20240416_051915/csv/
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /tmp/yolox/20240416_051915 exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
┏━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃
┡━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ 0 │ model │ YOLOX │ 5.0 M │
└───┴───────┴───────┴────────┘
Trainable params: 5.0 M
Non-trainable params: 0
Total params: 5.0 M
Total estimated model params size (MB): 20
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/utils/data/sampler.py:64: UserWarning: `data_source` argument is not used and will be removed in 2.2.0.You may still have custom implementation that utilizes it.
warnings.warn("`data_source` argument is not used and will be removed in 2.2.0."
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
WARNING:root:Trainer.log_every_n_steps is higher than the number of iterations in a training epoch. To ensure logging at the last batch, temporarily update Trainer.log_every_n_steps: 50 => 1
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the
indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1699449183005/work/aten/src/ATen/native/TensorShape.cpp:3526.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/home/vinnamki/miniconda3/envs/otx-v2/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`.
In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the
first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0/0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:01 • 0:00:00 0.00it/s v_num: 0 train/loss_cls: 2.606 train/loss_bbox: 2.909 train/loss_obj: 4.665 train/loss: 10.180
`Trainer.fit` stopped: `max_epochs=1` reached.
val/f1-score: 0.000
Elapsed time: 0:00:04.214605
Please clean up your work_dir and try again.
Cleaning up the work_dir worked. Thanks!
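(For anyone else landing here: the UserWarning lines in the log above show that `otx train` auto-resumes from `work_dir/.latest` — both the cached `configs.yaml` and the latest checkpoint. A checkpoint left behind by an older build presumably predates the `label_info` field and trips the ValueError. Removing the stale work dir before re-running avoids it:
$ rm -rf /tmp/yolox
and then re-run the same otx train command as above.)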
Describe the bug
Steps to Reproduce
On the latest develop (3266f24)
otx train --config src/otx/recipe/detection/yolox_tiny.yaml --data_root tests/assets/car_tree_bug --work_dir /tmp/yolox --metric otx.core.metrics.fmeasure.FMeasureCallable --callback_monitor val/f1-score --model.scheduler.monitor val/f1-score --max_epochs 1 --seed 0
Environment: