Jinrong Yang¹, Shengkai Wu¹, Lijun Gou¹, Hangcheng Yu¹, Chenxi Lin¹, Jiazhuo Wang¹, Pan Wang¹, Minxuan Li², Xiaoping Li¹
¹ State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, China
² Faculty of Arts and Science, Queen’s University, Canada
Carton detection is an important technique in automated logistics systems and supports applications such as stacking and unstacking cartons and unloading cartons from containers. However, no public large-scale carton dataset has so far been available for the research community to train and evaluate carton detection models, which hinders progress in carton detection. In this paper, we present a large-scale carton dataset named the Stacked Carton Dataset (SCD), with the goal of advancing the state of the art in carton detection. Images are collected from the internet and several warehouses, and objects are labeled with per-instance segmentation for precise localization. In total, there are 250,000 instance masks in 16,136 images. In addition, we design a carton detector based on RetinaNet by embedding a Boundary Guided Supervision (BGS) module and an Offset Prediction between Classification and Localization (OPCL) module. OPCL alleviates the imbalance between classification and localization quality, boosting AP by 3.1% ~ 4.7% on SCD, while BGS guides the detector to pay more attention to the boundary information of cartons and to decouple repeated carton textures. To demonstrate that OPCL generalizes to other datasets, we conduct extensive experiments on MS COCO and PASCAL VOC, where AP improves by 1.8% ~ 2.2% and 3.4% ~ 4.3%, respectively.
Example of instance annotation in SCD. The first row shows the four label styles used in LSCD, while the second row shows the single label style used in OSCD. In the first row, blue, green, red and yellow represent Carton-inner-all, Carton-inner-occlusion, Carton-outer-all and Carton-outer-occlusion, respectively.
Dataset | Images | Split(training/test set) | Labels | All/Occlusion | Inner/Outer | Total Instances | Average Instances |
---|---|---|---|---|---|---|---|
LSCD | 7,735 | 6,735/1,000 | 4&1 | √ | √ | 81,870 | 10.58 |
OSCD | 8,401 | 7,401/1,000 | 1 | × | × | 168,748 | 20.09 |
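The per-image averages and overall totals follow directly from the counts in the table. As a quick sanity check, the arithmetic can be sketched as below; the dictionary layout and variable names are only illustrative and are not part of the dataset tooling.

```python
# Counts copied from the table above; the structure itself is illustrative.
datasets = {
    "LSCD": {"train": 6735, "test": 1000, "instances": 81870},
    "OSCD": {"train": 7401, "test": 1000, "instances": 168748},
}

summary = {}
for name, d in datasets.items():
    images = d["train"] + d["test"]
    # Average number of annotated instances per image, rounded as in the table.
    summary[name] = (images, round(d["instances"] / images, 2))

total_images = sum(images for images, _ in summary.values())
total_instances = sum(d["instances"] for d in datasets.values())

for name, (images, avg) in summary.items():
    print(f"{name}: {images} images, {avg} instances per image")
print(f"SCD: {total_images} images, {total_instances} instances")
```

The two subsets together give 16,136 images and 250,618 instances, which matches the rounded figure of 250,000 instance masks quoted in the abstract.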
OSCD:
(1) OSCD => "Images and COCO-style labels" (password: XXXX)
LSCD:
(1) LSCD => "Images and LabelMe-style labels" (password: XXXX)
(2) LSCD => "Images and COCO-style labels(containing Carton-inner-all, Carton-inner-occlusion, Carton-outer-all and Carton-outer-occlusion)" (password: XXXX)
(3) LSCD => "Images and COCO-style labels(only containing carton)" (password: XXXX)
*Notice: The dataset must be downloaded through Baidu Drive. To request access, email us and state your purpose; we will send you the password within 3 days ([email protected], [email protected]).
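Once a COCO-style package has been downloaded, the annotation file can be inspected with the standard library alone. The following is a minimal sketch, assuming the file follows the usual COCO layout with `images`, `categories` and `annotations` keys. The helper names, the `carton` label name and the in-memory example are hypothetical, and the real file names inside the archives are not specified here. The second helper illustrates how the four-label annotations could be collapsed into a single carton label, similar in spirit to package (3).

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO-style annotation file (path is supplied by the user)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_instances(coco):
    """Count annotated instances per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


def merge_to_single_label(coco, name="carton"):
    """Return a copy in which every annotation is mapped to one category."""
    merged = dict(coco)
    merged["categories"] = [{"id": 1, "name": name}]
    merged["annotations"] = [dict(a, category_id=1) for a in coco["annotations"]]
    return merged


# Tiny in-memory stand-in for a downloaded file; all values are made up.
example = {
    "images": [{"id": 1, "width": 640, "height": 480, "file_name": "0001.jpg"}],
    "categories": [
        {"id": 1, "name": "Carton-inner-all"},
        {"id": 2, "name": "Carton-inner-occlusion"},
        {"id": 3, "name": "Carton-outer-all"},
        {"id": 4, "name": "Carton-outer-occlusion"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 80]},
        {"id": 2, "image_id": 1, "category_id": 4, "bbox": [200, 50, 120, 90]},
    ],
}

per_label = count_instances(example)
single = merge_to_single_label(example)
print(per_label)
print(count_instances(single))
```

For a real download, replace `example` with `load_coco(...)` pointed at the extracted annotation file. The merge leaves the original dictionary untouched, so both label settings can be evaluated from one file.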
The first row shows the statistical distribution of LSCD, while the second row shows that of OSCD. From left to right, the charts show the width, height, aspect ratio and pixel area of the instances and the number of objects in each image. Note that the width, height and area of each instance are normalized by the width and height of the corresponding image. A log function is applied to normalize the aspect ratio.
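The normalization described in this caption can be sketched for a single instance as follows. This is an illustration under stated assumptions, not the script used to produce the charts: the box is assumed to be in COCO `[x, y, width, height]` form, the aspect ratio is taken as width over height with a natural logarithm, and the box area stands in for the mask pixel area. The function name is hypothetical.

```python
import math


def normalized_box_stats(bbox, img_w, img_h):
    """Normalize one instance's size statistics by the image size.

    bbox is assumed to be [x, y, width, height] in pixels.
    """
    _, _, w, h = bbox
    return {
        "width": w / img_w,                   # fraction of image width
        "height": h / img_h,                  # fraction of image height
        "log_aspect_ratio": math.log(w / h),  # 0 for a square box
        "area": (w * h) / (img_w * img_h),    # fraction of image area
    }


stats = normalized_box_stats([10, 20, 160, 120], img_w=640, img_h=480)
print(stats)
```

Taking the log of the aspect ratio makes wide and tall boxes symmetric around zero, which is presumably why it is used for the histogram.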
Dataset | Labels | Model | mAP | AP50 | AP75 |
---|---|---|---|---|---|
OSCD | 1 | RetinaNet | 72.1 | 90.8 | 80.5 |
OSCD | 1 | RetinaNet+ | 76.6 | 91.8 | 83.6 |
OSCD | 1 | FCOS | 72.8 | 91.1 | 80.6 |
OSCD | 1 | Faster R-CNN | 69.0 | 90.1 | 77.8 |
LSCD | 1 | RetinaNet | 79.8 | 95.2 | 87.9 |
LSCD | 1 | RetinaNet+ | 84.7 | 95.8 | 89.8 |
LSCD | 1 | FCOS | 76.5 | 93.7 | 84.3 |
LSCD | 1 | Faster R-CNN | 77.5 | 94.5 | 86.3 |
LSCD | 4 | RetinaNet | 65.7 | 80.4 | 73.0 |
LSCD | 4 | RetinaNet+ | 69.9 | 80.0 | 74.9 |
LSCD | 4 | FCOS | 68.1 | 81.2 | 74.8 |
LSCD | 4 | Faster R-CNN | 61.2 | 79.5 | 70.1 |
LSCD+OSCD | 1 | RetinaNet | 82.0 | 95.9 | 89.8 |
LSCD+OSCD | 1 | RetinaNet+ | 86.1 | 96.3 | 91.2 |
LSCD+OSCD | 1 | FCOS | 83.8 | 96.2 | 90.4 |
LSCD+OSCD | 1 | Faster R-CNN | 80.6 | 95.7 | 89.2 |
LSCD+OSCD | 4 | RetinaNet | 67.4 | 80.8 | 74.1 |
LSCD+OSCD | 4 | RetinaNet+ | 71.5 | 80.9 | 76.4 |
LSCD+OSCD | 4 | FCOS | 71.1 | 82.0 | 76.8 |
LSCD+OSCD | 4 | Faster R-CNN | 64.7 | 81.2 | 73.7 |
Comparison of detection performance of three state-of-the-art methods on SCD. For LSCD, both the 1-label and 4-label settings are evaluated. LSCD+OSCD means that the detectors are first pre-trained on OSCD and then fine-tuned on LSCD. RetinaNet+ denotes RetinaNet trained with the GIoU loss.
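For readers unfamiliar with the GIoU loss used by RetinaNet+, the quantity can be sketched in plain Python for a single pair of boxes. This follows the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union) and is not taken from the authors' training code. Boxes are assumed to be in `(x1, y1, x2, y2)` form with positive area, and the function names are illustrative.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection is zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose


def giou_loss(pred, target):
    """Loss in [0, 2]: 0 for a perfect match, larger as boxes move apart."""
    return 1.0 - giou(pred, target)


print(giou((0, 0, 2, 2), (0, 0, 2, 2)))
print(giou((0, 0, 2, 2), (1, 1, 3, 3)))
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```

Unlike plain IoU, GIoU remains informative when the predicted and target boxes do not overlap, since the enclosing-box term still changes as the boxes move closer.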
Main results of RetinaNet with all our proposed modules. "pretrain" means pre-training the same model on OSCD and fine-tuning it on LSCD with an image scale of [600, 1000] ([800, 1333]†). "1x" means the model is trained for a total of 12 epochs.
If you have built a model on the training set and it performs well on the validation set, we encourage you to run it on the test set. You can submit your results to the SCD leaderboard by creating a new issue. Your results will be ranked on the leaderboard so that you can benchmark your approach against those of others. We look forward to your submission. Please click here to submit.
The dataset is free for academic use; please do not use it for commercial purposes. Use it at your own risk. For other purposes, please contact the corresponding author Pan Wang or Jinrong Yang ([email protected], [email protected]).