Commit

[add] add evaluation on GDINO, SPHINX and a LEADERBOARD
Charles-Xie committed Feb 15, 2024
1 parent 997d968 commit 0da1221
Showing 11 changed files with 735 additions and 103 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -158,4 +158,7 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
#.idea/

# mac system
*.DS_Store
34 changes: 27 additions & 7 deletions README.md
@@ -8,7 +8,7 @@
The repo is the toolbox for <b>D<sup>3</sup></b>
<br />
<a href="doc.md"><strong> [Doc 📚]</strong></a>
<a href="https://huggingface.co/datasets/zbrl/d-cube"><strong> [HuggingFace 🤗]</strong></a>
<!-- <a href="https://huggingface.co/datasets/zbrl/d-cube"><strong> [HuggingFace 🤗]</strong></a> -->
<a href="https://arxiv.org/abs/2307.12813"><strong> [Paper (DOD) 📄] </strong></a>
<a href="https://arxiv.org/abs/2305.12452"><strong> [Paper (GRES) 📄] </strong></a>
<a href="https://github.com/Charles-Xie/awesome-described-object-detection"><strong> [Awesome-DOD 🕶️] </strong></a>
@@ -20,25 +20,32 @@
Description Detection Dataset ($D^3$, /dikju:b/) is an attempt at creating a next-generation object detection dataset. Unlike traditional detection datasets, the class names of the objects are no longer simple nouns or noun phrases, but rather complex and descriptive, such as `a dog not being held by a leash`. For each image in the dataset, any object that matches the description is annotated. The dataset provides annotations such as bounding boxes and finely crafted instance masks. We believe it will contribute to computer vision and vision-language communities.



# News
- [10/12/2023] We released an [awesome-described-object-detection](https://github.com/Charles-Xie/awesome-described-object-detection) list to collect and track related works. The paper is renamed as *Described Object Detection: Liberating Object Detection with Flexible Expressions* ([arxiv](https://arxiv.org/abs/2307.12813)).
- [02/14/2024] Evaluation of several SOTA methods (SPHINX (the first MLLM evaluated!), G-DINO, UNINEXT, etc.) is released, together with a [leaderboard](https://github.com/shikras/d-cube/tree/main/eval_sota) for $D^3$. :fire::fire:

- [10/12/2023] We released an [awesome-described-object-detection](https://github.com/Charles-Xie/awesome-described-object-detection) list to collect and track related works.

- [09/22/2023] Our DOD [paper](https://arxiv.org/abs/2307.12813) just got accepted by NeurIPS 2023! :fire:

- [07/25/2023] This toolkit is available on PyPI now. You can install this repo with `pip install ddd-dataset`.

- [07/25/2023] The [paper preprint](https://arxiv.org/abs/2307.12813) of *Exposing the Troublemakers in Described Object Detection*, introducing the DOD task and the $D^3$ dataset, is available on arxiv. Check it out!
- [07/25/2023] The [paper preprint](https://arxiv.org/abs/2307.12813) introducing the DOD task and the $D^3$ dataset is available on arXiv. Check it out!

- [07/18/2023] We have released our Description Detection Dataset ($D^3$) and the first version of $D^3$ toolbox. You can download it now for your project.

- [07/14/2023] Our GRES [paper](https://arxiv.org/abs/2305.12452) has been accepted by ICCV 2023.



# Contents
- [Dataset Highlight](#task-and-dataset-highlight)
- [Download](#download)
- [Installation](#installation)
- [Usage](#usage)



# Task and Dataset Highlight

The $D^3$ dataset is meant for the Described Object Detection (DOD) task. In the image below we show the difference between Referring Expression Comprehension (REC), Object Detection/Open-Vocabulary Detection (OVD) and Described Object Detection (DOD). OVD detects objects based on a category name, and each category can have zero to multiple instances; REC grounds one region based on a language description, whether the described object truly exists or not; DOD detects all instances in each image of the dataset, based on a flexible reference. Related works are tracked in the [awesome-DOD](https://github.com/Charles-Xie/awesome-described-object-detection) list.
@@ -47,17 +54,24 @@ The $D^3$ dataset is meant for the Described Object Detection (DOD) task. In the

For more information on the characteristics of this dataset, please refer to our paper.



# Download
Currently we host the $D^3$ dataset on cloud drives. You can download the dataset from [Google Drive](https://drive.google.com/drive/folders/11kfY12NzKPwsliLEcIYki1yUqt7PbMEi?usp=sharing) or [Baidu Pan]().

After downloading `d3_images.zip` (images in the dataset), `d3_pkl.zip` (dataset information for this toolkit) and `d3_json.zip` (annotations for evaluation), please extract these 3 zip files to your custom `IMG_ROOT`, `PKL_PATH` and `JSON_ANNO_PATH` directories. These paths will be used when you perform inference or evaluation on this dataset.
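
If you prefer to script this step, here is a minimal sketch (not part of the toolkit) that extracts the three archives; the target directories below are placeholder paths you should replace with your own `IMG_ROOT`, `PKL_PATH` and `JSON_ANNO_PATH`.

```python
import zipfile
from pathlib import Path

# placeholder paths: replace with your own IMG_ROOT, PKL_PATH and JSON_ANNO_PATH
ARCHIVES = {
    "d3_images.zip": Path("data/d3_images"),  # images in the dataset
    "d3_pkl.zip": Path("data/d3_pkl"),        # dataset information for this toolkit
    "d3_json.zip": Path("data/d3_json"),      # annotations for evaluation
}

for zip_name, target_dir in ARCHIVES.items():
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_name) as zf:
        zf.extractall(target_dir)  # extract each archive into its directory
```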



# Installation

## Prerequisites
This toolkit requires a few Python packages such as `numpy` and `pycocotools`. Other packages like `matplotlib` and `opencv-python` may also be required if you want to use the visualization scripts.

There are three ways to install $D^3$ toolbox, and the third one (with huggingface) is currently in the works and will be available soon.
<!-- There are three ways to install $D^3$ toolbox, and the third one (with huggingface) is currently in the works and will be available soon. -->

There are multiple ways to install the $D^3$ toolbox, as listed below:


## Install with pip
```bash
@@ -75,10 +89,12 @@ python -m pip install .
# option 2: just put the d-cube/d_cube directory in the root directory of your local repository
```

## Via HuggingFace Datasets 🤗
<!-- ## Via HuggingFace Datasets 🤗
```bash
coming soon
```
``` -->



# Usage
Please refer to the [documentation 📚](doc.md) for more details.
@@ -93,8 +109,12 @@ all_img_info = d3.load_imgs(all_img_ids) # load images by passing a list of som
img_path = all_img_info[0]["file_name"] # obtain one image path so you can load it and run inference
```
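
Continuing from `img_path` above, a minimal hypothetical sketch for loading the image is given below; it assumes `file_name` is a path relative to your `IMG_ROOT`, and the detector call is left as a placeholder.

```python
import os
from PIL import Image

IMG_ROOT = "data/d3_images"  # placeholder: the directory you extracted d3_images.zip into

# `img_path` comes from the snippet above; assumed to be relative to IMG_ROOT
image = Image.open(os.path.join(IMG_ROOT, img_path)).convert("RGB")
# run your own REC/OVD/DOD model on `image` together with the language references
```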

Some frequently asked questions are answered in [this Q&A file](./qa.md).

# Citation

If you use our $D^3$ dataset, this toolbox, or otherwise find our work valuable, please cite [our paper](https://arxiv.org/abs/2307.12813):

```bibtex
@inproceedings{xie2023DOD,
title={Described Object Detection: Liberating Object Detection with Flexible Expressions},
Expand All @@ -111,4 +131,4 @@ If you use our $D^3$ dataset, this toolbox, or otherwise find our work valuable,
}
```

More works related to Described Object Detection are tracked in this list: [awesome-described-object-detection](https://github.com/Charles-Xie/awesome-described-object-detection).
More works related to Described Object Detection are tracked in this list: [awesome-described-object-detection](https://github.com/Charles-Xie/awesome-described-object-detection).
2 changes: 1 addition & 1 deletion d_cube/d3.py
@@ -513,7 +513,7 @@ def stat_description(self, with_rev=False, inter_group=False):
num_img_sent += len(cur_sent_set)
stat_dict["num_img_sent"] = num_img_sent

# Number of anti img-sent pair
# Number of absence img-sent pairs
num_anti_img_sent = 0
for img_id in self.data["images"].keys():
anno_ids = self.get_anno_ids(img_ids=img_id)
84 changes: 73 additions & 11 deletions doc.md
@@ -1,12 +1,17 @@
# $D^3$ Toolkit Documentation


## Table of Contents

- [Inference](#inference-on-d3)
- [Key Concepts](#key-concepts-for-users)
- [Evaluation](#evaluation)
- [Evaluation Settings](#evaluation-settings)
- [Evaluation Code and Examples](#evaluation-code-and-examples)
- [Dataset Statistics](#dataset-statistics)




## Inference on $D^3$

```python
@@ -22,7 +27,7 @@ img_path = all_img_info[0]["file_name"] # obtain one image path so you can load
group_ids = d3.get_group_ids(img_ids=[img_id]) # get the group ids by passing anno ids, image ids, etc.
sent_ids = d3.get_sent_ids(group_ids=group_ids) # get the sentence ids by passing image ids, group ids, etc.
sent_list = d3.load_sents(sent_ids=sent_ids)
ref_list = [sent['raw_sent'] for sent in sent_list]
ref_list = [sent['raw_sent'] for sent in sent_list] # list[str]
# use these language references in `ref_list` as the references to your REC/OVD/DOD model

# save the result to a JSON file
@@ -32,26 +37,26 @@ Concepts and structures of `anno`, `image`, `sent` and `group` are explained in

In [this directory](eval_sota/) we provide the inference (and evaluation) scripts for some existing SOTA OVD/REC methods.



### Output Format
When the inference is done, you need to save a JSON file in the format below (the standard COCO result JSON format):
```json
[
{
"category_id": "int, the value of sent_id, range [1, 422]",
"bbox": "[x1, y1, w, h], predicted by your model, same as COCO result format",
"bbox": "list[int], [x1, y1, w, h], predicted by your model, same as COCO result format, absolute value in the range of [w, h, w, h]",
"image_id": "int, img_id, can be 0, 1, 2, ....",
"score": "float, predicted by your model, no restriction on its absolute value range"
}
]
```
This JSON file should contain a list, where each item in the list is a dictionary of one detection result.
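
As an illustration only, a sketch of collecting predictions into this format is shown below; `predictions` and the output filename are placeholders, and only the key names follow the format above.

```python
import json

# `predictions` is a hypothetical iterable produced by your model:
# tuples of (img_id, sent_id, [x1, y1, w, h], score)
results = []
for img_id, sent_id, box, score in predictions:
    results.append({
        "category_id": sent_id,   # the sent_id, in [1, 422]
        "bbox": box,              # [x1, y1, w, h], absolute pixel values
        "image_id": img_id,
        "score": score,
    })

with open("pred.json", "w") as f:
    json.dump(results, f)  # a flat list of per-detection dictionaries
```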

With this JSON saved, you can evaluate the JSON in the next step. See [the evaluation step](#evaluation).
With this JSON saved, you can evaluate the JSON in the next step. See [the evaluation step](#evaluation-code-and-examples).


### Intra- or Inter-Group Settings

The default evaluation protocol is the intra-group setting, where only certain references are considered for each image. The inter-group setting, where all references in the dataset are considered for each image, can be easily achieved by changing `sent_ids = d3.get_sent_ids(group_ids=group_ids)` to `sent_ids = d3.get_sent_ids()`. This will use all the sentences in the dataset, rather than only the sentences in the group that the image belongs to.


## Key Concepts for Users
@@ -116,7 +121,7 @@ A Python dictionary where the keys are integers and the values are dictionaries
* `id`: an integer representing the ID of the sentence.
* `anno_id`: a list of integers representing the IDs of annotations associated with this sentence.
* `group_id`: a list of integers representing the IDs of groups associated with this sentence.
* `is_negative`: a boolean indicating whether this sentence is anti-expression or not.
* `is_negative`: a boolean indicating whether this sentence is an *absence expression*. `True` means it is an *absence expression*.
* `raw_sent`: a string representing the raw text of the sentence in English.
* `raw_sent_zh`: a string representing the raw text of the sentence in Chinese.

@@ -137,7 +142,7 @@ A Python dictionary where the keys are integers and the values are dictionaries
A Python dictionary where the keys are integers and the values are dictionaries with the following key-value pairs:

* `id`: an integer representing the ID of the group.
* `pos_sent_id`: a list of integers representing the IDs of sentences that have referred objects in the group.
* `pos_sent_id`: a list of integers representing the IDs of sentences that have referred objects in the group.
* `inner_sent_id`: a list of integers representing the IDs of sentences belonging to this group.
* `outer_sent_id`: a list of integers representing the IDs of outer-group sentences that have referred objects in the group.
* `img_id`: a list of integers representing the IDs of images of this group.
@@ -160,9 +165,61 @@ A Python dictionary where the keys are integers and the values are dictionaries
}
```

## Evaluation




## Evaluation Settings


### Intra- or Inter-Group Settings

The default evaluation protocol is the intra-group setting, where only certain references are evaluated for each image.

In the $D^3$ dataset, images are collected for different groups (scenarios), and the categories (descriptions) are designed based on these scenarios. In the intra-group setting, each image is only evaluated with the descriptions from the group it belongs to. We call this the **intra-scenario setting**.

Note that every category is actually annotated on every image (with positive or negative instances).
So you can also evaluate all categories on all images, just like traditional detection datasets. We call this the **inter-scenario setting**.
This is quite challenging for the DOD task, as current methods produce many false positives under it.

For intra-group evaluation, you should use:
```python
sent_ids = d3.get_sent_ids(group_ids=group_ids)
# only get the refs (sents) of the group the image belongs to (usually 4 refs per group)
```

For inter-group evaluation, change the corresponding code to:

```python
sent_ids = d3.get_sent_ids()
# get all the refs in the dataset
```

This will use all the sentences in the dataset, rather than only the sentences in the group that the image belongs to.

This is the only difference in implementation and evaluation. No further code changes are needed.

For more information, you can refer to Section 3.4 of the DOD paper.


### FULL, PRES and ABS

FULL, PRES and ABS denote the full descriptions (422 categories), presence descriptions (316 categories) and absence descriptions (106 categories), respectively.

Absence descriptions are descriptions involving the absence of some concepts, such as lacking certain relationships, attributes or objects. For example, "dog *without* leash", "person *without* helmet" and "a hat that is *not* blue" are absence descriptions.
Similarly, descriptions involving *only* the presence of some concepts are presence descriptions.

Most existing REC datasets have presence descriptions but few absence descriptions.

For more details and the motivation for evaluating absence descriptions, please refer to Section 3.1 of the DOD paper.
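
For illustration, the PRES/ABS split can be recovered from the toolkit itself via the `is_negative` field documented above; the sketch below is ours, assuming `d3` has been constructed as in the inference section.

```python
# assumes `d3` has been constructed as in the inference section above
all_sent_ids = d3.get_sent_ids()                  # FULL: every description in the dataset
all_sents = d3.load_sents(sent_ids=all_sent_ids)

abs_sent_ids = [s["id"] for s in all_sents if s["is_negative"]]       # ABS subset
pres_sent_ids = [s["id"] for s in all_sents if not s["is_negative"]]  # PRES subset

# expected sizes: 422 (FULL), 316 (PRES), 106 (ABS)
print(len(all_sent_ids), len(pres_sent_ids), len(abs_sent_ids))
```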




## Evaluation Code and Examples

In this part, we introduce how to evaluate performance and obtain the metric values given a prediction JSON file.

### Write a Snippet in Your Code

This is based on [cocoapi (pycocotools)](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI), and is quite simple:
@@ -204,10 +261,15 @@ optional arguments:
--xyxy2xywh transform box coords from xyxy to xywh
```
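
The repository's own snippet and CLI are collapsed in this view; as a rough, generic illustration of the pycocotools flow it builds on (not the repository's exact code), a sketch with placeholder file names could look like this:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# placeholder paths: adapt to your JSON_ANNO_PATH and prediction file
gt = COCO("d3_annotations.json")   # ground-truth annotations (COCO-style JSON)
dt = gt.loadRes("pred.json")       # predictions saved in the output format above

coco_eval = COCOeval(gt, dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()              # prints the AP/AR summary
```

The repository's script presumably wraps a similar flow (it is described as based on pycocotools) and adds options such as `--xyxy2xywh`; refer to it for the exact usage.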

## Evaluation Examples on SOTA Methods

See [this directory](eval_sota/) for details. More scripts for evaluating popular SOTA OVD/REC/other methods on $D^3$ will be added later.
### Evaluation Examples on SOTA Methods

See [this directory](eval_sota/) for details. We include the evaluation scripts of some methods there.



## Dataset Statistics

[A Python script](scripts/get_d3_stat.py) is provided for calculating the statistics of $D^3$ and visualizing figures such as histograms, word clouds, etc.

The specific statistics of the dataset are available in Section 3.3 of the DOD paper.
27 changes: 27 additions & 0 deletions eval_sota/README.md
@@ -0,0 +1,27 @@
# Evaluating SOTA Methods on $D^3$

## Leaderboard

In this directory, we keep the scripts or GitHub links (official or custom) for evaluating SOTA methods (REC/OVD/DOD/MLLM) on $D^3$:

| Name | Paper | Original Tasks | Training Data | Evaluation Code | Intra-FULL/PRES/ABS/Inter-FULL/PRES/ABS | Source | Note |
|:-----|:-----:|:----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| OFA-large | [OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022)](https://arxiv.org/abs/2202.03052) | REC | - | - | 4.2/4.1/4.6/0.1/0.1/0.1 | [DOD paper](https://arxiv.org/abs/2307.12813) | - |
| CORA-R50 | [CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching (CVPR 2023)](https://openaccess.thecvf.com/content/CVPR2023/papers/Wu_CORA_Adapting_CLIP_for_Open-Vocabulary_Detection_With_Region_Prompting_and_CVPR_2023_paper.pdf) | OVD | - | - | 6.2/6.7/5.0/2.0/2.2/1.3 | [DOD paper](https://arxiv.org/abs/2307.12813) | - |
| OWL-ViT-large | [Simple Open-Vocabulary Object Detection with Vision Transformers (ECCV 2022)](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136700714.pdf) | OVD | - | [DOD official](./owl_vit.py) | 9.6/10.7/6.4/2.5/2.9/2.1 | [DOD paper](https://arxiv.org/abs/2307.12813) | Post-processing hyper-parameters may affect the performance and the result may not exactly match the paper |
| SPHINX-7B | [SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (arxiv 2023)](https://arxiv.org/abs/2311.07575) | **MLLM** capable of REC | - | [DOD official](./sphinx.py) | 10.6/11.4/7.9/-/-/- | DOD authors | A lot of contribution from [Jie Li](https://github.com/theFool32) |
| GLIP-T | [Grounded Language-Image Pre-training (CVPR 2022)](https://arxiv.org/abs/2112.03857) | OVD & PG | - | - | 19.1/18.3/21.5/-/-/- | GEN paper | - |
| UNINEXT-huge | [Universal Instance Perception as Object Discovery and Retrieval (CVPR 2023)](https://arxiv.org/abs/2303.06674v2) | OVD & REC | - | [DOD official](https://github.com/Charles-Xie/UNINEXT_D3) | 20.0/20.6/18.1/3.3/3.9/1.6 | [DOD paper](https://arxiv.org/abs/2307.12813) | - |
| Grounding-DINO-base | [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (arxiv 2023)](https://arxiv.org/abs/2303.05499) | OVD & REC | - | [DOD official](./groundingdino.py) | 20.7/20.1/22.5/2.7/2.4/3.5 | [DOD paper](https://arxiv.org/abs/2307.12813) | Post-processing hyper-parameters may affect the performance and the result may not exactly match the paper |
| OFA-DOD-base | [Described Object Detection: Liberating Object Detection with Flexible Expressions (NeurIPS 2023)](https://arxiv.org/abs/2307.12813) | DOD | - | - | 21.6/23.7/15.4/5.7/6.9/2.3 | [DOD paper](https://arxiv.org/abs/2307.12813) | - |
| FIBER-B | [Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (NeurIPS 2022)](https://arxiv.org/abs/2206.07643) | OVD & REC | - | - | 22.7/21.5/26.0/-/-/- | GEN paper | - |
| MM-Grounding-DINO | [An Open and Comprehensive Pipeline for Unified Object Grounding and Detection (arxiv 2024)](https://arxiv.org/abs/2401.02361) | DOD & OVD & REC | O365, GoldG, GRIT, V3Det | [MM-GDINO official](https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino#zero-shot-description-detection-datasetdod) | 22.9/21.9/26.0/-/-/- | MM-GDINO paper | - |
| GEN (FIBER-B) | [Generating Enhanced Negatives for Training Language-Based Object Detectors (arxiv 2024)](https://arxiv.org/abs/2401.00094) | DOD | - | - | 26.0/25.2/28.1/-/-/- | GEN paper | Enhancement based on FIBER-B |
| APE-large (D) | [Aligning and Prompting Everything All at Once for Universal Visual Perception (arxiv 2023)](https://arxiv.org/abs/2312.02153) | DOD & OVD & REC | COCO, LVIS, O365, OpenImages, Visual Genome, RefCOCO/+/g, SA-1B, GQA, PhraseCut, Flickr30k | [APE official](https://github.com/shenyunhang/APE) | 37.5/38.8/33.9/21.0/22.0/17.9 | APE paper | Extra training data helps for this amazing performance |


Some extra notes:
- Each method is currently represented by *the variant with the highest performance* in this table if multiple variants are available, so this is only a leaderboard, not meant for a fair comparison.
- Methods like GLIP, FIBER, etc. are not actually evaluated on OVD benchmarks. For zero-shot evaluation on DOD, we currently do not distinguish between methods built for OVD benchmarks and methods for ZS-OD, as long as they are verified to have open-set detection capability.

For other variants (e.g. for a fair comparison regarding data, backbone, etc.), please refer to the papers.