Dataset Preparation

Data format

The data directory is constructed as follows:

.
├── data
│   ├── features
│   │   └── xxx.bin
│   ├── labels
│   │   └── xxx.meta
│   ├── knns
│   │   └── ...
  • features currently supports binary files only. (Support for np.save files is planned.)
  • labels supports plain-text files in which each line is the label of the corresponding feature in the feature file.
  • knns is optional, since it can be built with the provided functions.
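
As a rough illustration of the two required inputs, the sketch below reads a binary feature file and a plain-text label file. It assumes the feature file is a headerless, row-major dump of float32 values and that the feature dimension is known to the caller; neither detail is stated here, and read_features and read_labels are hypothetical helper names, not functions from this repository.

```python
import numpy as np


def read_features(path, feature_dim, dtype=np.float32):
    """Read a flat binary feature file into an (inst_num, feature_dim) array.

    Assumption: the file holds raw `dtype` values in row-major order with
    no header, so the caller has to supply `feature_dim`.
    """
    flat = np.fromfile(path, dtype=dtype)
    if flat.size % feature_dim != 0:
        raise ValueError(
            f"{path}: {flat.size} values is not a multiple of "
            f"feature_dim={feature_dim}"
        )
    return flat.reshape(-1, feature_dim)


def read_labels(path):
    """Read a plain-text label file with one integer label per line."""
    with open(path) as f:
        # Blank lines are skipped; int() tolerates the trailing newline.
        return np.array([int(line) for line in f if line.strip()],
                        dtype=np.int64)
```

With these helpers, row i of the array returned by read_features would pair with entry i of the array returned by read_labels.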

Take MS-Celeb-1M (Part0 and Part1) as an example. The data directory is as follows:

data
  ├── features
    ├── part0_train.bin                 # acbbc780948e7bfaaee093ef9fce2ccb
    ├── part1_test.bin                  # ced42d80046d75ead82ae5c2cdfba621
  ├── labels
    ├── part0_train.meta                # class_num=8573, inst_num=576494
    ├── part1_test.meta                 # class_num=8573, inst_num=584013
  ├── knns
    ├── part0_train/faiss_k_80.npz      # 5e4f6c06daf8d29c9b940a851f28a925
    ├── part1_test/faiss_k_80.npz       # d4a7f95b09f80b0167d893f2ca0f5be5
  ├── pretrained_models
    ├── pretrained_gcn_d_ms1m.pth       # 213598e70ddbc50f5e3661a6191a8be1
    ├── pretrained_gcn_s_ms1m.pth       # 3251d6e7d4f9178f504b02d8238726f7
    ├── pretrained_gcn_d_iop_ms1m.pth   # 314fba47b5156dcc91383ad611d5bd96
    ├── pretrained_gcn_v_ms1m.pth       # 020236d4e8dbff975360f08cb47109c0
    ├── pretrained_gcn_e_ms1m.pth       # 315ff08f28f14bc494dd36158c11e900
    ├── pretrained_lgcn_ms1m.pth        # 97fc6e52d1b5e907eabeb01e7b0825f9
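
The 32-character hexadecimal strings in the listing look like MD5 checksums, although that is not stated explicitly. Under that assumption, a downloaded file could be checked along these lines; md5sum and verify are hypothetical helper names, not part of this repository.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest matches `expected_hex`."""
    return md5sum(path) == expected_hex.strip().lower()
```

For instance, verify("data/features/part0_train.bin", "acbbc780948e7bfaaee093ef9fce2ccb") would return True only if the file matches the value listed next to it, assuming that value is indeed an MD5 digest.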

To experiment with a custom dataset, you need to provide extracted features and labels. For training, the number of features must equal the number of labels. For testing, the F-score is evaluated if labels are provided; otherwise, only clustering results are generated.
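
Before training on a custom dataset, the feature/label count requirement can be checked with a small script. The sketch below assumes a headerless feature file of fixed-size values (4 bytes each, as for float32) and a caller-supplied feature dimension; all function names are hypothetical. It also reports class_num and inst_num in the same sense as the comments in the listing above.

```python
import os


def count_features(feature_path, feature_dim, bytes_per_value=4):
    """Infer the number of feature vectors from the file size.

    Assumption: headerless file of fixed-size values (4 bytes = float32).
    """
    size = os.path.getsize(feature_path)
    row_bytes = feature_dim * bytes_per_value
    if size % row_bytes != 0:
        raise ValueError(
            f"{feature_path}: size {size} is not a multiple of {row_bytes}"
        )
    return size // row_bytes


def label_stats(label_path):
    """Return (class_num, inst_num) for a one-label-per-line file."""
    with open(label_path) as f:
        labels = [line.strip() for line in f if line.strip()]
    return len(set(labels)), len(labels)


def check_training_data(feature_path, label_path, feature_dim):
    """Raise ValueError if the feature count and label count differ."""
    n_feat = count_features(feature_path, feature_dim)
    class_num, inst_num = label_stats(label_path)
    if n_feat != inst_num:
        raise ValueError(f"{n_feat} features but {inst_num} labels")
    return class_num, inst_num
```

Running check_training_data on a feature/label pair either returns the two counts or fails early with a message naming the mismatch.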

Note that labels are required only for training the clustering model; they are not mandatory for clustering unlabeled data. There are two ways to cluster unlabeled data without a meta file: (1) Do not pass label_path in the config file. No loss or evaluation results will be generated. (2) Create a pseudo meta file, e.g., with all labels set to -1, and ignore the reported loss and evaluation results.
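
Option (2) can be sketched as follows, assuming the meta file is plain text with one label per line as described in the data format section. write_pseudo_labels is a hypothetical helper, not part of this repository, and inst_num should be the number of feature vectors to be clustered.

```python
def write_pseudo_labels(label_path, inst_num, placeholder=-1):
    """Write `inst_num` lines, each holding the same placeholder label."""
    if inst_num < 0:
        raise ValueError("inst_num must be non-negative")
    with open(label_path, "w") as f:
        # One placeholder per feature vector, one per line.
        f.write(f"{placeholder}\n" * inst_num)
```

Because every label in such a file is identical, any loss or evaluation figures computed against it are meaningless and should be disregarded.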

Supported datasets

The supported datasets are listed below.

You can download the datasets from the links above or with the script below:

python tools/download_data.py

Now, you can switch to README.md to train and test the model.