Benchmarking label error detection algorithms for multi-label classification

Code to reproduce results from the paper:

Identifying Incorrect Annotations in Multi-Label Classification Data

This package is a DVC project that uses various datasets to evaluate different label quality scores for detecting annotation errors in multi-label classification. This repository is intended only for scientific/benchmarking purposes. To find label issues in your own multi-label data, you should instead use the implementation from the official cleanlab library.
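For reference, a minimal sketch of how the official library can be applied to your own data (assuming a recent cleanlab version where cleanlab.multilabel_classification.get_label_quality_scores is available; check the cleanlab documentation for the current API):

import numpy as np
from cleanlab.multilabel_classification import get_label_quality_scores

# labels: for each example, the list of class indices annotated as present
labels = [[0, 2], [1], [0, 1, 3], [2]]

# pred_probs: out-of-sample predicted probability that each class is present,
# shape (num_examples, num_classes), e.g. obtained via cross-validation
pred_probs = np.array([
    [0.9, 0.1, 0.8, 0.2],
    [0.2, 0.7, 0.1, 0.1],
    [0.8, 0.6, 0.1, 0.9],
    [0.1, 0.1, 0.2, 0.1],
])

# One quality score per example (lower score => more likely mislabeled)
scores = get_label_quality_scores(labels, pred_probs)
ranked = np.argsort(scores)  # examples ordered from most to least suspicious
print(ranked, scores[ranked])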

Instructions to get started

  1. Clone the repo.
  2a. [Optional] Open the repo in a devcontainer.
  2b. Install the requirements with:
pip install -r requirements.txt
  3. Run the pipeline with:
dvc repro
  • The pipeline has several stages:
$ dvc dag
              +--------------+
              | make_dataset |
              +--------------+
               ***          ***
              *                *
            **                  **
+------------------+          +-------+
| get_avg_accuracy |          | train |
+------------------+          +-------+
          *                        *
          *                        *
          *                        *
  +-------------+         +---------------+
  | group_stats |         | score_classes |
  +-------------+         +---------------+
                                   *
                                   *
                                   *
                            +-----------+
                            | aggregate |
                            +-----------+
                                   *
                                   *
                                   *
                           +--------------+
                           | rank_metrics |
                           +--------------+
                                   *
                                   *
                                   *
                           +--------------+
                           | plot_metrics |
                           +--------------+
+----------------+
| plot_avg_trace |
+----------------+

A description of each stage is given below.

$ dvc stage list
make_dataset      Create groups of datasets of different sizes & number of classes.
train             Train models and get out-of-sample predicted probabilities on the training sets.
get_avg_accuracy  Get model performance metrics on test sets, with and without label errors.
group_stats       Summarize model performance metrics for each group of datasets.
score_classes     Compute class label quality scores for each example in a dataset.
aggregate         Aggregate class label quality scores for all classes into a single score.
rank_metrics      Compute label error detection metrics for aggregated scores.
plot_metrics      Plot the label error detection and ranking metrics for the aggregated scores.
plot_avg_trace    Plot average traces of noise matrices used for noisy label generation.
  • The group_stats stage outputs two files in data/accuracies/:

    • results_group.csv: All experimental results
    • results_agg.json: Overall stats for the different aggregator methods.
  • The stages have various output files and directories; these are best viewed with dvc dag -o. Ignoring most of the intermediate files, the most relevant files are listed below (a short loading sketch follows the numbered steps):

    • data/accuracy/results_group.csv: Statistics of model performance metrics for each group of datasets.
    • data/scores/results.csv: Class label quality scores for each example in each dataset.
    • data/scores/metrics.csv: Statistics of label error detection and ranking metrics for each group of datasets.
  4. Inspect the synthetic datasets in the notebooks/inspect_generated_data.ipynb notebook.
  5. Inspect the results in the notebooks/inspect_score_results.ipynb notebook.
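To take a quick look at these outputs outside of the notebooks, a short pandas sketch like the following can be used (the column layout is determined by the pipeline outputs and may differ from what you expect):

import pandas as pd

# Aggregated per-example label quality scores (one row per example per dataset)
scores = pd.read_csv("data/scores/results.csv")

# Label error detection / ranking metrics for each group of datasets
metrics = pd.read_csv("data/scores/metrics.csv")

print(scores.head())
print(metrics.head())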

Aggregation methods to pool per-class annotation scores into an overall label quality score for each example

Along with the typical np.mean, np.median, np.min, and np.max aggregators, we also implement several pooling methods found in src/evaluation/aggregate.py (a sketch of softmin pooling follows this list):

  • softmin_pooling
  • log_transform_pooling
  • cumulative_average
  • simple_moving_average
  • exponential_moving_average
  • weighted_cumulative_average
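For illustration, here is a minimal sketch of softmin pooling; the exact formulation is an assumption, and src/evaluation/aggregate.py is the reference implementation. The idea is to average the per-class scores weighted by a softmax of their negatives, so the lowest (most suspicious) class scores dominate the pooled value.

import numpy as np

def softmin_pooling(class_scores: np.ndarray, temperature: float = 0.1) -> float:
    """Pool per-class label quality scores into one score per example.

    Weights each class score by softmax(-score / temperature), so low
    (suspicious) class scores contribute most; as temperature -> 0 this
    approaches np.min, as temperature -> inf it approaches np.mean.
    """
    s = np.asarray(class_scores, dtype=float)
    w = np.exp(-s / temperature)
    w /= w.sum()
    return float(np.dot(w, s))

# Example: one example's per-class annotation scores
print(softmin_pooling(np.array([0.9, 0.95, 0.2, 0.85])))  # dominated by the 0.2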

CelebA analysis

See the Examples Notebooks in our examples repository for:

  • the PyTorch code we used to train a multi-label classifier model on CelebA
  • the code to find mislabeled images in this dataset

The file data/celeba/celeba_label_errors.csv in this repository contains label quality scores for each image in the CelebA dataset, plus a boolean is_issue column indicating which images cleanlab identified as having a label issue.
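For example, to list the flagged CelebA images (a sketch; only the is_issue column named above is assumed):

import pandas as pd

df = pd.read_csv("data/celeba/celeba_label_errors.csv")

# Images cleanlab flagged as having a label issue
flagged = df[df["is_issue"]]
print(f"{len(flagged)} of {len(df)} images flagged as likely mislabeled")
print(flagged.head())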
