Learning to defer (L2D) aims to improve human-AI collaboration systems by deferring decisions to humans when they are more likely to make the correct judgment than an ML classifier. Existing research in L2D overlooks key aspects of real-world systems that impede its practical adoption, such as: i) neglecting cost-sensitive scenarios, where type 1 and type 2 errors carry different costs; ii) requiring concurrent human predictions for every training instance; and iii) not dealing with human work-capacity constraints. To address these issues, we propose the Deferral under Cost and Capacity Constraints Framework (DeCCaF) - a novel L2D approach that employs supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and uses constraint programming to globally minimize the error cost subject to workload constraints. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, each subject to individual work-capacity constraints. We demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average reduction in misclassification cost of 8.4%.
In order to ensure complete reproducibility, we provide users with:
- Code used to run the experiments.
- Datasets, models and results used/produced in our experiments:
  - Synthetically generated data - expert predictions, training scenarios and capacity constraints.
  - ML models - Alert Model, OvA Classifiers and Human Expertise Model.
  - Results - the set of assignments and decisions resulting from the deferral experiments.
Note: this data is included because LightGBM models are known to produce different results depending on the operating system, Python version, and number of cores used in training, among other factors.
The submitted version of the paper and the appendix are available here.
Requirements:
- miniforge3
Before using any of the provided code, to ensure reproducibility, please create and activate the Python environment by running:

```
conda env create -f environment.yml
conda activate deccaf-env
```
To replicate the generation of the synthetic data, as well as our experiments, please execute the following steps:
Attention: Run each Python script from inside the folder where it is located, so that the relative paths within each script resolve correctly.
After cloning the repo, please extract the "Datasets, models and results" file inside the repo's folder, ensuring that your directory structure looks like the tree shown below.
Note that, during the following steps, the training of models and the generation of results will be skipped if the output files already exist within the Data_and_models folder. This was done to allow complete reproducibility and inspection of the source files used in the paper. If you wish to run the experiments from scratch, you will have to delete every folder and file within the Data_and_models directory, except for the following (a sketch of one way to do this is shown after the note below):
- Data_and_models/data/Base.csv: the raw version of the BAF dataset.
- Data_and_models/experts/: the synthetic expert predictions used in the paper.
- Data_and_models/testbed/: the expert capacity constraints and batches used in the paper.
The code for the expert and testbed generation is not made available, as the expert simulation framework was submitted as a contribution to a different venue that focuses on synthetic data generation.
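As a convenience, the snippet below sketches one way to clear previously generated outputs while preserving the three items above. It is not part of the original instructions; the `find` filter is an assumption about your local layout, so review the listed files before changing `-print` to `-delete`.

```
# Illustrative only: list everything under Data_and_models except the raw
# dataset, the expert predictions and the testbed. Delete the listed items
# by hand, or replace -print with -delete after checking the output.
cd Data_and_models
find . -mindepth 1 \
    -not -path './data' -not -path './data/Base.csv' \
    -not -path './experts' -not -path './experts/*' \
    -not -path './testbed' -not -path './testbed/*' \
    -print
```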
```
deccaf
│   README.md
│   .gitignore
│   environment.yml
│
└───Code
│   │   ...
│
└───Data_and_models
    │   ...
```
First, activate the Python environment with the necessary dependencies, as described in the Requirements section above.
To train the Alert Model, run Code/alert_model/training_and_predicting.py, which trains the model and scores all instances in months 4-8 of the BAF dataset.
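Following the earlier note about running each script from within its own folder, this step could look like the sketch below (it assumes you start at the repo root; adjust paths if your checkout differs):

```
cd Code/alert_model
python training_and_predicting.py   # trains the Alert Model and scores months 4-8
cd ../..
```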
Then, run Code/data/preprocess.py to create the dataset of 30K alerts raised in months 4-8. This is the set of instances used in all following steps.
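A possible invocation for this step, under the same assumption of starting at the repo root:

```
cd Code/data
python preprocess.py   # builds the 30K-alert dataset used by all later steps
cd ../..
```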
As both DeCCaF and OvA share the classifier h, we train it before the expert models, by running the script Code/classifier_h/training.py; its performance is used as the reference for generating experts with a similar misclassification cost.
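Similarly, a sketch for this step:

```
cd Code/classifier_h
python training.py   # trains classifier h, shared by DeCCaF and OvA
cd ../..
```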
To train the DeCCaF system, run the script Code/expert_models/run_deccaf.py; to train the OvA system, run the script Code/expert_models/run_ova.py.
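One possible way to run both training scripts, again from the repo root:

```
cd Code/expert_models
python run_deccaf.py   # trains the DeCCaF system
python run_ova.py      # trains the OvA system
cd ../..
```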
To reproduce the deferral testing, run the script Code/deferral/run_alert.py. The results can then be evaluated with the notebook Code/deferral/process_results.ipynb.
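And a sketch for the deferral run itself (evaluation of the results via the notebook is covered below):

```
cd Code/deferral
python run_alert.py   # reproduces the deferral testing
cd ../..
```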
We include notebooks to facilitate analysis of:
- Synthetic Experts' Decisions
- ML Model, Human Expertise Model and OvA Classifiers
- Deferral Results
We also facilitate further analysis of our generated experts and the conducted benchmarks by providing users with two Jupyter notebooks (see the launch example after this list):
- Code/deferral/process_results.ipynb - which contains:
  - the evaluation of the deferral performance of all considered L2D baselines;
  - the evaluation of the performance and calibration of Classifier h, the OvA Classifiers, and DeCCaF's team correctness prediction models.
- Code/synthetic_experts/expert_analysis.ipynb - which contains the evaluation of the properties of the synthetic experts' decision-making processes.
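One way to open them, assuming Jupyter is available inside the deccaf-env environment (an assumption; the environment file may or may not include it):

```
# Open whichever notebook you want to inspect (each command starts a Jupyter server)
jupyter notebook Code/deferral/process_results.ipynb
jupyter notebook Code/synthetic_experts/expert_analysis.ipynb
```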