Summer 2021 project with Prof. Celia Greenwood and Amadou Barry
This project applies the REPEN method (a machine learning method for outlier detection) to the ABIDE neuroimaging dataset.
-
(directory) data: contains the datasets and their lower-dimensional representations
-
(file) abide.csv: must be downloaded separately from http://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html
-
(file) AID362.csv: can be downloaded in .arff format from https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets/blob/main/categorical%20data/AID362red_train_allpossiblenominal.arff and must then be converted to .csv
-
(file) census.csv: can be downloaded as a .tar.xz archive from https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets/blob/main/numerical%20data/DevNet%20datasets/census-income-full-mixed-binarized.tar.xz
-
(subdirectory) representation: contains the lower-dimensional representation generated by REPEN
-
-
(directory) logs: where the output and error logs are stored
-
(directory) model: where the model checkpoints are saved in .h5 format during training
-
(directory) outlierscores: contains the final outlier scores, obtained by applying random distance-based outlier detection to the lower-dimensional representation
-
(directory) results: contains some measures of performance and plots
-
(file) abide_loss.png: an image of the loss function (same for other datasets)
-
(file) AUC_abide.png: an image of the ROC curve with the AUC (same for other datasets)
-
(file) auc_performance.csv: contains information about the AUC for different runs
-
(file) barplot_scores_abide.png: a plot of the outlier scores with labelled outliers (same for other datasets)
-
-
(file) REPEN.py: contains the main method that implements REPEN, and the main functions involved in the training
-
(file) plots_scores.py: contains functions to generate bar plots of the outlier scores, imported by REPEN.py
-
(file) utilities.py: contains helper functions used in REPEN.py
-
(file) submit.sh: a submit script for SBATCH that creates a virtual environment with the necessary packages and calls REPEN.py
-
(file) outlier_example.R: a script that generates a boxplot for a toy example illustrating the impact of outliers
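The outlierscores entry above mentions random distance-based outlier detection on the learned representation. The following is a minimal sketch of one common form of that idea (average nearest-neighbour distance to small random subsamples), not the code in REPEN.py. The function name `random_distance_scores` and the parameters `n_ensembles` and `subsample_size` are assumptions made for illustration.

```python
import numpy as np


def random_distance_scores(X, n_ensembles=50, subsample_size=8, seed=0):
    """Average nearest-neighbour distance from each row of X to random subsamples.

    Hypothetical helper for illustration; larger scores suggest outliers.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    size = min(subsample_size, n)
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        # Draw a small random subsample of the data without replacement.
        idx = rng.choice(n, size=size, replace=False)
        # Euclidean distance from every point to every subsample member.
        diffs = X[:, None, :] - X[idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # A subsample member must not count itself as its nearest neighbour.
        dists[idx, np.arange(size)] = np.inf
        scores += dists.min(axis=1)
    return scores / n_ensembles
```

In this sketch, `X` would be the lower-dimensional representation (one row per sample), and the returned array holds one score per row.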
- Download the datasets and place them in the "data" folder. (The AID362 dataset must first be converted from .arff to .csv format.)
- In REPEN.py, set the "filename" variable to the name of the dataset to use ("census", "AID362", or "abide").
- From the command line, run "submit.sh".
- Check the results and plots in the "results" directory.
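The first step above requires converting the AID362 file from .arff to .csv. One possible way to do this with only the Python standard library is sketched below. It assumes a dense ARFF file whose attribute names contain no spaces, and the function name `arff_to_csv` is hypothetical (it is not part of this repository).

```python
import csv


def arff_to_csv(arff_path, csv_path):
    """Convert a dense ARFF file to CSV, using attribute names as the header.

    Hypothetical helper: does not handle sparse ARFF or attribute names
    that contain spaces. Returns the number of data rows written.
    """
    header = []
    rows = []
    in_data = False
    with open(arff_path, newline="") as src:
        for line in src:
            stripped = line.strip()
            # Skip blank lines and ARFF comments.
            if not stripped or stripped.startswith("%"):
                continue
            if in_data:
                rows.append(stripped)
            elif stripped.lower().startswith("@attribute"):
                # The attribute name is the second whitespace-separated token.
                header.append(stripped.split()[1].strip("'\""))
            elif stripped.lower().startswith("@data"):
                in_data = True
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        # ARFF data rows are comma-separated; values may be single-quoted.
        for row in csv.reader(rows, quotechar="'"):
            writer.writerow([value.strip() for value in row])
    return len(rows)
```

For example, `arff_to_csv("data/AID362red_train_allpossiblenominal.arff", "data/AID362.csv")` would write the converted file, assuming the download was saved under that name in the "data" folder.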
Pang, G., Cao, L., Chen, L., & Liu, H. (2018, July).
Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2041-2050).