Summer 2021 Project with Prof Celia Greenwood and Amadou Barry
Apply the REPEN method (a machine learning method for outlier detection) to the ABIDE dataset of neuroimaging data.
(directory) data: contains the datasets and their lower-dimensional representations
(file) abide.csv: must be downloaded separately from
(file) AID362.csv : can be downloaded from
(file) census.csv: can be downloaded from
(subdirectory) representation: contains the lower-dimensional representations generated by REPEN
(directory) logs: where the output and error logs are stored
(directory) model: where the model checkpoints are saved in .h5 format during training
(directory) outlierscores: contains the final outlier scores, obtained by applying random distance-based outlier detection to the lower-dimensional representation
(directory) results: contains some measures of performance and plots
(file) abide_loss.png: an image of the loss function (same for other datasets)
(file) AUC_abide.png: an image of the ROC curve with the AUC (same for other datasets)
(file) auc_performance.csv: contains information about the AUC for different runs
(file) barplot_scores_abide.png: a bar plot of the outlier scores with labelled outliers (same for other datasets)
(file) contains the main method that implements REPEN, and the main functions involved in the training
(file) contains functions to generate bar plots of the outlier scores, imported by
(file) contains helper functions used in
(file) a submit script for SBATCH that calls and creates a virtual environment with the necessary packages.
(file) outlier_example.R: a script that generates a boxplot for a toy example illustrating the impact of outliers
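For orientation, the scoring step behind the "outlierscores" directory and the AUC figures in "results" can be sketched as follows. This is a minimal illustration, not the project's code: it assumes NumPy, scores each point by its nearest-neighbour distance to a small random subsample (averaged over several subsamples), and computes the AUC by pairwise comparison. The function names `random_distance_scores` and `auc`, and the parameters `n_members`, `subsample_size` and `seed`, are hypothetical.

```python
import numpy as np


def random_distance_scores(X, n_members=50, subsample_size=8, seed=0):
    """Average nearest-neighbour distance to random subsamples (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    scores = np.zeros(n)
    for _ in range(n_members):
        # Draw a small random subsample of rows without replacement.
        idx = rng.choice(n, size=min(subsample_size, n), replace=False)
        # Euclidean distances from every point to every subsample point: shape (n, subsample).
        diffs = X[:, None, :] - X[idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # A point that is itself in the subsample gets distance 0 for this member.
        scores += dists.min(axis=1)
    return scores / n_members


def auc(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)  # True marks an outlier
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Here `X` stands for one of the lower-dimensional representations (one row per sample) and `labels` for the known outlier labels; larger scores indicate more outlying points.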
- Download the datasets and place them in the "data" directory. (The AID362 dataset must first be converted from .arff to .csv format.)
- In, change the "filename" variable to the name of the dataset ("census", "AID362", or "abide").
- On the command line, run the file "".
- Check the results and plots in the "results" directory.
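The .arff to .csv conversion mentioned in the first step can be sketched with the Python standard library alone. This is a minimal sketch, not the conversion used in the project: it assumes a dense (non-sparse) ARFF file whose attribute names contain no spaces, and the function name `arff_to_csv` is hypothetical.

```python
import csv


def arff_to_csv(arff_path, csv_path):
    """Convert a dense ARFF file to CSV and return the list of column names."""
    columns = []
    rows = []
    in_data = False
    with open(arff_path, newline="") as src:
        for raw in src:
            line = raw.strip()
            # Skip blank lines and ARFF comments (lines starting with %).
            if not line or line.startswith("%"):
                continue
            if not in_data:
                lower = line.lower()
                if lower.startswith("@attribute"):
                    # "@attribute <name> <type>": keep only the name, unquoted.
                    columns.append(line.split(None, 2)[1].strip("'\""))
                elif lower.startswith("@data"):
                    in_data = True
                continue
            # After @data, each line is one comma-separated record.
            rows.append(next(csv.reader([line], skipinitialspace=True)))
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)
        writer.writerows(rows)
    return columns
```

A call such as `arff_to_csv("data/AID362.arff", "data/AID362.csv")` would then produce the CSV file; the input file name here is an assumption about how the downloaded file is named.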
Pang, G., Cao, L., Chen, L., & Liu, H. (2018, July).
Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2041-2050).