Outlier Detection in High Dimension with Machine Learning

Summer 2021 project with Prof. Celia Greenwood and Amadou Barry

Goal

Apply the REPEN method (a machine learning method for outlier detection) to the ABIDE dataset of neuroimaging data.

File Structure

(directory) data: contains the datasets and their lower-dimensional representations
- (file) abide.csv: must be downloaded separately from http://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html
- (file) AID362.csv: can be downloaded from https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets/blob/main/categorical%20data/AID362red_train_allpossiblenominal.arff
- (file) census.csv: can be downloaded from https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets/blob/main/numerical%20data/DevNet%20datasets/census-income-full-mixed-binarized.tar.xz
- (subdirectory) representation: contains the lower-dimensional representations generated by REPEN
(directory) logs: where the output and error logs are stored
(directory) model: where the model checkpoints are saved in .h5 format during training
(directory) outlierscores: contains the final outlier scores, obtained by applying random distance-based outlier detection to the lower-dimensional representation
(directory) results: contains some measures of performance and plots
- (file) abide_loss.png: an image of the loss function (same for other datasets)
- (file) AUC_abide.png: an image of the ROC curve with the AUC (same for other datasets)
- (file) auc_performance.csv: contains information about the AUC for different runs
- (file) barplot_scores_abide.png: a plot of the outlier scores with labelled outliers (same for other datasets)
(file) REPEN.py: contains the main method that implements REPEN, and the main functions involved in the training
(file) plot_scores.py: contains functions to generate bar plots of the outlier scores, imported by REPEN.py
(file) utilities.py: contains helper functions used in REPEN.py
(file) submit.sh: a submit script for SBATCH that calls REPEN.py and creates a virtual environment with the necessary packages.
(file) outlier_example.R: a script to generate a boxplot for a toy example of the impact of outliers
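
To make the "outlierscores" and "results" entries more concrete, here is a minimal sketch of one random distance-based detector (the average nearest-neighbour distance to small random subsamples) together with a rank-based AUC. It is an illustration under stated assumptions, not the code in REPEN.py: the function names, the subsample size, the number of ensemble members, and the synthetic data are all hypothetical.

```python
import numpy as np


def random_nn_scores(X, subsample_size=8, n_members=50, seed=0):
    """Average distance from each row of X to its nearest neighbour in
    small random subsamples (a larger score means more outlying).

    Hypothetical helper; assumes subsample_size >= 2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.zeros(n)
    for _ in range(n_members):
        idx = rng.choice(n, size=min(subsample_size, n), replace=False)
        # Euclidean distances between every point and the subsample.
        dists = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        # A point drawn into the subsample must not match itself.
        dists[idx, np.arange(len(idx))] = np.inf
        scores += dists.min(axis=1)
    return scores / n_members


def roc_auc(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs in which the outlier
    has the higher score; ties count as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


# Synthetic stand-in for a low-dimensional representation (not real data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),   # inliers
               rng.normal(8.0, 1.0, size=(5, 5))])    # outliers
y = np.r_[np.zeros(200), np.ones(5)]
print(f"AUC on synthetic data: {roc_auc(random_nn_scores(X), y):.3f}")
```

Averaging over many small subsamples keeps the cost linear in the number of points, which is why this family of detectors is usually paired with a learned low-dimensional representation.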

How to run

Download the datasets and place them in the "data" folder. (The AID362 dataset must first be converted from .arff to .csv format.)
In REPEN.py, set the "filename" variable to the name of the dataset ("census", "AID362", or "abide").
From the command line, run the file "submit.sh".
Check the results and plots in the "results" directory.
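
The .arff to .csv conversion mentioned in the first step can be sketched with the Python standard library alone. This is a hedged example, not part of the repository: arff_to_csv is a hypothetical helper, and it assumes a dense ARFF file whose attribute names contain no spaces. Whether REPEN.py expects a header row or a particular column order is not stated here, so check the loading code in REPEN.py before relying on the output.

```python
import csv


def arff_to_csv(arff_path, csv_path):
    """Convert a dense ARFF file to CSV, using attribute names as the header.

    Hypothetical helper: sparse ARFF rows and attribute names containing
    spaces are not handled. Returns the number of data rows written.
    """
    header, rows, in_data = [], [], False
    with open(arff_path, newline="") as src:
        for line in src:
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip blank lines and ARFF comments
            if in_data:
                # Data rows are comma-separated; ARFF quotes with '.
                rows.append(next(csv.reader([line], quotechar="'",
                                            skipinitialspace=True)))
            elif line.lower().startswith("@attribute"):
                # "@attribute name type": keep the name without quotes.
                header.append(line.split(None, 2)[1].strip("'\""))
            elif line.lower().startswith("@data"):
                in_data = True
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

For the AID362 file linked in the File Structure section, a call would look like arff_to_csv("AID362red_train_allpossiblenominal.arff", "data/AID362.csv").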

References

Pang, G., Cao, L., Chen, L., & Liu, H. (2018, July).
Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2041-2050).

MyriamLizotte/outlier-detection-using-ML

Folders and files

Latest commit

History

Repository files navigation

Outlier Detection in High Dimension with Machine Learning

Goal

File Structure

How to run

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages