MyriamLizotte/outlier-detection-using-ML

Outlier Detection in High Dimension with Machine Learning

Summer 2021 Project with Prof Celia Greenwood and Amadou Barry

Goal

Apply the REPEN method (a machine learning method for outlier detection) to the ABIDE dataset of neuroimaging data.

File Structure

  • (directory) data: contains the datasets and their lower-dimensional representations

  • (directory) logs: where the output and error logs are stored

  • (directory) model: where the model checkpoints are saved in .h5 format during training

  • (directory) outlierscores: contains the final outlier scores produced by applying random distance-based outlier detection to the lower-dimensional representation

  • (directory) results: contains some measures of performance and plots

    • (file) abide_loss.png: an image of the loss function (same for other datasets)

    • (file) AUC_abide.png: an image of the ROC curve with the AUC (same for other datasets)

    • (file) auc_performance.csv: contains information about the AUC for different runs

    • (file) barplot_scores_abide.png: a plot of the outlier scores with labelled outliers (same for other datasets)

  • (file) REPEN.py: contains the main method that implements REPEN, and the main functions involved in the training

  • (file) plots_scores.py: contains functions to generate bar plots of the outlier scores, imported by REPEN.py

  • (file) utilities.py: contains helper functions used in REPEN.py

  • (file) submit.sh: a submission script for SBATCH that calls REPEN.py and creates a virtual environment with the necessary packages

  • (file) outlier_example.R: a script that generates a boxplot for a toy example illustrating the impact of outliers
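
The scores stored in "outlierscores" come from random distance-based outlier detection applied to the learned representation. The sketch below illustrates that general idea and is not the code in REPEN.py: each point is scored by its nearest-neighbour distance to a small random subsample, averaged over several subsamples. The function name `random_distance_scores`, its parameters, and the choice to mask a sampled point's distance to itself are assumptions made for illustration.

```python
import numpy as np


def random_distance_scores(X, subsample_size=8, n_ensembles=50, seed=0):
    """Average nearest-neighbour distance to random subsamples of X.

    Hypothetical helper: higher scores mean a point lies farther from
    the sampled data, i.e. it looks more like an outlier.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two points to compute distances")
    rng = np.random.default_rng(seed)
    # Use at least two sampled points so a sampled point always has a neighbour.
    size = min(max(subsample_size, 2), n)
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        idx = rng.choice(n, size=size, replace=False)
        # Euclidean distance from every point to every sampled point.
        diffs = X[:, None, :] - X[idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # Assumption: ignore the zero distance of a sampled point to itself.
        dists[idx, np.arange(size)] = np.inf
        scores += dists.min(axis=1)
    return scores / n_ensembles
```

Calling `random_distance_scores(Z)` on an array `Z` with one row per subject returns one score per row; sorting the scores in descending order lists the most outlying rows first.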

How to run

  1. Download the datasets and place them in the "data" folder. (The AID362 dataset must first be converted from .arff to .csv format.)
  2. In REPEN.py, change the "filename" variable to the name of the dataset ("census", "AID362", or "abide").
  3. On the command line, run the file "submit.sh".
  4. Check the results and plots in the "results" directory.
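
Step 1 requires converting the AID362 dataset from .arff to .csv. The repository does not say how this was done, so the following is only a minimal sketch using the Python standard library. It assumes a dense ARFF file with unquoted attribute names and no commas inside values; the function name `arff_to_csv` is hypothetical.

```python
import csv
import io


def arff_to_csv(arff_text):
    """Convert dense ARFF text to CSV text with a header row.

    Hypothetical helper: handles only simple files (no sparse rows,
    no quoted attribute names containing spaces).
    """
    names = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1].strip("'\""))
            elif lower.startswith("@data"):
                in_data = True
        else:
            # Each data line is a comma-separated record
            rows.append(next(csv.reader([line], skipinitialspace=True)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()
```

To convert a file, read it into a string, pass it to `arff_to_csv`, and write the result to a .csv file in the "data" folder. The column layout that REPEN.py expects is not documented here, so check the output against the other datasets before training.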

References

Pang, G., Cao, L., Chen, L., & Liu, H. (2018, July).
Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2041-2050).
