AI project for Flow Cytometry Analysis
This project is aimed at classifying flow cytometry data from 116 patients as either CLL (Chronic Lymphocytic Leukemia) or non-CLL using various Machine Learning methods. Data is provided by the James A. Haley VA in collaboration with Drs. Andrew Borkowski, Phil Foulis, and Loveleen Kang.
These steps have been tested on macOS 10.15 (Catalina). Make sure you have Python and pip installed. On a Mac, first install Homebrew and then run:
brew install python3
Also make sure you have virtualenv installed:
python3 -m pip install --user virtualenv
Next clone the repo:
git clone https://github.com/kangakum36/FlowAI.git
Make a new virtual environment and activate it:
cd FlowAI
python3 -m venv FlowAI
source FlowAI/bin/activate
Next, install the requirements:
pip3 install -r requirements.txt
Finally, you have to install libomp for xgboost to work. On a Mac, this is:
brew install libomp
Currently, the only way to get access to the data is by emailing me (kangakum [at] gmail [dot] com) for access to a GCP storage bucket. Although we have de-identified the data so that it contains no information about the patients, we are trying to be extremely careful in regulating access.
Once you have access, install the Google Cloud SDK by following the official installation instructions. Make sure to say yes when prompted to add gcloud to your PATH, as the data download script won't work otherwise.
Then run
gcloud init
Select the flowai project when prompted; you should already have access if you are at this step. Next, run:
gcloud components update
Finally, run the following commands (you must have gcloud added to your PATH):
chmod +x ./scripts/download_data.sh
mkdir data
cd scripts
./download_data.sh
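Since the download script depends on gcloud being on your PATH, it can help to confirm that before running it. The following is a minimal sketch using only the Python standard library; the helper name `find_tool` is made up for illustration and is not part of this repo.

```python
import shutil


def find_tool(name: str = "gcloud"):
    """Return the full path to `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        # gcloud was not added to PATH during the SDK install
        print("gcloud not found on PATH; re-run the SDK installer or update your shell profile.")
    else:
        print(f"gcloud found at {path}")
```

If the check reports that gcloud is missing, open a new terminal first, since PATH changes made by the installer usually only apply to new shells.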
To add a new model:
- In the Models folder, create a new file (e.g. <your_model_name>.py). Create a class for your model that inherits BaseModel.
- Override the train and test functions.
- In main.py, add an abbreviation for your model to the choices for valid abbreviations in the parse_args method.
- In the main method, there is a series of cascading if statements. Add your model to this series and do any necessary data processing and training.
To test a model on the data, run:
python3 -W ignore main.py -m [model tag]
You should see output similar to the following:
Selected model accuracy is: 0.875
Current model tags are: 'xgb': Gradient Boosted RF Classifier, 'dtc': Decision Tree Classifier, 'nn': Vanilla Neural Net, 'cnn': Convolutional NN
Once you have trained/tested a model once, the processed dataset will be placed in a pickle (.p) file in the data folder.
Using the -p flag on subsequent model tests will load the dataset from the saved file like so:
python3 -W ignore main.py -m [model tag] -p load
This way you don't have to wait for the data to be processed every time.
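The caching behaviour described above might look something like the sketch below. It is an assumption about the general pattern, not the project's actual code: the function name `load_or_process`, its parameters, and the cache file name are all hypothetical.

```python
import os
import pickle


def load_or_process(cache_path, process_fn, load_cached=False):
    """Load the processed dataset from a pickle file if requested and present;
    otherwise run the (slow) processing step and save the result."""
    if load_cached and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = process_fn()
    # Make sure the target folder (e.g. data/) exists before writing
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

A call such as `load_or_process("data/processed.p", process_fn, load_cached=True)` would then skip processing whenever the file already exists. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.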
Currently, we have implemented cross-validation for the XGB and 1D convolutional neural network models.
Additionally, we have grid-searched hyperparameters for the XGB model, as it was the most promising in initial tests.
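For readers unfamiliar with the two techniques, here is a minimal standard-library sketch of k-fold cross-validation combined with a grid search. It is not the project's implementation: the function names, the contiguous (unshuffled) folds, and the toy threshold classifier are all assumptions made for illustration, and real pipelines would typically shuffle or stratify the folds.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(fit, predict, X, y, k, **params):
    """Average held-out accuracy over k folds for one parameter setting."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def grid_search(fit, predict, X, y, k, grid):
    """Try every combination in `grid`; return the best parameters and score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, k, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy classifier: predict 1 when the feature exceeds a fixed threshold.
def fit_threshold(X, y, threshold):
    return threshold  # nothing is learned; the hyperparameter is the model


def predict_threshold(model, X):
    return [1 if x > model else 0 for x in X]


# Made-up data, unrelated to the flow cytometry dataset
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best = grid_search(fit_threshold, predict_threshold, X, y, k=4,
                   grid={"threshold": [2.5, 4.5, 6.5]})
print(best)
```

The same structure applies to a real model: `fit` would train the classifier with the given hyperparameters, and the grid would hold the hyperparameter values to compare.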
Best average accuracy: Gradient Boosted Classifier (XGB) - 80% (F1 Score also 0.8)
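As a reminder of how the two reported metrics relate, the sketch below computes accuracy and F1 score from true and predicted labels. The labels are made up purely to show the arithmetic and are not the project's predictions; the function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the CLL class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Made-up labels: 1 = CLL, 0 = non-CLL
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Accuracy and F1 coincide in this toy case because the false positives and false negatives are balanced; on imbalanced data the two can differ substantially.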