This project contains a machine learning pipeline for predicting the funding success of DonorsChoose projects, with a particular focus on the impact of poverty levels.
DonorsChoose is a platform where teachers can request resources for their classrooms. This project aims to predict whether a project will be successfully funded based on various features, including poverty levels of the school districts.
- run_pipeline.py: Main script to run the entire pipeline
- config.json: Configuration file for dataset paths and feature selection
- data/: Directory containing all raw datasets
- preprocessing/: Directory containing preprocessing scripts
  - feature_selection.py: Script for merging datasets and selecting features
  - data_cleaning.py: Script for cleaning and preprocessing the data
  - data_segment_and_balance.py: Script for segmenting and balancing the data
- features/: Directory for feature engineering scripts
  - feature_engineering.py: Script for feature engineering
- split/: Directory for data splitting scripts
  - train_test_split.py: Script for splitting data into training and testing sets
- model/: Directory for machine learning models, including training, validation, and selection scripts
  - model_training.py: Script for training the model
  - model_evaluation.py: Script for evaluating the model
  - feature_importance.py: Script for determining feature importance
  - recommendation.py: Script for generating recommendations
- outputs/: Directory where processed datasets are saved
- figures/: Directory for saving generated figures and plots
- notebooks/: Directory for Jupyter notebooks used for generating graphs and as a playground for experimentation
- Clone this repository to your local machine.
- Ensure you have Python installed (preferably Python 3.7+).
- Install the required dependencies:
  ```
  pip install pandas numpy
  ```
- Place your raw DonorsChoose datasets in the data/ directory (make sure the file names match those in config.json).
- Review and update the config.json file if necessary.
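As a quick sanity check before running the pipeline, you can verify that the configured dataset files actually exist. This is a hypothetical helper, assuming raw_datasets in config.json maps dataset names to CSV paths (see the configuration description later in this README):

```python
import json
import os

# Load the pipeline configuration.
with open("config.json") as f:
    config = json.load(f)

# Assumption: raw_datasets maps dataset names to CSV file paths.
for name, path in config["raw_datasets"].items():
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status}: {name} -> {path}")
```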
You can run the entire pipeline using the run_pipeline.py script:

```
python run_pipeline.py
```
This will execute the following steps:
- Merge datasets and select features
- Perform feature engineering
- Clean and preprocess the data
- Split data into training and testing sets
- Segment and balance the data
- Train the machine learning model
- Evaluate the model
- Determine feature importance
- Generate recommendations
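A minimal sketch of what this orchestration could look like (an illustration only, not the actual contents of run_pipeline.py):

```python
import subprocess
import sys

# Pipeline stages in execution order; each is a standalone script.
STAGES = [
    "preprocessing/feature_selection.py",
    "features/feature_engineering.py",
    "preprocessing/data_cleaning.py",
    "split/train_test_split.py",
    "preprocessing/data_segment_and_balance.py",
    "model/model_training.py",
    "model/model_evaluation.py",
    "model/feature_importance.py",
    "model/recommendation.py",
]

for stage in STAGES:
    print(f"Running {stage} ...")
    # Abort the run as soon as any stage fails.
    result = subprocess.run([sys.executable, stage])
    if result.returncode != 0:
        sys.exit(f"Stage failed: {stage}")
```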
Alternatively, you can run the scripts individually in the following order:
- python preprocessing/feature_selection.py
- python features/feature_engineering.py
- python preprocessing/data_cleaning.py
- python split/train_test_split.py
- python preprocessing/data_segment_and_balance.py
- python model/model_training.py
- python model/model_evaluation.py
- python model/feature_importance.py
- python model/recommendation.py
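To give a sense of what an individual stage looks like, here is a rough sketch of a train/test split step using only pandas. The file names and the 80/20 ratio are assumptions; the actual train_test_split.py reads its settings (such as test_splits) from config.json and may differ:

```python
import pandas as pd

# Hypothetical file names; the real script may derive paths from config.json.
df = pd.read_csv("outputs/cleaned_data.csv")

# Hold out 20% of rows for testing (the ratio here is an assumption);
# fixing the seed keeps the split reproducible.
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

train_df.to_csv("outputs/train.csv", index=False)
test_df.to_csv("outputs/test.csv", index=False)
```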
The config.json file contains important settings for the pipeline:
- raw_datasets: Paths to the input CSV files (donations, essays, projects, resources, outcomes)
- dataset: Path for the output cleaned dataset
- features_to_use: List of features to select from the merged dataset
- one_hot_encode_features: List of categorical features to one-hot encode
- models: List of models to use
- projects_imputation: Methods for imputing missing values in the projects dataset
- poverty_columns: Mapping of poverty levels
- split_by_poverty: Whether to split data by poverty level
- test_splits: Configuration for test splits
- poverty_level_replacements: Replacements for poverty levels
- quant_variables: List of quantitative variables
- stem_cols: List of STEM-related columns
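As an example of how a script might consume these settings, the sketch below loads the configuration and applies the feature selection and one-hot encoding keys. The merged-dataset path and the exact key shapes are assumptions based on the descriptions above:

```python
import json
import pandas as pd

with open("config.json") as f:
    config = json.load(f)

# Hypothetical intermediate file produced by the merge step.
df = pd.read_csv("outputs/merged.csv")

# Keep only the configured feature columns.
df = df[config["features_to_use"]]

# One-hot encode the configured categorical features.
df = pd.get_dummies(df, columns=config["one_hot_encode_features"])
```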
The pipeline generates several types of output files in the outputs/ directory, including:
- CSV files: Containing datasets at various stages of processing (e.g., selected features, cleaned data, training and testing sets, model outputs)
- PKL files: Serialized model objects
Additionally, figures and plots generated during the analysis are saved in the figures/ directory.
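To inspect these artifacts, you can load them back with pandas and pickle; the file names below are hypothetical examples:

```python
import pickle
import pandas as pd

# Load one of the processed CSV outputs (hypothetical name).
test_df = pd.read_csv("outputs/test.csv")
print(test_df.head())

# Load a serialized model object (hypothetical name).
with open("outputs/model.pkl", "rb") as f:
    model = pickle.load(f)
print(type(model))
```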