Aim: Given a customer's profile, predict which type of touchpoint has the highest probability of resulting in a purchase.
Project overview:
- Created a tool that predicts the best touchpoint for a customer based on their profile
- Optimized Random Forest and XGBoost classifiers using GridSearchCV to get the best model
- Serialized the final model with pickle so it is ready for deployment
Workflow:
data-cleaning_and_eda.ipynb -> model-building.ipynb
For categorical variables, I applied one-hot encoding: each category value gets its own binary column.
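The encoding step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the DataFrame and its column names (`age`, `marital_status`) are hypothetical stand-ins for the real customer profile fields.

```python
import pandas as pd

# Hypothetical customer profiles; the column names are illustrative only.
profiles = pd.DataFrame({
    "age": [34, 51, 27],
    "marital_status": ["married", "single", "unknown"],
})

# Create one binary (0/1) column per category value of the categorical variable.
encoded = pd.get_dummies(profiles, columns=["marital_status"], dtype=int)
print(encoded)
```

Numeric columns such as `age` pass through unchanged, while `marital_status` is replaced by one indicator column per category.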
Metrics for evaluating models:
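- Multiclass log loss: since the model predicts a probability for each candidate touchpoint, log loss measures how far the predicted probability distributions are from the true labels, averaged over all customers.
- F1-score (micro): chosen because the label classes are imbalanced. Note that for single-label multiclass problems, micro-averaged F1 is equal to accuracy.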
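Both metrics are available in scikit-learn. The sketch below uses a made-up toy example with three hypothetical touchpoint classes (0, 1, 2) to show how they are computed; the labels and probabilities are not taken from the project data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Toy example with three hypothetical touchpoint classes (0, 1, 2).
y_true = np.array([0, 1, 2, 2])

# Each row is a predicted probability distribution over the three classes.
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],  # misclassified: the true class (2) only gets 0.2
])
y_pred = y_proba.argmax(axis=1)

# Log loss: mean negative log of the probability assigned to the true class.
loss = log_loss(y_true, y_proba, labels=[0, 1, 2])

# Micro-averaged F1 pools true/false positives over all classes.
micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print(f"log loss={loss:.4f}, micro F1={micro_f1:.4f}, accuracy={accuracy:.4f}")
```

Log loss penalizes the last row heavily because the model put little probability on the true class, which a hard-label metric such as F1 cannot express.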
I wrote a custom script to split the dataset into train, validation, and test sets using stratified sampling: 80% train, 10% validation, and 10% test.
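The custom script itself is not reproduced here. One common way to get a stratified 80/10/10 split, shown below as an assumption rather than the author's implementation, is to call scikit-learn's `train_test_split` twice. The helper name `stratified_split` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(X, y, seed=42):
    """Split into 80% train, 10% validation, 10% test, preserving class ratios."""
    # First split: 80% train, 20% held out.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Second split: divide the held-out 20% evenly into validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic imbalanced labels (70% / 20% / 10%) for illustration only.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 700 + [1] * 200 + [2] * 100)

train, val, test = stratified_split(X, y)
for name, (_, labels) in [("train", train), ("val", val), ("test", test)]:
    print(name, len(labels), np.bincount(labels))
```

Stratifying on `y` in both calls keeps the class proportions the same in all three sets, which matters when some touchpoints are rare.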
Random Forest
I picked the Random Forest classifier because it trains quickly, which lets GridSearchCV iterate over hyperparameters efficiently. After initializing and tuning the RandomForestClassifier with GridSearchCV, I got a train accuracy of 1.0 and a test accuracy of 0.77688, which indicates overfitting.
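A minimal sketch of this tuning step is shown below. The synthetic dataset, the parameter grid values, and the choice of `neg_log_loss` as the scoring function are all assumptions made for illustration; the notebook's actual grid is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the encoded customer profiles.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search space; limiting max_depth is one way to curb overfitting.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",  # assumed here to match the log loss metric
    cv=3,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
train_acc = best_rf.score(X_train, y_train)
test_acc = best_rf.score(X_test, y_test)
print(search.best_params_)
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

Comparing `train_acc` with `test_acc` is the same check that exposed the overfitting described above: a large gap between the two suggests the forest has memorized the training set.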
Judging by its feature importances, the RF classifier relies most heavily on average spending, income, and age.
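Feature importances like these can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data, and the feature names are hypothetical labels chosen to echo the text, not the real column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; the first three columns are the informative ones
# because shuffle=False keeps informative features first.
feature_names = ["avg_spending", "income", "age", "noise_1", "noise_2"]
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=0,
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These scores are impurity-based, so they describe how much each feature was used to split the training data rather than its causal effect on purchases.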
XGBoost
Initial XGB model
XGB model after tuning max_depth, min_child_weight and reg_alpha with GridSearchCV
The XGBoost model assigns high importance to the 'unknown' marital status feature. One possible explanation is that only 44 customers have an 'unknown' marital status, so the model may be placing extra weight on this rare feature. Given the small sample, this importance should be interpreted with caution.
XGBoost Accuracy: 0.9678972712680578
XGBoost F1-Score (Micro): 0.9678972712680578
I picked the final XGBoost model because its accuracy and F1-score (about 0.968) are significantly higher than the Random Forest test accuracy (about 0.777). Overfitting can also be controlled by further tuning the reg_alpha value, which sets the strength of L1 regularization.
I included a pickle file so the model can be deployed behind a Flask API in the future. For productionization, a Flask API endpoint can be hosted on a server; it would take in a list of values from a customer's profile and return the recommended touchpoint.
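The save, load, and predict cycle behind such an endpoint could look roughly like the sketch below. It is an assumption-laden illustration: a small RandomForestClassifier on synthetic data stands in for the final XGBoost model, the helper `recommend_touchpoint` is hypothetical, and the model is pickled to bytes in memory rather than to the repository's actual file.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (not the project's XGBoost model).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize and restore; pickle.dump / pickle.load work the same way with files.
payload = pickle.dumps(model)
restored = pickle.loads(payload)


def recommend_touchpoint(model, profile_values):
    """Return the most probable class and the full probability vector."""
    row = np.asarray(profile_values, dtype=float).reshape(1, -1)
    proba = model.predict_proba(row)[0]
    return model.classes_[int(proba.argmax())], proba


label, proba = recommend_touchpoint(restored, X[0])
print("recommended touchpoint class:", label)
print("class probabilities:", proba)
```

A Flask route would wrap `recommend_touchpoint`, parsing the profile values from the request and returning the label as JSON. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.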