Using data from Taarifa and the Tanzanian Ministry of Water, the goal is to predict which pumps are functional, which need some repairs, and which don't work at all. A better understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. Click here to visit the competition page.
This repo contains my EDA, data cleaning, feature selection and modelling work for this DrivenData competition, where I currently hold a top 4% score (0.8235). You can also find some additional experiments I performed for a set of Medium articles that I wrote about this competition.
- EDA: Explore your data using data quality reports and visualization
- Data cleaning & Feature selection: Which features did I include in my model?
- Modelling: From baseline models to ensembles
- Missing data: How should you impute your missing data?
- Feature selection: Dealing with multicollinearity
- Modelling experiments: Stacking specialized models
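
For context on the score quoted above, here is a minimal sketch of how such a score could be computed, assuming the competition metric is the plain classification rate (the fraction of waterpoints whose predicted status matches the true status). The label strings and the `classification_rate` helper are illustrative assumptions, not code taken from this repo.

```python
from typing import Sequence

# Assumed label names for the three pump states described above (illustrative).
LABELS = ("functional", "functional needs repair", "non functional")


def classification_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example: three of the four predictions are correct.
y_true = ["functional", "non functional", "functional needs repair", "functional"]
y_pred = ["functional", "non functional", "functional", "functional"]
score = classification_rate(y_true, y_pred)
print(score)  # 0.75
```

Under this assumption, a score of 0.8235 would mean that roughly 82% of the waterpoints in the test set were assigned the correct status.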