This article has been deprecated. For an updated version, check this blog post.
By Shuai Wang
"If the only tool you have is a hammer, you tend to see every problem as a nail." -- Abraham Maslow
According to a talk by Anthony Goldbloom, CEO of Kaggle, there are only two winning approaches:
- Handcrafted
- Neural Networks
For the first category, "it has almost always been ensembles of decision trees that have won". Random Forest used to be the big winner, but recently XGBoost has cropped up, winning practically every competition in the structured data category.
On the other hand, for any problem involving images or speech, deep learning is the way to go. Winners of these competitions spend almost none of their time on feature engineering; instead, they spend it constructing neural networks.
So, the major takeaway from the talk is that if you have lots of structured data, the handcrafted approach is your best bet, and if you have unusual or unstructured data your efforts are best spent on neural networks.
I've recently conducted a use case proposal for a Kaggle competition, and I analyzed (in the appendix) why deep neural networks, while theoretically able to approximate any function, are in practice easily dwarfed by ensemble tree algorithms in terms of effort put in versus results. That said, stacking deep nets into the final ensemble might boost performance, as long as the features they extract are not highly correlated with those from ensemble trees.
======
- Divide Forces
- Data Setup
- Literature Review
- Establish an Evaluation Framework
- Exploratory Data Analysis
- Preprocessing
- Feature Engineering
- Model(s) Tuning
- Ensemble
======
1 + 1 is not always equal to 2 when it comes to team collaboration. More often, it is smaller than 2 due to factors such as individual preferences for scripting languages, workflows, etc. Besides using collaborative tools (Skype, svn, git) to keep communication effective, dividing forces efficiently is key to getting great results as a team.
According to Kaggle winners' interviews, forming teams can make them more competitive as "combining features really helps improve performance". At the same time, it helps to specialize, having different members focus on different stages of the data pipeline.
So one plausible plan is as follows:
- Complete data setup
- Review literature
- Establish an evaluation framework as a team, either single-model nested CV or ensemble
- People with strong statistical background should tackle the EDA task
- People with strong scripting (such as Python) skills should then, with insights from the EDA phase, preprocess the data (this step might involve SQL depending on data sources)
- People with domain expertise, strong statistical skills, and/or practical experience in working with features should focus on the feature engineering part (note: for deep learning projects this step is not always necessary). The output of feature engineering, as suggested by the 1st place solution for Home Depot Product Search Relevance, should be serialized pandas dataframes.
- People with machine learning expertise should handle the model tuning and ensemble parts
So in a traditional machine learning problem, after the preprocessing step, the people working on feature engineering should produce new feature sets frequently so that the machine learning people can tune and train models continuously. In the deep learning regime, on the other hand, most of the time would be spent on network architecture design and model/parameter tuning.
Clearly documented, reproducible code and data files are crucial.
======
- Get the data: for competitions such as Kaggle we can usually get the data effortlessly by downloading it from the website. If we are working with clients, getting data might require further effort, such as accessing their database or cloud storage.
- Store the data: Either locally or on the cloud, depending on the situation. It's beneficial to have some backup mechanism in place.
======
This may be one of the most important (yet overlooked) parts of competing on Kaggle, or doing data science in general. To avoid reinventing the wheel and to get inspired on how to preprocess, engineer, and model the data, it's worth spending 1/10 to 1/5 of the project time just researching how people have previously dealt with similar problems/datasets. Some good places to start are:
- No Free Hunch: the official Kaggle blog; see articles under the Tutorials and Winners' Interviews categories
- Kaggle: browse through active/completed competitions and study projects similar to the one we are working on
- Google is our friend :)
Time spent on literature review is time well spent.
======
Having a sound evaluation framework is crucial for all the work that follows: if we use suboptimal metrics or don't have an effective cross validation strategy to guide us in tuning generalizable models, we are aiming at the wrong target and wasting our time.
Metrics
In terms of the metrics to use, if we are competing on Kaggle we can find them under Evaluation. If, however, we are working on a real-world data problem where we are free to craft/choose the ruler that measures how good (or bad) our models are, care should be given to choosing the 'right' metric, the one that makes the most sense for the domain and the problem/data at hand. For example, we all know that wrongly classifying a spam email as non-spam does less harm than classifying a non-spam email as spam (imagine missing an important meeting email from your boss...). The same goes for medical diagnosis. In such situations, a simple metric such as accuracy is not enough, and we might want to consider other metrics, such as precision, recall, F1, ROC AUC score, etc.
A good place to view different metrics and their use cases is the sklearn.metrics page.
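As a quick illustration, here is a minimal sketch (with made-up labels and predictions, just for demonstration) of how several of these metrics can be computed with scikit-learn:
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# hypothetical ground truth and model outputs, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.2, 0.3, 0.8, 0.9, 0.35, 0.6, 0.7, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # threshold predicted probabilities at 0.5

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
print('ROC AUC  :', roc_auc_score(y_true, y_prob))  # uses probabilities, not hard labels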
Cross Validation (CV) Strategies
The key point of machine learning is to build models that can generalize to unseen data, and a good cross validation strategy helps us achieve that (while a bad strategy misleads us and gives us false confidence).
Some good reads on cross validation strategies:
- 10-fold CV: some discussion regarding why 10 is the magic number
- Nested cross-validation: Varma and Simon concluded that the true error of the estimate is almost unbiased relative to the test set when nested cross-validation is used
- Cross validation strategy when blending/stacking
Here is one simple example of nested cross-validation in Python using scikit-learn:
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train = ...  # your training features
y_train = ...  # your training labels

# inner loop: grid search over hyperparameters with 5-fold CV
gs = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={
        'n_estimators': [100, 200, 400, 600, 800],
        # other params to tune
    },
    scoring='accuracy',
    cv=5
)

# outer loop: 2-fold CV reporting an unbiased estimate of generalization performance
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=2)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
In a nutshell, the inner loop (GridSearchCV) is used to tune parameters, while the outer loop (cross_val_score) is used to train with the optimal parameters and report an unbiased score. This is known as 2x5 cross-validation (although we can do 3x5, 5x4, or any fold combination that works for our data size and computational constraints).
======
EDA is an approach to analyzing data sets in order to summarize their main characteristics, often with plots. EDA helps data scientists get a better understanding of the dataset at hand, and guides them in preprocessing data and engineering features effectively. Some good resources to help carry out effective EDA are:
- Think Stats 2: an introduction to probability and statistics for Python programmers (its GitHub repository holds some useful code and IPython notebooks for conducting exploratory data analysis)
- Data Science with Python & R: Exploratory Data Analysis
- People often post their EDA code on the Kaggle forum at the beginning of each competition, such as this, so check out the forum frequently.
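As a minimal sketch of a first EDA pass (assuming the data sits in a hypothetical train.csv), pandas and matplotlib already get us quite far:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')  # hypothetical file name

print(df.shape)           # number of rows and columns
print(df.dtypes)          # column types
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics for numeric columns

# quick look at distributions and pairwise correlations of numeric columns
numeric_df = df.select_dtypes(include='number')
numeric_df.hist(figsize=(12, 8))
plt.show()
print(numeric_df.corr())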
======
The one and only reason we need to preprocess data is so that a machine learning algorithm can learn most effectively from them. Specifically, three issues need to be addressed:
- Sources: we need to integrate data from multiple sources to improve predictability
- Quality: incomplete, noisy, inconsistent data
- Format: incompatible data format. E.g., generalized linear models require one-hot encoding of categorical variables. Another case in point is using word embeddings (vectors) rather than raw text as input for NLP algorithms.
In practice, tasks in data preprocessing include:
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: use multiple data sources and join/merge data
- Data transformation: normalize, aggregate, and embed (e.g., word embeddings)
- Data reduction: reduce the volume but produce the same or similar analytical results (e.g., PCA)
- Data discretization: replace numerical attributes with nominal/discrete ones
Also check the CCSU course page and the scikit-learn documentation for data preprocessing in general and how to do it effectively in Python.
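As an illustrative sketch (assuming a reasonably recent scikit-learn and hypothetical column names), several of these tasks, imputing missing values, scaling numeric columns, and one-hot encoding categorical ones, can be chained together:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.read_csv('train.csv')            # hypothetical file
numeric_cols = ['age', 'income']         # hypothetical column names
categorical_cols = ['city', 'device']

numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill in missing values
    ('scale', StandardScaler()),                   # zero mean, unit variance
])
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
])

X = preprocess.fit_transform(df[numeric_cols + categorical_cols])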
======
"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." -- Andrew Ng
While in deep learning we usually just normalize the data (such that the image pixels have zero mean and unit variance), in traditional machine learning we need handcrafted features to build accurate models. Doing feature engineering is both art and science, and requires iterative experiments and domain knowledge. Feature engineering boils down to feature selection and creation.
Selection
scikit-learn offers some great feature selection methods. They can be categorized as one of the following:
- Removing features with low variance
- Univariate feature selection
- Recursive feature elimination
- Selecting from model
Here is one simple example of selecting features from a model in Python using scikit-learn:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
X_train = ... # your training features
y_train = ... # your training labels
# can be any estimator that has attribute 'feature_importances_' or 'coef_'
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
fs = SelectFromModel(model, prefit=True)
X_train_new = fs.transform(X_train) # columns selected
In the above example, features whose importance falls below the default threshold (the mean of feature_importances_) will be eliminated.
Some good articles about feature selection are:
- How do I perform feature selection
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It
- Streamlining feature engineering
Creation
This is the part where domain knowledge and creativity come in. For example, insurers can create features that help their machine learning models better identify customers with higher risks; the same goes for geologists who work with geological data.
But here are some general methods that help you create features to boost model performance:
- Add zero_count for each row (especially for sparse matrix)
- Separate dates into year, month, day, weekday/weekend, etc.
- Add percentile change from feature to feature (or other interactions among features)
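As a minimal sketch of the first two ideas (assuming a pandas dataframe with a hypothetical date column and numeric features; the column names are illustrative):
import pandas as pd

df = pd.read_csv('train.csv', parse_dates=['date'])  # hypothetical file and columns

# count of zero entries per row (often informative for sparse data)
numeric_cols = df.select_dtypes(include='number').columns
df['zero_count'] = (df[numeric_cols] == 0).sum(axis=1)

# split the date into its components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['is_weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)

# interaction between two hypothetical features
df['price_per_unit'] = df['price'] / (df['quantity'] + 1e-9)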
After adding new handcrafted features, we need to perform another round of feature selection (e.g., using SelectFromModel to eliminate non-contributing features). Note that different classifiers might select different features, and it's important that features selected with a certain classifier are later trained with that same classifier. For classifiers that have neither a feature_importances_ nor a coef_ attribute (e.g., nonparametric classifiers such as KNN), the best approach is to cross validate the feature sets selected by various classifiers (i.e., to pick the set of features that yields the highest CV score).
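For instance, a rough sketch of comparing candidate feature subsets with a KNN classifier (the subsets here are hypothetical outputs of different selectors, and the training data is assumed to be a pandas dataframe):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_train = ...  # your training features (a pandas dataframe in this sketch)
y_train = ...  # your training labels

# hypothetical feature subsets produced by different selectors
candidate_feature_sets = {
    'from_random_forest': ['f1', 'f3', 'f7'],
    'from_l1_logreg': ['f1', 'f2', 'f5', 'f7'],
}

knn = KNeighborsClassifier(n_neighbors=5)
for name, cols in candidate_feature_sets.items():
    scores = cross_val_score(knn, X_train[cols], y_train, scoring='accuracy', cv=5)
    print('%s: %.3f' % (name, scores.mean()))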
Feature engineering is the jewel in the crown of machine learning. As machine learning professor [Pedro Domingos](http://homes.cs.washington.edu/~pedrod/) puts it: "...some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."
======
If we've got this far, model tuning is actually the easy part: we simply provide the parameter search space (from coarse to refined) for certain classifiers, hit the run button, and let the machines do all the heavy lifting. In this regard, the nested CV strategy discussed above will do the job perfectly.
If only things were this simple. In practice, however, we face a big constraint: time (the deadline to deliver the model to clients or to a Kaggle competition). To add more complexity, a single model almost never generates the best results, meaning we need to stack many models together to deliver great performance. How to effectively stack (or ensemble) models is discussed in the next section.
So, one alternative is to optimize over the search space automatically, rather than manually setting each parameter from coarse to refined. Several Kaggle winners use and recommend hyperopt, a Python library for serial and parallel parameter optimization.
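As a rough sketch (not an official recipe) of how hyperopt can be wired up with cross validation, again using a random forest and accuracy as illustrative choices, with a search space that is purely for demonstration:
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train = ...  # your training features
y_train = ...  # your training labels

# illustrative search space
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 800, 100),
    'max_depth': hp.quniform('max_depth', 3, 15, 1),
    'max_features': hp.uniform('max_features', 0.3, 1.0),
}

def objective(params):
    clf = RandomForestClassifier(
        n_estimators=int(params['n_estimators']),
        max_depth=int(params['max_depth']),
        max_features=params['max_features'],
        random_state=0,
    )
    score = cross_val_score(clf, X_train, y_train, scoring='accuracy', cv=5).mean()
    return {'loss': -score, 'status': STATUS_OK}  # hyperopt minimizes the loss

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)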
More notes/code will be added w.r.t. how to use hyperopt effectively.
======
Training one machine learning model (e.g., XGBoost, Random Forest, KNN, or SVM) can give us decent results, but not the best ones. If in practice we want to boost performance, we might want to stack models using a combiner, such as logistic regression. The 1st place solution for the Otto Group Product Classification Challenge is AWESOME. Here is one way to do stacking:
- Preprocessing and feature engineering (which are addressed above)
- First layer: use nested CV to train N models (the 1st place team in the Otto competition used the same fold indices), generating N * C (C = number of classes) probabilistic predictions (or, meta features) for each data point.
- Second layer
- Single combiner: use the N * C meta features from the first layer, plus optionally handcrafted features, to train another model (e.g., logistic regression) with nested CV.
- Multiple combiners: train multiple single combiners with nested CV (the 1st place team in the Otto competition used random indices across different combiners).
- Third layer (for multiple combiners): cross validate the weights of the combiners in the second layer (these can be simple linear weights or more complex ones, such as a geometric mean)
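As a minimal sketch of the first two layers (out-of-fold meta features from a few base models plus a logistic regression combiner), assuming a generic classification dataset; this is illustrative rather than the exact Otto recipe:
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X_train = ...  # your training features
y_train = ...  # your training labels

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=10),
]

# first layer: out-of-fold probabilistic predictions (N models * C classes meta features)
meta_features = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')
    for m in base_models
])

# second layer: a single logistic regression combiner trained on the meta features
combiner = LogisticRegression(max_iter=1000)
combiner.fit(meta_features, y_train)
At prediction time, each base model would be refit on the full training set to produce the corresponding meta features for the test data before feeding them to the combiner.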