Denote the set of features by $\mathcal{F}$ and the set of classes by $\mathcal{C}$. Starting with the full training data $\mathcal{D}$ of size $N$, we grow a tree by splitting the data at each node into two parts according to the values of a selected feature, until no more splits can be generated (or some regularization technique stops the growth).
Conceptually, we want to split the data at a node such that the division of class labels becomes clearer at the two child nodes. Technically, we choose the most informative feature at each node, i.e., the one whose split yields the biggest reduction in uncertainty.
The math below defines the measure of uncertainty and specifies exactly what we mean by an informative feature.
Given the training data $\mathcal{D}_m$ of size $N_m$ at node $m$, the frequency of samples in class $k \in \mathcal{C}$ is

$$p_{mk} = \frac{1}{N_m} \sum_{(x, y) \in \mathcal{D}_m} \mathbf{1}(y = k),$$

where $(x, y)$ is a sample in $\mathcal{D}_m$ and $y$ is its label.
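As a quick illustration, the class frequencies at a node can be computed from the labels alone; the label array below is an assumed example, not data from the text.

```python
import numpy as np

# Class labels of the samples that reached a node (illustrative example).
labels = np.array([0, 0, 1, 1, 1, 2])

# p_mk: fraction of samples at the node belonging to each class k.
classes, counts = np.unique(labels, return_counts=True)
for k, p_k in zip(classes, counts / labels.size):
    print(f"p for class {k}: {p_k:.3f}")
```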
There are two common measures of uncertainty (or impurity):

$$\text{Gini:} \quad G_m = \sum_{k \in \mathcal{C}} p_{mk}\,(1 - p_{mk}), \qquad \text{Entropy:} \quad H_m = -\sum_{k \in \mathcal{C}} p_{mk} \log p_{mk}.$$
To understand these measures, we consider a binary problem with classes 0 and 1, and plot the Gini score and entropy against $p_{m1}$.
It is easy to see that both measures reach their maximum at $p_{m1} = 0.5$, which agrees with our intuition: the node is most uncertain when the two classes are equally represented.
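A small sketch (binary case assumed) that evaluates both measures on a few values of $p$ makes this concrete:

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary node with class-1 frequency p."""
    return 2 * p * (1 - p)            # p*(1-p) + (1-p)*p

def entropy(p):
    """Entropy (in bits) of a binary node with class-1 frequency p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
# Both curves peak at p = 0.5 and vanish at p = 0 or 1.
```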
Suppose a candidate split on feature $f$ sends the samples at node $m$ into a left child and a right child. The uncertainty after the split is the size-weighted average

$$U_{\text{split}}(f) = \frac{N_{\text{left}}}{N_m}\,U(\mathcal{D}_{\text{left}}) + \frac{N_{\text{right}}}{N_m}\,U(\mathcal{D}_{\text{right}}),$$

where $U(\cdot)$ is the chosen measure of uncertainty (Gini or entropy) and $N_{\text{left}}$, $N_{\text{right}}$ are the sizes of the two children. Split with the feature that reduces uncertainty the most:

$$f^{*} = \arg\max_{f}\;\Big[\,U(\mathcal{D}_m) - U_{\text{split}}(f)\,\Big].$$
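Here is a minimal sketch of how the reduction in uncertainty for one candidate split could be computed; the choice of Gini and the example labels are assumptions for illustration.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def uncertainty_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = parent.size
    weighted = (left.size / n) * gini_impurity(left) + \
               (right.size / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Illustrative labels at a node and one candidate split of them.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([1, 1])
print(uncertainty_reduction(parent, left, right))   # 0.25
```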
To avoid overfitting, we may tune the following hyperparameters to regularize decision tree learning.
- the maximum depth of the tree
- the minimum number of samples required to split a node
- the minimum number of samples required to be at a leaf node
- the maximum number of leaf nodes
- the minimum decrease in uncertainty required for learning to continue
- the minimum uncertainty required for splitting a node
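To make the role of these hyperparameters concrete, below is a minimal sketch of recursive tree growing in which max_depth and min_samples_split act as stopping rules. The function names, the dictionary-based tree representation, and the exhaustive threshold search are illustrative assumptions, not a full implementation.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size
    return 1.0 - np.sum(p ** 2)

def grow(X, y, depth=0, max_depth=3, min_samples_split=2):
    """Recursively grow a tree; stop when a regularization limit is hit.
    Assumes integer class labels 0, 1, ..., K-1."""
    majority = np.bincount(y).argmax()          # prediction if this becomes a leaf
    if depth >= max_depth or y.size < min_samples_split or gini(y) == 0.0:
        return {"leaf": True, "class": majority}

    # Greedy search for the (feature, threshold) pair with the largest
    # reduction in uncertainty, as described above.
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or (~mask).all():
                continue                         # split must produce two children
            weighted = (mask.sum() * gini(y[mask]) +
                        (~mask).sum() * gini(y[~mask])) / y.size
            gain = gini(y) - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t, mask)

    if best is None:                             # no valid split exists
        return {"leaf": True, "class": majority}

    _, f, t, mask = best
    return {"leaf": False, "feature": f, "threshold": t,
            "left": grow(X[mask], y[mask], depth + 1, max_depth, min_samples_split),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_samples_split)}

# Tiny illustrative dataset: two features, binary labels.
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 0.5], [4.0, 2.0], [5.0, 1.5], [0.5, 2.5]])
y = np.array([0, 0, 0, 1, 1, 0])
tree = grow(X, y, max_depth=2, min_samples_split=2)
```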
Given a new sample $x$, move down the tree by following the branch that $x$ falls into at each split. You finally end up at a leaf node, which tells you which class the sample belongs to.
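A prediction is therefore a walk from the root to a leaf. The sketch below reuses the dictionary-based tree format assumed in the previous sketch, with a tiny hand-built tree for illustration.

```python
# A tiny hand-built tree in the dict format assumed above (purely illustrative).
tree = {"leaf": False, "feature": 0, "threshold": 2.5,
        "left":  {"leaf": True, "class": 0},
        "right": {"leaf": True, "class": 1}}

def predict(tree, x):
    """Walk from the root to a leaf, following the branch x falls into."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["class"]

print(predict(tree, [1.0]))   # class 0
print(predict(tree, [4.0]))   # class 1
```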
In sklearn's DecisionTreeClassifier, the main parameters are:
- criterion: the measure of uncertainty (gini or entropy)
- max_depth: the maximum depth of the tree (None if no restriction)
- min_samples_split: the minimum number of samples required to create a split
- min_samples_leaf: the minimum number of samples required at a leaf node
- class_weight: weights associated with the classes
  - uniform: every class receives the same weight.
  - balanced: class weights are inversely proportional to class frequencies.
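As a usage sketch, the snippet below fits a DecisionTreeClassifier with these parameters; the dataset and the particular parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(
    criterion="gini",         # measure of uncertainty: "gini" or "entropy"
    max_depth=3,              # None would mean no depth restriction
    min_samples_split=2,      # minimum samples required to create a split
    min_samples_leaf=1,       # minimum samples required at a leaf node
    class_weight="balanced",  # weights inversely proportional to class frequencies
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on held-out data
print(clf.predict(X_test[:5]))     # predicted classes for new samples
```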
Check out the documentation listed below to view the attributes that are available in sklearn but not exposed to the user in the software.
- sklearn tutorial on Decision Trees.
- sklearn DecisionTreeClassifier documentation.