# Decision Tree Overview

## What is a Decision Tree?

A **Decision Tree** is a machine learning algorithm that recursively splits a dataset on feature values, forming a binary tree. Predictions are made by traversing from the root node down to a leaf node. Key characteristics include:

- **Splitting Mechanism:** The dataset is recursively split based on feature values.
- **Optimization Criteria:** Uses metrics like entropy or information gain to determine the optimal split. The quality of these splits is fundamental to the algorithm's effectiveness.
- **Versatility:** Applicable to both classification and regression tasks.
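
As a quick illustration, here is a minimal sketch using scikit-learn (an assumption; the text does not prescribe a library), with a toy dataset invented for the example:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two features per sample, binary labels
X = [[2.0, 1.0], [3.0, 1.5], [10.0, 8.0], [11.0, 9.5]]
y = [0, 0, 1, 1]

# criterion="entropy" selects splits by entropy, as described below
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)

print(clf.predict([[2.5, 1.2], [10.5, 9.0]]))  # [0 1]
```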

## The Art of Splits

### Entropy

At a high level, **entropy** quantifies the amount of uncertainty or impurity in a dataset.

#### Formula for Entropy

To calculate the entropy of a particular split:

1. **Calculate Probabilities:** Determine the probability of each class in the split.
2. **Compute Logarithms:** Take the logarithm of each probability.
3. **Multiply:** Multiply each probability by its corresponding logarithm.
4. **Sum and Negate:** Perform the above steps for every class in the leaf, sum the results, and negate the sum to obtain the entropy of that leaf.

**Formula:**
\[
\text{Entropy} = -\sum_{i=1}^{n} p_i \log(p_i)
\]
where \( p_i \) is the probability of class \( i \).
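
To make the steps concrete, here is a minimal sketch in Python (the function name `entropy` and the choice of base-2 logarithms are assumptions for this example; the formula above leaves the base unspecified):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, following the steps above."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]  # step 1
    return sum(-p * math.log2(p) for p in probs)               # steps 2-4

print(entropy(["A", "A", "B", "B"]))  # 1.0 -- a perfectly mixed leaf
print(entropy(["A", "A", "A", "A"]))  # 0.0 -- a pure leaf
```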

#### Why Use Logarithms of Probabilities?

- **Weighting Rare Events:** An event with a probability of 0.01 is rarer, and therefore more surprising, than an event with a probability of 0.9. Logarithms scale these probabilities non-linearly, which is beneficial for emphasizing rare events.
- **Mathematical Properties:** The base-10 logarithm is defined as follows:
  - \( \log(x) = y \) is equivalent to \( 10^y = x \).
  - As \( x \) grows exponentially, \( y \) increases very slowly.
- **Examples:**
  - \( \log(10) = 1 \)
  - \( \log(100) = 2 \)
  - \( \log(1000) = 3 \)
  - \( \log(10^{16}) = 16 \)

This property ensures that large changes in \( x \) result in only modest changes in \( y \), preventing any single class from disproportionately influencing the entropy.
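
A quick numerical check of this weighting (a sketch; base-2 logarithms are used here, but any base only rescales the values):

```python
import math

# Surprise (self-information) of an event with probability p: -log(p)
for p in (0.9, 0.5, 0.01):
    print(f"p = {p:>4}: surprise = {-math.log2(p):.2f} bits")
# p =  0.9: surprise = 0.15 bits
# p =  0.5: surprise = 1.00 bits
# p = 0.01: surprise = 6.64 bits
```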

### Other Splitting Criteria

There are several other methods for determining the optimal splits in a decision tree:

- **Information Gain:** Measures the reduction in entropy achieved by partitioning the data.
- **Gini Impurity:** Evaluates the likelihood of incorrect classification of a randomly chosen element.
- **Gradient-Based Splitting:** Utilizes gradient information for more advanced splitting strategies.

These methods will be covered in detail in subsequent sections.
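
As a brief preview, here is a minimal sketch of the first two criteria (the `entropy` helper is repeated from the earlier example so the snippet stands alone; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance of misclassifying a randomly chosen element."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into two leaves."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

labels = ["A", "A", "B", "B"]
print(gini(labels))                                      # 0.5
print(information_gain(labels, ["A", "A"], ["B", "B"]))  # 1.0
```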

## Workflow of Decision Tree Construction

The process of building a decision tree involves the following steps (a code sketch follows the list):

1. **Iterate Through Features:**
   - Loop through each feature in the dataset.

2. **Determine Feature Range:**
   - Identify the minimum and maximum values of the current feature.

3. **Generate Split Points:**
   - Sort the feature values and take the mean of each pair of consecutive values to create potential split points.

4. **Split the Dataset:**
   - For each mean value, divide the dataset into two subsets:
     - **Left Leaf:** Instances with feature values less than or equal to the mean.
     - **Right Leaf:** Instances with feature values greater than the mean.

5. **Calculate Entropy:**
   - For each resulting leaf, determine the class distribution and compute the entropy.
   - Weight the entropy of each leaf by the fraction of instances it contains.
   - Sum the weighted entropies to obtain the overall entropy of the split.

6. **Evaluate Split Quality:**
   - Compare the calculated entropy with the current best entropy.
   - If the new entropy is lower, update the best split to this one.

7. **Recursive Splitting:**
   - Apply the above steps recursively to each leaf node.
   - Continue until a predefined depth limit or another stopping criterion is reached.

This greedy, recursive process partitions the data to minimize entropy at each split, enhancing prediction accuracy.
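
Here is a minimal, self-contained sketch of steps 1-6 in Python (the function name `best_split` and the list-of-rows data layout are assumptions for this example):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature index, threshold, weighted entropy) of the best split.

    X is a list of feature rows; y is the list of class labels.
    """
    best = (None, None, float("inf"))
    n = len(y)
    for f in range(len(X[0])):                    # step 1: each feature
        values = sorted({row[f] for row in X})    # steps 2-3: sorted values
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2                       # step 3: midpoint split point
            left = [y[i] for i in range(n) if X[i][f] <= t]   # step 4
            right = [y[i] for i in range(n) if X[i][f] > t]
            # step 5: leaf entropies, weighted by each leaf's share of instances
            e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if e < best[2]:                       # step 6: keep the lowest entropy
                best = (f, t, e)
    return best

X = [[2.0], [3.0], [10.0], [11.0]]
y = ["A", "A", "B", "B"]
print(best_split(X, y))  # (0, 6.5, 0.0) -- a perfect split
```

Step 7 would then call this routine recursively on the left and right subsets until a stopping criterion (such as a depth limit or a pure leaf) is reached.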