# Decision Tree Overview

## What is a Decision Tree?

A **Decision Tree** is a machine learning algorithm that recursively splits a dataset on feature values, forming a binary tree. To make a prediction, it traverses from the root node down to a leaf node. Key characteristics include:

- **Splitting Mechanism:** The dataset is recursively split based on feature values.
- **Optimization Criteria:** Uses metrics like Entropy or Information Gain to determine the optimal split. The quality of these splits is fundamental to the algorithm's effectiveness.
- **Versatility:** Applicable to both classification and regression tasks.

## The Art of Splits

### Entropy

At a high level, **Entropy** quantifies the amount of uncertainty or impurity in a dataset.

#### Formula for Entropy

To calculate entropy for a particular split:

1. **Calculate Probabilities:** Determine the probability of each class in the split.
2. **Compute Logarithms:** Take the logarithm of each probability.
3. **Multiply:** Multiply each probability by its corresponding logarithm.
4. **Sum Up and Negate:** Perform the above steps for every class in the leaf, sum the results, and negate the total to obtain the entropy of that leaf (the negation makes entropy non-negative, since \( \log(p_i) \le 0 \)).

**Formula:**
\[
\text{Entropy} = -\sum_{i=1}^{n} p_i \log(p_i)
\]
where \( p_i \) is the probability of class \( i \).
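
As a quick check of the four steps above, here is a minimal entropy function (a sketch, not the repository's implementation; it assumes NumPy and log base 2, since the formula leaves the base unspecified):

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    # counts from np.unique are always positive, so log(0) never occurs here
    return float(-np.sum(probabilities * np.log2(probabilities)))

print(entropy(np.array([0, 1, 0, 1])))  # 1.0 -- a perfectly mixed two-class leaf
print(entropy(np.array([0, 0, 0, 0])))  # 0.0 -- a pure leaf (may print as -0.0)
```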

#### Why Use Logarithms of Probabilities?

- **Weighting Rare Events:** An event with a probability of 0.01 is far rarer, and therefore more surprising, than an event with a probability of 0.9. Logarithms scale probabilities non-linearly, which emphasizes these rare events.
- **Mathematical Properties:** The logarithm function (base 10 in the examples below; the choice of base only rescales the entropy) is defined as:
  - \( \log(x) = y \) is equivalent to \( 10^y = x \).
- As \( x \) grows exponentially, \( y \) increases very slowly.
- **Examples:**
- \( \log(10) = 1 \)
- \( \log(100) = 2 \)
- \( \log(1000) = 3 \)
- \( \log(10^{16}) = 16 \)

This property ensures that large changes in \( x \) result in only modest changes in \( y \), preventing any single class from disproportionately influencing the entropy.
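
As a small worked illustration (using log base 2 here, though the same pattern holds in any base), the term \( -\log(p_i) \) is large for rare events and small for common ones:

```python
import math

for p in (0.9, 0.5, 0.1, 0.01):
    surprise = -math.log2(p)       # information content of a single event
    contribution = p * surprise    # this class's term in the entropy sum
    print(f"p = {p:<4}  -log2(p) = {surprise:6.3f}  p * -log2(p) = {contribution:.3f}")
```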

### Other Splitting Criteria

There are several other methods for determining the optimal splits in a decision tree:

- **Information Gain:** Measures the reduction in entropy achieved by partitioning the data.
- **Gini Impurity:** Evaluates the likelihood of incorrect classification of a randomly chosen element.
- **Gradient-Based Splitting:** Utilizes gradient information for more advanced splitting strategies.

These methods will be covered in detail in subsequent sections.
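
Purely for orientation before those sections, here is a minimal sketch of Gini impurity using one common formulation, \( 1 - \sum_i p_i^2 \) (an illustrative assumption, not code from this repository):

```python
import numpy as np

def gini_impurity(labels: np.ndarray) -> float:
    """Probability of misclassifying a random element if it were labelled
    according to the class distribution of the leaf."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 -- maximally mixed two-class leaf
print(gini_impurity(np.array([0, 0, 0, 0])))  # 0.0 -- pure leaf
```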

## Workflow of Decision Tree Construction

The process of building a decision tree involves the following steps:

1. **Iterate Through Features:**
- Loop through each feature in the dataset.

2. **Determine Feature Range:**
- Identify the minimum and maximum values of the current feature.

3. **Generate Split Points:**
   - Sort the feature values and take the midpoint (mean) of each pair of consecutive values as a candidate split point.

4. **Split the Dataset:**
- For each mean value, divide the dataset into two subsets:
- **Left Leaf:** Instances with feature values less than or equal to the mean.
- **Right Leaf:** Instances with feature values greater than the mean.

5. **Calculate Entropy:**
- For each resulting leaf, determine the class distribution and compute the entropy.
   - Weight the entropy of each leaf by the fraction of samples that fall into it.
- Sum the weighted entropies to obtain the overall entropy of the split.

6. **Evaluate Split Quality:**
- Compare the calculated entropy with the current best entropy.
   - If the new entropy is lower, record this feature and threshold as the best split so far.

7. **Recursive Splitting:**
- Apply the above steps recursively to each leaf node.
- Continue until a predefined depth limit or another stopping criterion is reached.

This greedy, recursive process partitions the data so that entropy is minimized at each split, improving prediction accuracy; a compact sketch of the split search is given below.
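
The workflow above maps fairly directly onto code. The following is a minimal, self-contained sketch of the exhaustive split search, under the assumptions used earlier in this README (log base 2, midpoint thresholds); it is an illustration, not the repository's implementation:

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(X: np.ndarray, y: np.ndarray) -> dict:
    """Search every feature and every midpoint between consecutive sorted
    values for the split with the lowest weighted entropy."""
    best = {"entropy": np.inf, "feature": None, "threshold": None}
    n_samples, n_features = X.shape

    for feature in range(n_features):                # step 1: iterate through features
        values = np.sort(np.unique(X[:, feature]))   # steps 2-3: sorted feature values
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints of consecutive values

        for t in thresholds:                         # step 4: split on each candidate
            left_mask = X[:, feature] <= t
            y_left, y_right = y[left_mask], y[~left_mask]
            if len(y_left) == 0 or len(y_right) == 0:
                continue

            # step 5: entropy of each leaf, weighted by its share of the samples
            weighted = (len(y_left) / n_samples) * entropy(y_left) + (
                len(y_right) / n_samples
            ) * entropy(y_right)

            # step 6: keep the split with the lowest weighted entropy
            if weighted < best["entropy"]:
                best = {"entropy": weighted, "feature": feature, "threshold": t}

    return best

# Step 7 (recursion) would call best_split again on each resulting subset
# until a depth limit or other stopping criterion is reached.
X = np.array([[1.0, 7.0], [2.0, 6.5], [8.0, 7.2], [9.0, 6.8]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # best split is feature 0 at threshold 5.0, where both leaves are pure
```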
