From 34dffef3d9e21f3a6250a335d59438c6f5fe722e Mon Sep 17 00:00:00 2001
From: Arjun
Date: Sun, 8 Dec 2024 13:39:11 +0530
Subject: [PATCH] feat : Added readme to decision trees

---
 trees/decision_trees_ground_up/README.MD | 87 ++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 trees/decision_trees_ground_up/README.MD

diff --git a/trees/decision_trees_ground_up/README.MD b/trees/decision_trees_ground_up/README.MD
new file mode 100644
index 0000000..a918df2
--- /dev/null
+++ b/trees/decision_trees_ground_up/README.MD
@@ -0,0 +1,87 @@
+# Decision Tree Overview
+
+## What is a Decision Tree?
+
+A **Decision Tree** is a machine learning algorithm that recursively splits a dataset on feature values, forming a binary tree. A prediction is derived by traversing from the root node down to a leaf node. Key characteristics include:
+
+- **Splitting Mechanism:** The dataset is recursively split based on feature values.
+- **Optimization Criteria:** Metrics such as Entropy or Information Gain determine the optimal split. The quality of these splits is fundamental to the algorithm's effectiveness.
+- **Versatility:** Applicable to both classification and regression tasks.
+
+## The Art of Splits
+
+### Entropy
+
+At a high level, **Entropy** quantifies the amount of uncertainty or impurity in a dataset.
+
+#### Formula for Entropy
+
+To calculate the entropy of a particular split:
+
+1. **Calculate Probabilities:** Determine the probability of each class in the split.
+2. **Compute Logarithms:** Take the logarithm of each probability.
+3. **Multiply:** Multiply each probability by its corresponding logarithm.
+4. **Sum Up:** Sum the results over every class in the leaf and negate the total to obtain the entropy of that leaf.
+
+**Formula:**
+\[
+\text{Entropy} = -\sum_{i=1}^{n} p_i \log(p_i)
+\]
+where \( p_i \) is the probability of class \( i \).
+
+#### Why Use Logarithms of Probabilities?
+
+- **Weighting Rare Events:** An event with a probability of 0.01 is rarer, and therefore more informative, than an event with a probability of 0.9. Logarithms scale these probabilities non-linearly, which emphasizes rare events.
+- **Mathematical Properties:** The logarithm function (base 10 in the examples below) is defined as:
+  - \( \log(x) = y \) is equivalent to \( 10^y = x \).
+  - As \( x \) grows exponentially, \( y \) increases very slowly.
+  - **Examples:**
+    - \( \log(10) = 1 \)
+    - \( \log(100) = 2 \)
+    - \( \log(1000) = 3 \)
+    - \( \log(10^{16}) = 16 \)
+
+This property ensures that large changes in \( x \) result in only modest changes in \( y \), preventing any single class from disproportionately influencing the entropy.
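+
+As a concrete illustration of the formula above, here is a minimal sketch of the entropy calculation (written with `numpy` and log base 2; the function name and the choice of base are illustrative, not necessarily those used in this directory's code):
+
+```python
+import numpy as np
+
+def entropy(labels):
+    """Shannon entropy of an array of class labels, in bits (log base 2)."""
+    _, counts = np.unique(labels, return_counts=True)  # occurrences of each class
+    p = counts / counts.sum()                          # class probabilities
+    return -np.sum(p * np.log2(p))                     # -sum(p_i * log(p_i))
+
+# A balanced 50/50 leaf yields exactly 1 bit; a skewed leaf yields less.
+print(entropy(np.array([0, 0, 1, 1])))  # 1.0
+print(entropy(np.array([0, 1, 1, 1])))  # ~0.811
+```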
+
+### Other Splitting Criteria
+
+There are several other methods for determining the optimal splits in a decision tree:
+
+- **Information Gain:** Measures the reduction in entropy achieved by partitioning the data.
+- **Gini Impurity:** Evaluates the likelihood of incorrect classification of a randomly chosen element.
+- **Gradient-Based Splitting:** Utilizes gradient information for more advanced splitting strategies.
+
+These methods will be covered in detail in subsequent sections.
+
+## Workflow of Decision Tree Construction
+
+The process of building a decision tree involves the following steps (a code sketch at the end of this section ties them together):
+
+1. **Iterate Through Features:**
+   - Loop through each feature in the dataset.
+
+2. **Determine Feature Range:**
+   - Identify the minimum and maximum values of the current feature.
+
+3. **Generate Split Points:**
+   - Sort the feature values and take the midpoint between each pair of consecutive values as a potential split point.
+
+4. **Split the Dataset:**
+   - For each split point, divide the dataset into two subsets:
+     - **Left Leaf:** Instances with feature values less than or equal to the split point.
+     - **Right Leaf:** Instances with feature values greater than the split point.
+
+5. **Calculate Entropy:**
+   - For each resulting leaf, determine the class distribution and compute the entropy.
+   - Weight the entropy of each leaf by the fraction of samples that fall into it.
+   - Sum the weighted entropies to obtain the overall entropy of the split.
+
+6. **Evaluate Split Quality:**
+   - Compare the calculated entropy with the current best entropy.
+   - If the new entropy is lower, record this split as the best one found so far.
+
+7. **Recursive Splitting:**
+   - Apply the above steps recursively to each leaf node.
+   - Continue until a predefined depth limit or another stopping criterion is reached.
+
+This greedy, recursive process partitions the data so that each split reduces entropy as much as possible, improving prediction accuracy.
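+
+Below is a minimal sketch of the split search described in steps 1-6, written with plain `numpy`. The function and variable names (`best_split`, `entropy`, and so on) are illustrative and may differ from the code in this directory:
+
+```python
+import numpy as np
+
+def entropy(labels):
+    """Shannon entropy (log base 2) of an array of class labels."""
+    _, counts = np.unique(labels, return_counts=True)
+    p = counts / counts.sum()
+    return -np.sum(p * np.log2(p))
+
+def best_split(X, y):
+    """Return (feature_index, threshold, weighted_entropy) of the lowest-entropy split.
+
+    X is a 2-D array of shape (n_samples, n_features); y is a 1-D array of labels.
+    """
+    best = (None, None, np.inf)
+    n_samples, n_features = X.shape
+    for feature in range(n_features):                  # 1. iterate through features
+        values = np.sort(np.unique(X[:, feature]))     # 2-3. sorted feature values
+        thresholds = (values[:-1] + values[1:]) / 2.0  # 3. midpoints as split points
+        for t in thresholds:
+            left = y[X[:, feature] <= t]               # 4. left leaf (<= threshold)
+            right = y[X[:, feature] > t]               # 4. right leaf (> threshold)
+            # 5. entropy of each leaf, weighted by its share of the samples
+            weighted = ((len(left) / n_samples) * entropy(left)
+                        + (len(right) / n_samples) * entropy(right))
+            if weighted < best[2]:                     # 6. keep the lowest entropy
+                best = (feature, t, weighted)
+    return best
+
+# Tiny usage example: the single feature cleanly separates the two classes,
+# so the best split sits between 2.0 and 10.0 with zero weighted entropy.
+X = np.array([[1.0], [2.0], [10.0], [11.0]])
+y = np.array([0, 0, 1, 1])
+print(best_split(X, y))  # feature 0, threshold 6.0, weighted entropy 0.0
+```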