This is my solution for the UC Berkeley Data 100 Project 2: Spam/Ham Classification. The original notebook is from Spring 2022.
In this project, participants were asked to build their own classifier to distinguish spam emails from ham (non-spam) emails. They were only allowed to train logistic regression models: no decision trees, random forests, k-nearest neighbors, neural nets, etc.
Since only logistic regression models were allowed, I focused on feature selection and proposed a feature selection method using conditional probability.
- Accuracy of the final model on the training data: 0.998
- 4-fold cross-validation average accuracy on the training data: 0.977
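The two numbers above can be reproduced with scikit-learn. The sketch below uses a hypothetical word-presence feature matrix (random placeholder data here, not the project's actual features) just to show how training accuracy and 4-fold cross-validation accuracy are computed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: X[i, j] = 1 if the j-th selected word
# appears in the i-th training email, else 0. Random placeholder data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

train_acc = model.score(X, y)                       # accuracy on training data
cv_acc = cross_val_score(model, X, y, cv=4).mean()  # 4-fold CV average accuracy
```

With the real selected-word features in place of the placeholder `X`, these two quantities correspond to the reported 0.998 and 0.977.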
I focus only on finding words with the ability to distinguish spam from ham. Roughly speaking, we need words that occur frequently in spam emails. But a word with a high frequency of occurrence in spam emails may also have a high frequency of occurrence in ham emails, such as a stop word ("the", "a", "and", "in"). So frequency of occurrence alone is not an appropriate metric for our goal.
Note that our goal is to find "spam words" such that,
if the word appears in an email, that email is likely to be spam and unlikely to be ham.
Using conditional probability, this means

$$P(\text{spam} \mid \text{word in email}) \approx 1 \quad \text{and} \quad P(\text{ham} \mid \text{word in email}) \approx 0.$$
Since

$$P(\text{spam} \mid \text{word in email}) + P(\text{ham} \mid \text{word in email}) = 1,$$
the two conditions are equivalent, so it suffices to require

$$P(\text{spam} \mid \text{word in email}) \approx 1.$$
We can see that conditional probability is an appropriate metric for our goal.
Meanwhile, finding "ham words" will also be helpful. Therefore, we need to find words such that

$$P(\text{spam} \mid \text{word in email}) \approx 1$$

or

$$P(\text{ham} \mid \text{word in email}) \approx 1.$$
Let's define a word's "ability to distinguish emails" as

$$d(\text{word}) = \big|\,P(\text{spam} \mid \text{word in email}) - P(\text{ham} \mid \text{word in email})\,\big| = \big|\,2\,P(\text{spam} \mid \text{word in email}) - 1\,\big|.$$
To calculate the conditional probability, we use the formula

$$P(\text{spam} \mid \text{word in email}) = \frac{P(\text{word in email and spam})}{P(\text{word in email})},$$

and estimate the probability with the empirical frequency:

$$P(\text{spam} \mid \text{word in email}) \approx \frac{\#\{\text{spam emails containing the word}\}}{\#\{\text{emails containing the word}\}}.$$
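A minimal sketch of this frequency estimate in pandas, using toy labeled data (the DataFrame, column names, and the fallback value for unseen words are all assumptions for illustration, not the project's actual code):

```python
import pandas as pd

# Toy labeled data (hypothetical): spam=1 marks a spam email, spam=0 a ham email.
emails = pd.DataFrame({
    "text": ["win free money now", "meeting at noon",
             "free prize win", "lunch at noon"],
    "spam": [1, 0, 1, 0],
})

def p_spam_given_word(word, df):
    """Estimate P(spam | word in email) by frequency:
    (# spam emails containing the word) / (# emails containing the word)."""
    contains = df["text"].str.contains(word, regex=False)
    if contains.sum() == 0:
        return 0.5  # assumed fallback: no evidence either way
    return df.loc[contains, "spam"].mean()

def distinguish(word, df):
    """Ability to distinguish emails: |2 * P(spam | word) - 1|."""
    return abs(2 * p_spam_given_word(word, df) - 1)

p_spam_given_word("free", emails)  # -> 1.0, a strong "spam word"
distinguish("noon", emails)        # -> 1.0, "noon" here is a strong "ham word"
```

Ranking the vocabulary by `distinguish` then surfaces the best candidate features, whether they signal spam or ham.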