I developed a flair detector, which predicts the flair of any given post from the r/india subreddit.
The root directory has the following components:
- app.py - The main Flask application, which powers the whole UI
- worker - Folder containing the code snippets for data extraction, model generation and prediction
- worker/dataCollector.py - Extracts the data from the r/india subreddit and stores it into csv_data/data.csv
- worker/modelGenerator.py - Reads the data, performs pre-processing tasks and generates various models using combinations of features. The model with the highest accuracy goes to the joblib dump
- worker/predictor.py - The core prediction logic, run after a post URL is submitted on the main website
- worker/config.py - The credentials used for accessing the Reddit API; can be altered according to user preference (see the sketch after this list)
- worker/util.py - utility functions for pre-processing tasks
- worker/mongoStorage.py - Python script to store data in MongoDB collections
- Procfile - used for initiating the Heroku app
- requirements.txt - Python package requirements, installed on deployment
- .gitignore - files to be ignored by git
- db - contains the MongoDB collections to be accessed by the user
- templates - templates for different webpages, with Jinja2 templating format
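For reference, here is a minimal sketch of what worker/config.py might contain; the variable names are assumptions, not necessarily the repository's exact ones:

```python
# worker/config.py (illustrative sketch; actual variable names may differ)
# Reddit API credentials, obtainable at https://www.reddit.com/prefs/apps
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"
USER_AGENT = "flair-detector by u/your-username"
```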
The entire codebase is written in Python for its ease of use and powerful computational libraries. I developed the web application using the Flask framework, deployed on Heroku behind the gunicorn WSGI server.
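The Procfile ties the two together on Heroku; a minimal sketch, assuming the Flask instance inside app.py is named `app`:

```
web: gunicorn app:app
```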
To run the code locally on your machine, please execute the following commands:
- `git clone https://github.com/dgupta04/Reddit-Flair-Detector` to clone the repository locally
- `pip install -r requirements.txt` to install all the dependencies
- `mongod --dbpath="./db"` to run the existing MongoDB instance
- `mongo` to access all the MongoDB collections
- `flask run` to run the application locally on the assigned port number
Performing ML classification on textual data involves several stages: pre-processing the data, extracting useful features from it, and feeding those features into the classifier that yields the highest accuracy.
Although discrete features like upvotes and timestamps, and binary features like Over-18 and isSelf, are available to us, they are not useful for predicting the flair of a post because of their large variance and outlier values. Thus, the textual features (the comments, the body of the post, and its title) were used as the main features.
The data consists of 1018 unique Reddit posts from the r/india subreddit. The [R]eddiquette flair accounted for a mere 18 posts, while every other flair had 100 posts. Data collection was the longest part of the whole process, taking about an hour for complete retrieval. The data was stored in a CSV file for further use.
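A minimal sketch of the collection step, assuming PRAW and the config values sketched above; the flair list, comment cap, and column layout are illustrative, not the exact contents of worker/dataCollector.py:

```python
# Illustrative data-collection sketch in the spirit of worker/dataCollector.py
import csv

import praw

import config  # CLIENT_ID, CLIENT_SECRET, USER_AGENT as sketched above

FLAIRS = ["AskIndia", "Politics", "Sports"]  # hypothetical subset of r/india flairs

reddit = praw.Reddit(client_id=config.CLIENT_ID,
                     client_secret=config.CLIENT_SECRET,
                     user_agent=config.USER_AGENT)
subreddit = reddit.subreddit("india")

with open("csv_data/data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["flair", "title", "body", "comments", "url"])
    for flair in FLAIRS:
        # Search for posts carrying this flair; 100 posts per flair, as described above
        for post in subreddit.search(f'flair:"{flair}"', limit=100):
            post.comments.replace_more(limit=0)  # flatten the comment tree
            comments = " ".join(c.body for c in post.comments.list()[:10])
            writer.writerow([flair, post.title, post.selftext, comments, post.url])
```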
Pre-processing the data involved removing symbols, punctuation marks and unwanted URLs to generate a cleaner set of words as the feature text. This was done using the re module in Python.
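A minimal sketch of such a cleaner, in the spirit of worker/util.py (the function name is an assumption):

```python
import re

def clean_text(text):
    """Strip URLs, symbols and punctuation, leaving a clean bag of words."""
    text = re.sub(r"http\S+", " ", text)          # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove symbols and punctuation
    text = re.sub(r"\s+", " ", text)              # collapse repeated whitespace
    return text.strip().lower()
```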
The cleaned data is split into training and testing sets. In my setup, the test set consists of 10% of the overall data, i.e. a sizeable 102 test samples.
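In scikit-learn terms this amounts to a single call; `texts` and `flairs` stand for the cleaned feature column and the label column (names and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# 10% of 1018 posts -> 102 held-out test samples
X_train, X_test, y_train, y_test = train_test_split(
    texts, flairs, test_size=0.10, random_state=42)
```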
Raw textual data cannot be used as a feature directly, since machine learning models operate on numbers. For this purpose, scikit-learn offers powerful processing tools such as the TfidfVectorizer and the CountVectorizer; the text was fed into these vectorizers and analysed further.
The TfidfVectorizer was chosen as the vectorizer since it gave better model performance. Its output was then pipelined into different models (a minimal sketch follows the list below). The models used to fit the data include:
- Multinomial Naive Bayes Classifier
- Logistic Regression Classifier
- Multi-Layer Perceptron (MLP) Classifier
- Random Forest Classifier
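A minimal sketch of this vectorizer-plus-classifier pipeline, reusing the split from above; the hyperparameters and the dump filename are assumptions, not the exact settings of worker/modelGenerator.py:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=300),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

best_accuracy, best_model = 0.0, None
for name, clf in classifiers.items():
    # Each model sits behind the same TfidfVectorizer in one pipeline
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.2%}")
    if accuracy > best_accuracy:
        best_accuracy, best_model = accuracy, model

# The best-performing model goes to the joblib dump (filename assumed)
joblib.dump(best_model, "model.joblib")
```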
The results yielded by each of these classifiers, for each of the seven combinations of the three textual features (title, body, comments), are as follows:

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 51.96% |
| Logistic Regression Classifier | 59.80% |
| Multi-Layer Perceptron Classifier | 53.92% |
| Random Forest Classifier | 59.80% |

Using the post body as the only feature:

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 24.51% |
| Logistic Regression Classifier | 24.51% |
| Multi-Layer Perceptron Classifier | 21.57% |
| Random Forest Classifier | 31.37% |

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 65.69% |
| Logistic Regression Classifier | 65.69% |
| Multi-Layer Perceptron Classifier | 57.84% |
| Random Forest Classifier | 67.65% |

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 64.71% |
| Logistic Regression Classifier | 78.43% |
| Multi-Layer Perceptron Classifier | 65.69% |
| Random Forest Classifier | 82.35% |

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 59.80% |
| Logistic Regression Classifier | 69.61% |
| Multi-Layer Perceptron Classifier | 61.76% |
| Random Forest Classifier | 69.61% |

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 71.57% |
| Logistic Regression Classifier | 75.49% |
| Multi-Layer Perceptron Classifier | 57.84% |
| Random Forest Classifier | 77.45% |

Using title + body + comments as the features:

| Classifier | Accuracy |
|---|---|
| Multinomial Naive Bayes | 69.61% |
| Logistic Regression Classifier | 79.41% |
| Multi-Layer Perceptron Classifier | 71.57% |
| Random Forest Classifier | 85.29% |
It was observed that the Random Forest Classifier gave the highest accuracy, around 85%, using (title + body + comments) as the features. Moreover, using the post body as the only feature gave the worst accuracies across all classifiers. Thus, a combination of features was used to predict the flair.
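For completeness, a minimal sketch of how worker/predictor.py can apply the dumped model to a submitted URL, reusing the clean_text helper and config module sketched above (function names and filename are assumptions):

```python
# Illustrative prediction sketch in the spirit of worker/predictor.py
import joblib
import praw

import config
from util import clean_text  # the cleaner sketched earlier (name assumed)

model = joblib.load("model.joblib")

def predict_flair(url):
    """Fetch the post behind `url` and predict its flair with the saved model."""
    reddit = praw.Reddit(client_id=config.CLIENT_ID,
                         client_secret=config.CLIENT_SECRET,
                         user_agent=config.USER_AGENT)
    post = reddit.submission(url=url)
    post.comments.replace_more(limit=0)  # flatten the comment tree
    comments = " ".join(c.body for c in post.comments.list()[:10])
    text = clean_text(f"{post.title} {post.selftext} {comments}")
    return model.predict([text])[0]
```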