Performed sentiment analysis on Google Play Store reviews for XYZ company, categorizing each customer review as 'POSITIVE' or 'NEGATIVE'.
AIM: To perform sentiment analysis on Google Play Store app reviews, classifying them as either "Positive" or "Negative"
DATASET USED: A sample dataset sourced from Kaggle.
TOOLS AND LIBRARIES: This project is made with Python and uses:
- NLTK for text preprocessing
- scikit-learn for ML models (logistic regression, Naive Bayes)
- Pandas for data manipulation
- Seaborn and Matplotlib for data visualization (making confusion matrices)
The dataset had two useful columns: one with the user reviews and one with a score for each review on a scale of 1 to 5, where:
- 1 = Very Negative
- 2 = Negative
- 3 = Neutral
- 4 = Positive
- 5 = Very Positive
Text Cleaning: Used NLTK for:
- Tokenization
- Stop word removal
- Lemmatization
Label Assignment:
- Scores of 1 and 2 are labeled as negative
- Scores of 4 and 5 are labeled as positive
- Neutral scores (3) are removed from the dataset
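A small pandas sketch of the labeling rule described above. The column names `content` and `score` and the sample rows are hypothetical; the real Kaggle file may use different names.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset.
df = pd.DataFrame(
    {
        "content": ["terrible", "bad", "okay", "good", "love it"],
        "score": [1, 2, 3, 4, 5],
    }
)

# Drop neutral reviews, then map the remaining scores to two classes.
df = df[df["score"] != 3].copy()
df["label"] = df["score"].map(lambda s: "POSITIVE" if s >= 4 else "NEGATIVE")

print(df[["score", "label"]])
```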
TF-IDF (term frequency-inverse document frequency):
Used TfidfVectorizer to convert the cleaned text into numerical features suitable for machine learning models.
Limited the feature set to 6000 terms to keep computation efficient and reduce the risk of overfitting.
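As a rough illustration of this step, the snippet below vectorizes a few made-up cleaned reviews. It assumes the 6000-term limit was set through the `max_features` parameter of `TfidfVectorizer`; the sample texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews.
texts = [
    "app crash every time open",
    "love app easy use",
    "update broke login screen",
    "great feature fast support",
]

# Assumption: the 6000-term cap is applied via max_features.
vectorizer = TfidfVectorizer(max_features=6000)
X = vectorizer.fit_transform(texts)  # sparse matrix: one row per review

print(X.shape)
```

Each row of `X` holds the TF-IDF weights for one review, and the number of columns never exceeds the cap.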
- Initially implemented logistic regression
- Accuracy achieved: 87%
- Pros: Simple and easy to interpret, excellent for binary classification
- Cons: Assumes a linear relationship between the features and the log-odds of the class, and is most useful when datasets are small to medium-sized.
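A minimal sketch of training and scoring logistic regression on TF-IDF features. The toy reviews, the 75/25 split, `random_state`, and `max_iter` are illustrative assumptions, so the accuracy printed here says nothing about the 87% reported for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews and labels.
texts = [
    "love app easy use", "great feature fast support",
    "best app ever", "smooth experience nice design",
    "app crash every time", "update broke login",
    "terrible slow useless", "waste time many ad",
]
labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorizer and classifier chained so the vocabulary is fit on training data only.
model = make_pipeline(
    TfidfVectorizer(max_features=6000),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"accuracy: {accuracy:.2f}")
```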
- Decided to implement Naive Bayes to compare accuracy
- Achieved accuracy of 85%
- Pros: Simple and effective for text processing
- Cons: Assumes words are conditionally independent of each other given the class, hence ‘naive’
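For comparison, a Naive Bayes classifier can be fit on TF-IDF features in much the same way. The source does not say which variant was used; `MultinomialNB` is assumed here because it is a common choice for text, and the toy data is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews and labels.
train_texts = [
    "love app easy use", "great feature fast support",
    "app crash every time", "terrible slow useless",
]
train_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Assumption: multinomial variant of Naive Bayes.
nb_model = make_pipeline(TfidfVectorizer(max_features=6000), MultinomialNB())
nb_model.fit(train_texts, train_labels)

# Class probabilities for an unseen review, in the order of nb_model.classes_.
new_review = ["app slow crash"]
prediction = nb_model.predict(new_review)[0]
probabilities = nb_model.predict_proba(new_review)[0]
print(prediction, dict(zip(nb_model.classes_, probabilities.round(2))))
```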
Logistic regression was the best-performing model with an accuracy of 87%; Naive Bayes came close at 85%.
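Accuracy alone hides which class a model gets wrong, which is where a confusion matrix helps. The sketch below plots one with Seaborn and Matplotlib; the `y_true` and `y_pred` lists are made-up values for illustration, not the project's results.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

class_names = ["NEGATIVE", "POSITIVE"]

# Illustrative values only; replace with real test labels and predictions.
y_true = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
y_pred = ["NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
fig.tight_layout()
# Call plt.show() in an interactive session to display the figure.
```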
Future work: use more advanced models and explore word embeddings.