We aim to predict the sentiment, 'positive' or 'negative', of each review in the IMDB Dataset of 50K Movie Reviews from Kaggle.
We try three approaches on the 20% of the data (10k reviews) that was set aside as the test set:
- Two-shot LLM evaluation with Qwen2.5: we prompt the LLM with two labeled examples, one positive and one negative, and then ask it to label the given text. This gives our second-best F1-Score = 0.86.
- TF-IDF: we use term-frequency-inverse-document-frequency features computed from the training set, train a classifier on them, and then apply it to the test data. This gives our highest F1-Score = 0.89.
- NLTK Sentiment Analysis: finally, we use the off-the-shelf sentiment analyzer from the nltk package in Python to predict the sentiment of each text. This approach has the lowest performance, with an F1-Score of 0.67.
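As a concrete illustration of the two-shot setup, the sketch below builds the prompt and maps the model's free-text reply back onto a label. The example reviews, prompt wording, and parsing rule are our own placeholders, not the ones used in the experiment, and the call to Qwen2.5 itself is omitted:

```python
# Hypothetical two-shot examples; the real experiment's examples are not
# given in the report.
EXAMPLES = [
    ("I loved every minute of this film.", "positive"),
    ("A dull, predictable mess.", "negative"),
]

def build_prompt(review):
    """Assemble a two-shot classification prompt for the LLM."""
    lines = ["Classify each movie review as 'positive' or 'negative'.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {review}")
    lines.append("Sentiment:")
    return "\n".join(lines)

def parse_label(completion):
    """Map the model's completion onto one of the two labels."""
    return "positive" if "positive" in completion.lower() else "negative"
```

The prompt string would then be sent to Qwen2.5 through whatever inference API is in use, and `parse_label` applied to the returned text.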
Details of the models' performances are provided below:
Classification Report (two-shot Qwen2.5):

              precision    recall  f1-score   support

           0       0.83      0.93      0.88      4961
           1       0.92      0.82      0.86      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.88      0.87      0.87     10000

Confusion Matrix:
[[4592  369]
 [ 927 4112]]

F1-Score = 0.86
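As a sanity check, the per-class F1 scores above can be recomputed directly from the confusion matrix (rows are true labels, columns are predicted labels, the usual scikit-learn layout); a small sketch:

```python
def f1_from_confusion(cm, positive=1):
    """Recompute F1 for one class from a 2x2 confusion matrix whose
    rows are true labels and columns are predicted labels."""
    tp = cm[positive][positive]
    fp = sum(cm[r][positive] for r in range(2)) - tp  # predicted pos, wrong
    fn = sum(cm[positive]) - tp                       # true pos, missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

cm = [[4592, 369], [927, 4112]]              # two-shot Qwen2.5 run
f1_pos = f1_from_confusion(cm, positive=1)   # ≈ 0.86, matching the report
```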
Classification Report (TF-IDF):

              precision    recall  f1-score   support

           0       0.90      0.87      0.89      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

Confusion Matrix:
[[4332  629]
 [ 487 4552]]

F1-Score = 0.89
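The TF-IDF features behind this result can be illustrated with a minimal pure-Python sketch. The experiment presumably used a library vectorizer; the classic formulation below (tf = term count / document length, idf = ln(N / document frequency)) is one common variant of the weighting, not necessarily the exact one used:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, using the classic
    formulation: tf = count / doc length, idf = ln(N / df)."""
    docs = [text.lower().split() for text in corpus]
    n = len(docs)
    df = Counter()                 # number of docs each term occurs in
    for toks in docs:
        df.update(set(toks))
    weights = []
    for toks in docs:
        counts = Counter(toks)
        weights.append({term: (c / len(toks)) * math.log(n / df[term])
                        for term, c in counts.items()})
    return weights

# Terms appearing in every review get weight 0; rarer terms score higher.
scores = tfidf(["good movie", "bad movie", "dull movie"])
```

A classifier (e.g. logistic regression) is then trained on these feature vectors; scikit-learn's `TfidfVectorizer` implements a smoothed variant of the same idea.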
Classification Report (NLTK Sentiment Analysis):

              precision    recall  f1-score   support

           0       0.78      0.01      0.01      4961
           1       0.50      1.00      0.67      5039

    accuracy                           0.51     10000
   macro avg       0.64      0.50      0.34     10000
weighted avg       0.64      0.51      0.34     10000

Confusion Matrix:
[[  29 4932]
 [   8 5031]]

F1-Score = 0.67
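The NLTK approach relies on a pretrained, lexicon-based analyzer (likely `nltk.sentiment.SentimentIntensityAnalyzer`, i.e. VADER); it needs no training, but as the confusion matrix shows, it labels nearly every review positive on this data. A toy pure-Python version of the lexicon-scoring idea, with a made-up four-word lexicon rather than VADER's real one:

```python
# Toy lexicon illustrating the approach; VADER's actual lexicon contains
# thousands of scored entries plus rules for negation and intensifiers.
LEXICON = {"good": 1.0, "great": 1.5, "bad": -1.0, "awful": -1.5}

def predict_sentiment(text):
    """Sum word-level polarity scores and threshold at zero."""
    score = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return "positive" if score >= 0 else "negative"
```

Note that texts containing no lexicon words score 0 and default to "positive", which mirrors the strong positive bias observed in the report.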