This project aims to build a sentiment analysis pipeline that classifies tweets as positive 😊, negative 😠, or neutral 😐. Currently, the project uses a pre-existing dataset for model training and evaluation. The next stage will enhance the project to analyze real-time tweets fetched using the Twitter Developer API.
The project involves key Natural Language Processing (NLP) techniques like cleaning, tokenization, stopword removal, and stemming, followed by training machine learning models such as Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.
- ✅ Current Stage:
  - Model training, evaluation, and testing completed using a pre-existing dataset.
  - Preliminary visualizations (e.g., word clouds, confusion matrices) completed.
- 🔧 Next Steps:
  - Integration with the Twitter Developer API for real-time data 🕒.
  - Deploying the project as a web application using Flask or Streamlit for user interaction.
  - Continuous model improvement with live data.
- 🔄 Data Preprocessing:
  - Cleaning tweets by removing URLs 🌐, punctuation ❗, numbers 🔢, and special characters.
  - Tokenization and stopword removal for feature extraction 🛠️.
  - Text normalization using stemming 🔍.
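For illustration, the cleaning, tokenization, stopword-removal, and stemming steps above can be sketched with only the Python standard library. The actual pipeline uses NLTK's tokenizer, stopword list, and stemmer; the tiny stopword set and suffix-stripping rule below are simplified stand-ins:

```python
import re
import string

# Tiny illustrative stopword list; the project uses NLTK's full English list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "this"}

def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, numbers, punctuation, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)                   # @mentions
    text = re.sub(r"\d+", "", text)                    # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text: str) -> list[str]:
    """Clean, tokenize, drop stopwords, then apply a naive suffix-stripping 'stem'."""
    tokens = [t for t in clean_tweet(text).split() if t not in STOPWORDS]
    # Crude stand-in for NLTK's PorterStemmer:
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

print(preprocess("This movie is amazing!! Loved it :) https://t.co/xyz 100%"))
# → ['movie', 'amaz', 'lov']
```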
- 📊 Visualization:
  - Word clouds for positive 💬 and negative 🔴 sentiments.
  - Confusion matrices for model evaluation.
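A confusion matrix for the three sentiment classes can be computed with scikit-learn (and then plotted, e.g. with `ConfusionMatrixDisplay` or Seaborn's heatmap). The true and predicted labels below are hypothetical placeholders for real model output:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted sentiment labels, for illustration only.
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true labels, columns are predictions, in the order given by `labels`.
print(cm)
```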
- 🧠 Model Training:
  - Supports multiple machine learning models:
    - 🟡 Naive Bayes
    - 🟢 Support Vector Machines (SVM)
    - 🔵 Logistic Regression
  - Evaluation metrics include accuracy ✅, precision 🎯, recall 🔁, F1-score 🏆, and ROC-AUC curves 📈.
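A minimal sketch of training and comparing the three model types with scikit-learn; the toy texts and labels are illustrative stand-ins for the real tweet dataset, and a TF-IDF vectorizer is assumed as the feature extractor:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data for illustration; the project trains on a full tweet dataset.
texts = ["i love this", "great movie", "so happy today",
         "i hate this", "terrible movie", "so sad today"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(),
}

for name, clf in models.items():
    # Vectorize text and fit the classifier in one pipeline.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    print(name, pipe.predict(["what a great day"]))
```

On a real dataset, each fitted pipeline would then be scored with `sklearn.metrics` (accuracy, precision, recall, F1, ROC-AUC) on a held-out test split.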
- 🌟 Future Plans:
  - Real-time tweet collection and analysis.
  - Interactive web-based interface for sentiment prediction.
- Languages and Libraries:
  - Python 🐍, NumPy, pandas, scikit-learn, Matplotlib, Seaborn, NLTK, WordCloud
- Visualization Tools:
  - Confusion Matrix, WordCloud, ROC Curve 📉
- Future Integration:
  - Twitter Developer API for live tweet collection 🐦
  - Flask/Streamlit for web deployment 🌐
- Install Python 3.x 🐍
- Install the required dependencies with `pip install -r requirements.txt`.
  (The `requirements.txt` file includes dependencies like scikit-learn, NLTK, WordCloud, etc.)
- 🗂️ Clone the Repository:
  - `git clone https://github.com/shreyadata804/twitter-sentiment-analysis.git`
  - `cd twitter-sentiment-analysis`
- 📂 Data Preparation:
  - Pre-existing Dataset:
    - Place the dataset in the `/data` folder.
    - Run the notebook or script to clean, preprocess, and train models.
  - Real-Time Tweets (Future Plan):
    - Obtain API keys from the Twitter Developer Platform.
    - Configure the `tweepy` library to fetch live tweets.
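A sketch of the planned `tweepy` configuration, assuming Twitter API v2 access; the bearer token is a placeholder that must be replaced with real credentials from the Developer Platform, so this fragment cannot run as-is:

```python
import tweepy

# Placeholder credential; obtain a real bearer token from the
# Twitter Developer Platform before running.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Fetch recent English-language tweets matching a query, excluding retweets.
response = client.search_recent_tweets(
    query="python lang:en -is:retweet", max_results=10
)
for tweet in response.data or []:
    print(tweet.text)
```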
- 🚀 Run the Notebook: Open and execute the Jupyter Notebook to preprocess data, train the model, and evaluate results.
- Analyzing Preloaded Dataset:
  - Run the notebook to process the dataset and generate predictions.
- Future Use Case with Real-Time Tweets:
  - Fetch live tweets using the Twitter Developer API.
  - Pass the tweets through the preprocessing and model pipeline.
  - Obtain sentiment predictions.
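The flow above might look like the following sketch, where hard-coded strings stand in for tweets fetched from the API and a toy scikit-learn pipeline stands in for the trained model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in for a pipeline fitted on the full training dataset.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(
    ["love it", "awesome product", "hate it", "awful product"],
    ["positive", "positive", "negative", "negative"],
)

# Stand-ins for tweets fetched via the Twitter Developer API.
live_tweets = ["this is awesome", "what an awful experience"]

# Vectorization and classification happen inside the pipeline.
predictions = pipeline.predict(live_tweets)
print(list(zip(live_tweets, predictions)))
```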
- ✅ Completed:
  - Training and evaluating models using a pre-existing dataset.
  - Initial visualizations and exploratory data analysis.
- 🔧 Ongoing:
  - Integration with the Twitter Developer API for real-time data.
  - Deploying the model via Flask/Streamlit.
- 🚀 Future Enhancements:
  - Model optimization for better accuracy with large-scale real-time data.
  - Addition of advanced NLP techniques like BERT for sentiment classification.
- Sentiment140 Dataset for initial analysis.
- Python libraries like NLTK and Scikit-learn for NLP and machine learning tasks.
- Future reliance on Twitter Developer API for live data integration.
Contributions are welcome! Feel free to fork the project and submit pull requests.
For questions or suggestions, contact:
- Shreya Gupta: [[email protected]]