
Merge tweetclean to main #18

Open · wants to merge 87 commits into `main` from `tweetclean`

Changes from 1 commit (of 87)
b530c66
- adds main function to most scripts
TobiObeck Oct 6, 2021
7b953d6
adds multiple evaluation metrics for classifier
TobiObeck Oct 6, 2021
de3084f
Merge branch 'main' into add-evaluation-metrics
TobiObeck Oct 6, 2021
86a9eb1
separates loading of data sets into setup.sh script
TobiObeck Oct 6, 2021
929db48
Merge branch 'main' of https://github.com/lbechberger/MLinPractice
TobiObeck Oct 7, 2021
ae3c345
Update util.py
pariyashu Oct 7, 2021
0daf32a
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 7, 2021
a7a61aa
Merge commit 'a7c7fdb9a3ff5a5af9677b5808279c5f8b018662'
TobiObeck Oct 7, 2021
ee9f4c0
adds parsing of tokenized tweets example
TobiObeck Oct 7, 2021
fde0818
adds func to document tests
TobiObeck Oct 7, 2021
a9ce4de
moves tests into respective code folder
TobiObeck Oct 7, 2021
f37740f
adds documentation how to run tests
TobiObeck Oct 7, 2021
303ba92
test connection
pariyashu Oct 9, 2021
7c873ac
remove unnecessary comment
pariyashu Oct 9, 2021
d0cb537
adds counting of mentions & removal of orig. column
TobiObeck Oct 10, 2021
39b1bdc
minor cleanup
TobiObeck Oct 10, 2021
6483c10
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 10, 2021
8988ef6
adds MentionsCounter Preprocessor
TobiObeck Oct 10, 2021
8e92246
minor cleanup
TobiObeck Oct 10, 2021
cf0c340
test connection
pariyashu Oct 9, 2021
ca148df
remove unnecessary comment
pariyashu Oct 9, 2021
5e03e47
Merge pull request #1 from TobiObeck/mentions-count-col
pariyashu Oct 10, 2021
c45a142
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
pariyashu Oct 10, 2021
8777d37
filter language
pariyashu Oct 10, 2021
240dc9a
drop columns inc eng
pariyashu Oct 10, 2021
9ea4c4e
implements column remover as proper preprocessor
TobiObeck Oct 12, 2021
bebf224
Merge branch 'main' of https://github.com/lbechberger/MLinPractice in…
TobiObeck Oct 12, 2021
62b0de2
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
6e19a6d
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
fba4d6b
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 13, 2021
6fdf847
renames folder code -> src
TobiObeck Oct 19, 2021
eaf0117
adds test for counting feature
TobiObeck Oct 19, 2021
33c42bc
cleanup
TobiObeck Oct 19, 2021
d584460
minor changes
TobiObeck Oct 19, 2021
a821f94
renames folder code -> src
TobiObeck Oct 19, 2021
00c91be
stores mlflow and pickle data
TobiObeck Oct 19, 2021
39b3875
Merge branch 'temp-pull-grid-from-bech'
TobiObeck Oct 19, 2021
a07f531
separates examples into corresponding files
TobiObeck Oct 19, 2021
6c860e8
adds randomforest classifier
TobiObeck Oct 22, 2021
2227e83
disables dimensionality reduction
TobiObeck Oct 22, 2021
46d98b4
adds a classification run for all classifiers
TobiObeck Oct 22, 2021
036ec09
adds shebang line for bash scripts
TobiObeck Oct 24, 2021
e81c3f3
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
TobiObeck Oct 24, 2021
b372ca5
mention counter
pariyashu Oct 26, 2021
bf3bf1e
Update run_preprocessing.py
pariyashu Oct 26, 2021
cc0e100
Revert "mention counter"
TobiObeck Oct 27, 2021
a29b5bd
Revert "Update run_preprocessing.py"
TobiObeck Oct 27, 2021
9212938
adds visualization showing language distribution
TobiObeck Nov 1, 2021
f31982e
adds a test for evaluation metrics
TobiObeck Nov 2, 2021
070c757
minor cleanup
TobiObeck Nov 2, 2021
efef989
adds sentiment analysis feature (WIP)
TobiObeck Nov 2, 2021
a73e745
properly implements sentiment analysis
TobiObeck Nov 3, 2021
b1b5ac8
allows naming of classific. runs in mlflow logs
TobiObeck Nov 3, 2021
189fb8f
adds classific. run after implemented sentiment feature
TobiObeck Nov 3, 2021
80b9599
adds more count features
TobiObeck Nov 3, 2021
43d1cca
adds classific. run after adding more count features
TobiObeck Nov 3, 2021
f2cc2aa
adds grid search for optimal hyperparameters
TobiObeck Nov 6, 2021
303ff3f
adds folder for documentation
TobiObeck Nov 7, 2021
1fcafa8
adds docs introduction
TobiObeck Nov 7, 2021
fa5af65
docs: slight improvement of introduction
TobiObeck Nov 7, 2021
e2afa28
docs: adds evaluation (WIP)
TobiObeck Nov 7, 2021
5caf5bb
finishes evaluation
TobiObeck Nov 9, 2021
d23a285
adds preprocessing
TobiObeck Nov 10, 2021
6e93071
docs: improves preprocessing
TobiObeck Nov 11, 2021
2f9b373
docs: adds ml flow results screenshot
TobiObeck Nov 11, 2021
894f244
adds sklearn grid search for random forest classifier
TobiObeck Nov 12, 2021
56afea3
docs: adds counting feature extractor
TobiObeck Nov 12, 2021
0e68b30
docs: adds sentiment feature extractor
TobiObeck Nov 12, 2021
88a0b9b
docs: revises sentiment feature extraction
TobiObeck Nov 13, 2021
6b96d83
docs: fixes image link
TobiObeck Nov 13, 2021
30cdd50
docs: adds random forest classifier motivation
TobiObeck Nov 13, 2021
0eb2a41
docs: adds Results and more features
TobiObeck Nov 13, 2021
260fa13
docs: adds Results and for even more features
TobiObeck Nov 14, 2021
cac898d
docs: adds beginning of Hyperparameter Optimization
TobiObeck Nov 14, 2021
c12b6bf
minor cleanup
TobiObeck Nov 14, 2021
96d177f
docs: rest of hyper param optim and conclusion
TobiObeck Nov 14, 2021
7cd2735
docs: revises some sentences across whole text
TobiObeck Nov 14, 2021
027882a
fixes GridSearchCV
TobiObeck Nov 15, 2021
3c12159
Merge branch 'optimization'
TobiObeck Nov 15, 2021
83e4ab7
Merge branch 'documentation-2'
TobiObeck Nov 15, 2021
a9f3518
docs: updates readme according to the code
TobiObeck Nov 15, 2021
b4c34a3
docs: refines classification flags
TobiObeck Nov 15, 2021
507d027
docs: specifies 5 scenarios for classification.sh
TobiObeck Nov 15, 2021
da77c62
docs: fixes grammar, punctuation and typos
TobiObeck Nov 15, 2021
1cd1d8e
docs: fixes more grammar, punctuation and typos
TobiObeck Nov 15, 2021
f283e8d
moves preprocessors and feature extractors in sub-folder
TobiObeck Nov 15, 2021
5378948
tiny cleanup
TobiObeck Nov 15, 2021
finishes evaluation
TobiObeck committed Nov 9, 2021
commit 5caf5bbab05ac1de48cffd1497c6d699a8783610
22 changes: 7 additions & 15 deletions docs/Documentation.md
# Documentation - [Patoali](https://trello.com/b/3pj6SkWa)

This document presents the author's work on the 'Machine Learning in Practice' project, which took place during the summer term 2021 as a block seminar at Osnabrück University. The given task was to analyze a data set of data science related tweets and to predict whether a tweet will go viral by applying machine learning techniques. A tweet is defined as viral if the sum of its likes and retweets exceeds the arbitrary threshold of 50. The data set _Data Science Tweets 2010-2021_ contains _data science_, _data analysis_ and _data visualization_ tweets from verified Twitter accounts from 2010 until 2021. It was collected and [shared on kaggle.com](https://www.kaggle.com/ruchi798/data-science-tweets) by Ruchi Bhatia.

The lecturer Lucas Bechberger provided his students with a foundational codebase that makes heavy use of the Python library scikit-learn. The codebase consists of multiple Python (`.py`) and bash (`.sh`) scripts that resemble a basic pipeline of the processing steps _preprocessing_, _feature extraction_, _dimensionality reduction_ and _classification_, which is common for machine learning projects. The shell scripts invoke the Python scripts with a particular set of command line arguments and can be used to run either the entire pipeline or only individual steps to save time. The results of the pipeline steps are stored in `.pickle` files so that they can be reused in a separate application, which offers a rudimentary read–eval–print loop to predict the virality of a tweet the user inputs. The students' task was to understand the codebase and to extend or replace the given placeholder implementations with proper solutions in order to improve and measure the virality prediction.
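
The hand-off between the pipeline steps might look roughly like the following sketch; the file name and the `pandas` usage are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of the pickle hand-off between two pipeline steps.
import pickle
import pandas as pd

# One step (e.g. preprocessing) writes its intermediate result to disk...
df = pd.DataFrame({"tweet": ["some tweet text"], "label": [False]})
with open("preprocessed.pickle", "wb") as f:
    pickle.dump(df, f)

# ...and the next step (e.g. feature extraction) picks it up from there.
with open("preprocessed.pickle", "rb") as f:
    df = pickle.load(f)
```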

## Evaluation

Before taking a look at the implemented metrics used to judge the prediction performance of various models, some specifics of the data set at hand need to be considered. The raw data consists of the three `.csv` files _data science_, _data analysis_ and _data visualization_. In a first preprocessing step they are concatenated to form one big data set. Next, the data is labeled as viral or non-viral according to the threshold rule mentioned above. The resulting data set consists of 295,811 tweet records with a distribution of 90.82% non-viral and 9.18% viral tweets. Such an uneven distribution of label classes is often referred to as an imbalanced data set. This fact has to be taken into account when comparing the results of baselines and classifiers and when selecting suitable metrics.
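
A sketch of the concatenation and labelling step might look as follows; the file and column names are assumptions for illustration, not necessarily those of the actual data set.

```python
import pandas as pd

# Append the three raw CSV files into one big data set.
tweets = pd.concat(
    [pd.read_csv(f) for f in
     ("data_science.csv", "data_analysis.csv", "data_visualization.csv")],
    ignore_index=True,
)

# A tweet is viral iff likes + retweets exceed the threshold of 50.
tweets["label"] = (tweets["likes_count"] + tweets["retweets_count"]) > 50
print(tweets["label"].value_counts(normalize=True))  # ~0.908 False / ~0.092 True
```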

![Baseline performance of the DummyClassifier for all implemented metrics](imgs/baselines_2021-11-03_231550.png)
<p align="center">Fig. 1: Performance of the sklearn DummyClassifier with the strategies 'stratified' and 'most_frequent' on the training and validation data sets for all implemented metrics.</p>

For the baselines, a `DummyClassifier` from the sklearn package was used with the `strategy` values `most_frequent` and `stratified`. The former identifies non-viral tweets as the most frequent class and therefore predicts every sample as non-viral. Fig. 1 shows that this rather dumb prediction strategy results in a high accuracy of 90.6%. This is because the accuracy metric measures how many predictions are correct. Since the data set contains mostly non-viral tweets, the prediction is correct most of the time, with a percentage similar to the data set's class distribution. The slight difference can be explained by the removal of some samples during preprocessing.
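
A minimal, self-contained sketch of such a majority-class baseline; the toy data below merely stands in for the real features and labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy stand-in for the tweet features and labels, ~9% positive like the real set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.09).astype(int)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline.fit(X, y)
print(baseline.score(X, y))  # accuracy around 0.91 although nothing was learned
```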

The `stratified` strategy makes predictions that respect the training set's class distribution. Again, the accuracy is high, at 83.2% on the validation set. In both observations the accuracy metric scores well on mere baselines, indicating that it is not useful for this imbalanced data set and can therefore be dismissed entirely. The other metrics _Precision_, _Recall_, _F1-Score_, _Cohen's Kappa_ and _Jaccard Score_ are not zero this time, but still have very low values roughly between 0 and 0.1. Some considerations about these other metrics are discussed in the following paragraphs.
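
All of these metrics are available in scikit-learn. The following sketch computes them on made-up predictions; it mirrors the library's API but is not the project's actual evaluation code.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             jaccard_score, precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # mostly non-viral, like the data set
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]   # one hit, one miss, one false alarm

for name, metric in [("accuracy", accuracy_score), ("precision", precision_score),
                     ("recall", recall_score), ("f1", f1_score),
                     ("kappa", cohen_kappa_score), ("jaccard", jaccard_score)]:
    print(f"{name:9s} {metric(y_true, y_pred):.3f}")
```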

When selecting metrics, the use case should be taken into account. An average Twitter user would expect that most tweets they send will not go viral. When such a user types a potential tweet into our application to find out whether it will go viral, it is important that a tweet which would go viral is detected as such. This is captured by the recall metric, which asks the question _"How many of the true positives did I catch?"_. On the other hand, it would be annoying if the application were not critical enough and classified a lot of tweets as viral that don't go viral in practice. Such a high rate of false positives is captured by the precision metric, which asks _"How many of the positively classified samples are actually positive?"_. Therefore, both recall and precision are good metrics for the use case.
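
In confusion-matrix terms the two questions reduce to simple ratios, as the following toy computation shows; the counts are made up for illustration, not real results.

```python
# Illustrative confusion-matrix counts for ~1000 tweets.
tp, fp, fn, tn = 40, 10, 60, 890

recall = tp / (tp + fn)      # share of truly viral tweets that were caught   -> 0.4
precision = tp / (tp + fp)   # share of predicted-viral tweets that are viral -> 0.8
print(recall, precision)
```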

Since the F1-Score combines recall and precision into a single score (their harmonic mean), it is a practical approach to ignore the former two and instead focus on the F1-Score alone. Furthermore, Cohen's Kappa is a good candidate for an imbalanced data set. Its calculation is based on the accuracy, but adjusted by the probability of random agreement, which makes it a more robust measure than a simple percent-agreement calculation. In addition, the Jaccard Score leaves out true negatives in its calculation. Since these can be expected to be the most frequent entry in the confusion matrix, the Jaccard Score is also well-suited for this data set. All in all, the metrics _F1-Score_, _Cohen's Kappa_ and _Jaccard Score_ are used to judge the models' prediction performance by comparing their scores to the scores of the chosen baselines.
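
Written out with the same kind of made-up counts, the three retained scores look like this; a sketch of the underlying formulas, not the project's implementation.

```python
tp, fp, fn, tn = 40, 10, 60, 890                     # illustrative counts
precision, recall = tp / (tp + fp), tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of both
jaccard = tp / (tp + fp + fn)                        # true negatives ignored entirely
total = tp + fp + fn + tn
p_o = (tp + tn) / total                              # observed agreement (= accuracy)
p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)                      # accuracy corrected for chance
print(f1, jaccard, kappa)
```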

## Preprocessing