h #12

Open
wants to merge 145 commits into
base: main
aa7eb11
Update .gitignore to exclude OSX specific files
dhesenkamp Oct 4, 2021
f6e4072
Merge remote-tracking branch 'upstream/main' into main
dhesenkamp Oct 5, 2021
00d49da
Merge branch 'lbechberger:main' into main
imartirosov Oct 5, 2021
beee080
Added uniform classifier
dhesenkamp Oct 5, 2021
d806664
Added F1 score evaluation metric
dhesenkamp Oct 5, 2021
8c1addb
Merge pull request #1 from dhesenkamp/classifier
dhesenkamp Oct 6, 2021
5d35082
Added tweet tokenization
dhesenkamp Oct 6, 2021
4b43523
Update preprocessing.sh
dhesenkamp Oct 6, 2021
5f186cc
Create Documentation.md
dhesenkamp Oct 6, 2021
537acf7
Merge pull request #2 from dhesenkamp/tokenizer
dhesenkamp Oct 6, 2021
ec5eeeb
Modified punctuation_remover.py
dhesenkamp Oct 7, 2021
841745f
Merge pull request #4 from dhesenkamp/tokenizer
dhesenkamp Oct 7, 2021
473af68
Revert "Merge pull request #1 from dhesenkamp/classifier"
dhesenkamp Oct 7, 2021
f7e9e15
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
7dfd46a
Resolve merge conflict
dhesenkamp Oct 7, 2021
2f6b559
Resolve merge conflict
dhesenkamp Oct 7, 2021
b92fea9
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 7, 2021
a377bcc
Update Documentation.md
dhesenkamp Oct 7, 2021
189843b
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
69fbf07
Testing of tokenize_input
dhesenkamp Oct 7, 2021
5d4f975
Added stopword remover
dhesenkamp Oct 7, 2021
d577dc3
Refined stopword remover
dhesenkamp Oct 7, 2021
50c422f
Update stopword_remover.py
dhesenkamp Oct 7, 2021
126812d
Further refining of stopword remover
dhesenkamp Oct 8, 2021
e8d5b86
StopwordRemover(), minor changes
dhesenkamp Oct 8, 2021
45e049b
Short info on Cohen's kappa
dhesenkamp Oct 8, 2021
f8f9ef3
Merge pull request #6 from dhesenkamp/stop_word_removal
dhesenkamp Oct 8, 2021
f83dc31
Added Lemmatizer() class
dhesenkamp Oct 8, 2021
ca27622
Added command line arguments etc
dhesenkamp Oct 8, 2021
2e25f66
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 8, 2021
f46090b
Merge pull request #7 from dhesenkamp/lemmatizer
dhesenkamp Oct 11, 2021
4a4f00f
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 11, 2021
0aa2c88
Update README.md
dhesenkamp Oct 11, 2021
71c4ef9
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 11, 2021
263e39d
Added feature extraction for month
dhesenkamp Oct 12, 2021
84dafa0
Command line args for month extractor
dhesenkamp Oct 12, 2021
1e849de
Trying to resolve merge conflict manually
dhesenkamp Oct 12, 2021
179d236
Merge pull request #8 from dhesenkamp/feature_month
dhesenkamp Oct 12, 2021
04c515a
Merge conflict, readme, documentation
dhesenkamp Oct 12, 2021
c794c7a
Added SentimentAnalyzer class
dhesenkamp Oct 13, 2021
f896c03
SentimentAnalyzer() command line args + script
dhesenkamp Oct 13, 2021
41e6385
readme and documentation for SentimentAnalyzer
dhesenkamp Oct 13, 2021
4fc38ec
Merge feature_sentiment into main
dhesenkamp Oct 13, 2021
aa41d1b
Update readme.md wrt SentimentAnalyser
dhesenkamp Oct 13, 2021
18fd5df
mlflow added to README.md
dhesenkamp Oct 13, 2021
7c59805
Added decision tree classifier
dhesenkamp Oct 13, 2021
5c67318
Param optimization
dhesenkamp Oct 14, 2021
98e646f
Update .gitignore
dhesenkamp Oct 14, 2021
4704f20
Removed .DS_Store
dhesenkamp Oct 14, 2021
f99d6bf
Classifier testing
dhesenkamp Oct 15, 2021
259d0dd
Update .gitignore
dhesenkamp Oct 15, 2021
edd8a82
Added SVM classifier
dhesenkamp Oct 19, 2021
02e9982
Update .gitignore to exlcude mlruns subfolder
dhesenkamp Oct 19, 2021
a8e4099
SVM classifier testing
dhesenkamp Oct 20, 2021
f670e81
Merge pull request for classifier_svm
dhesenkamp Oct 20, 2021
88ffa97
Implemented Photos feature extractor
dhesenkamp Oct 21, 2021
0c6a790
Command line arguments for Photos feature extractor
dhesenkamp Oct 21, 2021
17181d9
Troubleshooting & testing
dhesenkamp Oct 21, 2021
9e61fc4
Testing complete for Photos() feature extractor
dhesenkamp Oct 21, 2021
b1c2e6d
Merge pull request feature_photos
dhesenkamp Oct 21, 2021
065092e
Created & implemented Mentions() feature extractor
dhesenkamp Oct 21, 2021
bfc71e4
Command line args for Mentions() feature extractor
dhesenkamp Oct 21, 2021
bb41b84
Mentions() feature extractor testing
dhesenkamp Oct 21, 2021
690c621
Merge pull request feature_mention
dhesenkamp Oct 21, 2021
bd37301
Merge conflict - manual resolve
dhesenkamp Oct 21, 2021
92fca4b
Update feature_extraction.sh
dhesenkamp Oct 21, 2021
fe60d7e
Manually resolved merge conflict of previous pull request from featur…
dhesenkamp Oct 21, 2021
4aa460e
URL() feature extractor testing
dhesenkamp Oct 21, 2021
31d794c
Variable renaming
dhesenkamp Oct 21, 2021
da54a04
Feature extraction script testing
dhesenkamp Oct 21, 2021
3badd5b
Update stopword_remover.py
dhesenkamp Oct 21, 2021
1b912ba
Updated svm classifier
dhesenkamp Oct 21, 2021
3c61ef7
Pipeline testing
dhesenkamp Oct 21, 2021
2083d71
Created retweets.py, Update extract_features.py
Yannik101010 Oct 21, 2021
f984812
Update examples.py, feature_extraction.py, feature_extraction.sh
Yannik101010 Oct 21, 2021
6890fdb
Create replies.py
Yannik101010 Oct 21, 2021
900e9fb
Update util.py, feature_extraction.py feature_extraction.sh
Yannik101010 Oct 21, 2021
c7e7d1d
Update extract_features.py, replies.py
Yannik101010 Oct 21, 2021
45b5395
Created hastags.py; Update util.py, extract_feature.py, extract_featu…
Yannik101010 Oct 22, 2021
b2198bc
Update classification.sh
dhesenkamp Oct 22, 2021
db5c2f8
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 22, 2021
160f08a
Merge conflict - random forest classifier
dhesenkamp Oct 22, 2021
a132258
Merge conflict Likes() feature extractor
dhesenkamp Oct 22, 2021
35a70ae
Pipeline testing
dhesenkamp Oct 22, 2021
cac72e5
Create daytime.py
Yannik101010 Oct 23, 2021
eea0bbd
Update daytime.py, extract_feature.sh, extract_feature.py, util.py, e…
Yannik101010 Oct 23, 2021
ed23f5c
Update run_classifier.py, classification.sh
Yannik101010 Oct 23, 2021
38db958
Daytime() feature extractor (added one-hot)
dhesenkamp Oct 24, 2021
2ea07f8
Update classification.sh
dhesenkamp Oct 24, 2021
62da2ee
Merge conflict
dhesenkamp Oct 24, 2021
ff83062
manual merge
dhesenkamp Oct 24, 2021
a455a0f
Merge pull request #15 from dhesenkamp/feature_daytime
dhesenkamp Oct 24, 2021
40fe057
Update daytime.py
Yannik101010 Oct 24, 2021
ce10d4b
Update daytime.py
Yannik101010 Oct 25, 2021
cdf5305
Update .gitignore
dhesenkamp Oct 25, 2021
a6eaaf4
Update .gitignore
dhesenkamp Oct 25, 2021
d710a30
Update Documentation.md
dhesenkamp Oct 25, 2021
081795c
Update Documentation.md
dhesenkamp Oct 26, 2021
762fe55
Corrections, code examples
dhesenkamp Oct 26, 2021
cbc45cc
Added documentation for lemmatization
dhesenkamp Oct 26, 2021
3d5debc
Lemmatization
dhesenkamp Oct 27, 2021
fa5696c
Fixed StopwordRemover()
dhesenkamp Oct 27, 2021
bd44893
Added all feature extraction steps
dhesenkamp Oct 27, 2021
bc01ffc
Merge pull request #16 from dhesenkamp/documentation-visualization
dhesenkamp Oct 27, 2021
860d200
Created ner.py
Yannik101010 Oct 28, 2021
e3595a9
Update .gitignore
dhesenkamp Oct 29, 2021
3c7aa56
Update .gitignore
dhesenkamp Oct 29, 2021
6e0af72
Untrack classifier.pickle file (too big)
dhesenkamp Oct 29, 2021
ef5e7a2
Updates to .gitignore - untracking of some previously tracked files
dhesenkamp Oct 29, 2021
be6f28b
Minor cleanup, documentation
dhesenkamp Oct 29, 2021
0c96227
Added weights arg to knn classifier
dhesenkamp Oct 29, 2021
56c5d3b
Added criterion for split to decision tree
dhesenkamp Oct 29, 2021
89505da
Added additional cl args for random forest + updated documentation
dhesenkamp Oct 29, 2021
c7f95ea
Removed standardization for random forest (not needed)
dhesenkamp Oct 29, 2021
cad897d
Update examples.py, ner.py, extract_feature.py, feature_extraction.sh
Yannik101010 Oct 30, 2021
5df0545
Fine tuning for NER() feature extractor
dhesenkamp Oct 30, 2021
a29e5d1
Revert "Fine tuning for NER() feature extractor"
dhesenkamp Oct 30, 2021
d48bb55
Fine tuning NER() - manually resolving merge conflict
dhesenkamp Oct 30, 2021
3f94e33
manually resolve merge conflict
dhesenkamp Oct 30, 2021
6a76e57
Merge pull request #17 from dhesenkamp/ner
dhesenkamp Oct 30, 2021
7456d26
Documentation + minor cleanup
dhesenkamp Oct 30, 2021
857007d
Added MLP classifier
dhesenkamp Oct 30, 2021
f8970ae
Merge pull request #18 from dhesenkamp/classifier_mlp
dhesenkamp Oct 30, 2021
01c18f3
Added Gaussian NB classifier
dhesenkamp Oct 30, 2021
b4517de
Changed from Gaussian to Complement NB
dhesenkamp Oct 30, 2021
69b68b8
Updated SentimentAnalyzer to only return pos values
dhesenkamp Oct 30, 2021
5c0f26d
Merge pull request #19 from dhesenkamp/classifier_bayes
dhesenkamp Oct 30, 2021
6532381
Update: Clean Code
Yannik101010 Oct 30, 2021
45ed17b
Updated documentation
dhesenkamp Oct 30, 2021
87025a5
Update README.md
Yannik101010 Oct 30, 2021
0b69335
Merge branch 'Readme' into main1
Yannik101010 Oct 30, 2021
b150f53
Added evaluation section to documentation
dhesenkamp Oct 31, 2021
68e1101
Updated classifier to work for param optimization
dhesenkamp Oct 31, 2021
ecd08a0
Hyperparameter optimization script
dhesenkamp Oct 31, 2021
c3c6665
Update Documentation.md
dhesenkamp Oct 31, 2021
605bb61
Summary plots for evaluation metrics
dhesenkamp Oct 31, 2021
f572b02
Added plots for visualization of results to documentation
dhesenkamp Oct 31, 2021
ef63032
Added more plots with summary stats
dhesenkamp Oct 31, 2021
a325be9
Merge pull request #20 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
eb22e87
Documentation + visuals
dhesenkamp Oct 31, 2021
2a5d349
Merge pull request #21 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
f09d753
Added .py file for plots
dhesenkamp Oct 31, 2021
1ac4d25
Update Documentation.md
dhesenkamp Oct 31, 2021
8c94aba
Added tracking results from param optimization
dhesenkamp Oct 31, 2021
8cebc47
Added missing resources & citations
dhesenkamp Nov 1, 2021
Update Documentation.md
Description of preprocessing steps
dhesenkamp committed Oct 26, 2021
commit 081795ce58b2ada841962169ff96291a212d16e6
48 changes: 37 additions & 11 deletions Documentation.md
@@ -20,10 +20,12 @@ Group members: Dennis Hesenkamp, Iolanta Martirosov, Yannik Ullrich

This document contains the documentation for our project, which aims to classify tweets as viral/non-viral based on multiple features derived from

- the meta data of the tweet and
- the metadata of the tweet and
- the natural language features of the tweet.

The data set used is Ruchi Bhatia's [Data Science Tweets 2010-2021](https://www.kaggle.com/ruchi798/data-science-tweets) from [Kaggle](https://www.kaggle.com/). The code base on which we built our machine learning pipeline was provided by Lucas Bechberger (lecturer) and can be found [here](https://github.com/lbechberger/MLinPractice).
The data set used is Ruchi Bhatia's [Data Science Tweets 2010-2021](https://www.kaggle.com/ruchi798/data-science-tweets) from [Kaggle](https://www.kaggle.com/). The code base on which we built our machine learning pipeline was provided by Lucas Bechberger (lecturer) and can be found [here](https://github.com/lbechberger/MLinPractice).

<p style='color:red'><b>On which basis have the labels in the data set been assigned?</b></p>


<!-- Preprocessing section -->
@@ -36,17 +38,30 @@ The data set provides the raw tweet as it has been posted as well as multiple fe
### Tokenization
In the lecture, Lucas implemented a tokenizer that splits tweets into individual tokens using the `nltk` library[^nltk]. This breaks the raw tweet up into its constituents, i.e. the single words and punctuation marks it contains. Further processing and feature extraction can then operate on the individual components of a sentence/tweet instead of on one long string.

### Stop word removal
Example:

```python
sent = "These new data will ultimately help scientists more accurately project the fate of the glacier"

# after tokenization:
sent_token = ['These', 'new', 'data', 'will', 'ultimately', 'help', 'scientists', 'more', 'accurately', 'project', 'the', 'fate', 'of', 'the', 'glacier']
```


### Stopword removal
To extract meaningful natural language features from a string, it makes sense to first remove any stopwords occurring in that string. Say, for example, one would like to look at the most frequently occurring words in a large corpus. Usually, that means looking at words which actually carry _meaning_ in the given context. According to the OEC[^oec], the largest 21<sup>st</sup>-century English text corpus, the most common word in English is _the_, from which we cannot derive any meaning. Hence, it makes sense to remove _the_ and other words that carry no meaning (= stopwords) from a corpus (the set of tweets in our case) before doing anything like keyword or occurrence frequency analysis.

### Punctuation removal
A feature for removing punctuation from the raw tweet has already been implemented by @lbechberger.
There is neither one universal stopword list nor a universal set of rules on how stopwords should be defined. For the sake of convenience, we decided to use `nltk`'s stopword corpus[^nltk_stopwords], an annotated corpus with 2,400 stopwords from 11 languages. It contains high-frequency words with little lexical content, to which we added a few more strings, for instance _https_ or _\&amp;_, to account for link prefixes and special character denotations. Other options would have been `gensim`'s `gensim.parsing.preprocessing.remove_stopwords` function[^gensim_stopwords] or `spaCy`'s stopword list[^spacy_stopwords], but since we already used the `nltk` library, we wanted to stay in that ecosystem.
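
The filtering step itself can be sketched as follows. This is only an illustration, not the project's actual `StopwordRemover`: the small `STOPWORDS` set is a hypothetical stand-in for `nltk`'s English stopword list, and the function name `remove_stopwords` is an assumption made for this example.

```python
# Hypothetical stand-in for nltk's English stopword list (illustration only)
STOPWORDS = {"the", "of", "will", "more", "these"}
# Extra strings to account for link prefixes and special character denotations
STOPWORDS |= {"https", "&amp;"}


def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive comparison)."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


tokens = ['These', 'new', 'data', 'will', 'ultimately', 'help', 'scientists',
          'more', 'accurately', 'project', 'the', 'fate', 'of', 'the', 'glacier']
filtered = remove_stopwords(tokens)
print(filtered)
```

Lowercasing each token before the lookup keeps capitalized stopwords such as _These_ from slipping through.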

### Punctuation Removal
Punctuation removal follows the same rationale as stopword removal: a period or exclamation mark will probably occur often in the corpus without carrying much meaning at first sight (although we can also infer features from punctuation, see [Sentiment Analysis](#sentiment_analysis)). A feature for removing punctuation from the raw tweet had already been implemented by Lucas during the lecture using the `string` library. Again, alternatives exist, for example `gensim`, which offers a function for punctuation removal[^gensim-punctuation]. We decided not to change anything here, as the implemented method worked fine (and there is not much benefit in looking at a different list of punctuation signs anyway, as opposed to stopword lists, which can vary quite a lot).
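
A minimal sketch of punctuation removal with the `string` library is shown below. It is one plausible way to do it and not necessarily how the lecture code is written; the function name `remove_punctuation` is an assumption for this example.

```python
import string

# Translation table that deletes every character listed in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Strip all ASCII punctuation characters from a raw tweet."""
    return text.translate(PUNCT_TABLE)


cleaned = remove_punctuation("Wow!!! Data science is great, isn't it?")
print(cleaned)
```

Note that `string.punctuation` also contains `#` and `@`, so hashtags and mentions lose their prefix unless they are extracted before this step.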

### Lemmatization
Lemmatization modifies an inflected or variant form of a word into its lemma or dictionary form.
Through lemmatization, we can make sure that words - on a semantic level - get interpreted in the same way,
even when inflected: 'walk' and 'walking', for example, stem from the same word and ultimately have the same meaning.
Lemmatization, as opposed to stemming, which is computationally more efficient, tries to take context into account,
which is why we chose to implement it instead of stemming.
Lemmatization reduces an inflected or variant form of a word to its lemma or dictionary form. Through lemmatization, we can make sure that words get interpreted in the same way on a semantic level, even when inflected: _walk_ and _walking_, for example, stem from the same word and ultimately carry the same meaning. Lemmatization, as opposed to stemming, which is computationally more efficient, tries to take context into account, which is why we chose to implement it instead of stemming.
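
The difference between the two approaches can be illustrated with a toy example. Both functions are deliberately simplistic assumptions made for this sketch: `naive_stem` strips a few suffixes, and `lookup_lemma` consults a tiny hand-written dictionary. Neither models context, and neither reflects the `Lemmatizer()` class used in the project.

```python
# Toy lemma dictionary; a real lemmatizer relies on a full lexicon
LEMMAS = {"walking": "walk", "walked": "walk", "was": "be", "better": "good"}


def naive_stem(word):
    """Crude suffix stripping: cheap, but blind to irregular forms."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Dictionary lookup: maps known inflected forms to their lemma."""
    return LEMMAS.get(word, word)


for w in ("walking", "was", "better"):
    print(w, naive_stem(w), lookup_lemma(w))
```

Suffix stripping handles _walking_ but leaves irregular forms such as _was_ and _better_ untouched, whereas the lookup maps them to _be_ and _good_.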


gensim.parsing.preprocessing.stem
https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.stem

<!-- Feature extraction section -->
<a name='feature_extraction'></a>
@@ -62,7 +77,8 @@ Does the month in which the tweet was published have an impact on its virality?
the potential to go viral is higher, e.g. holiday season? Using the `datetime` module, we extract the month in which a
tweet was published from the metadata.
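
A minimal sketch of the month extraction with `datetime` follows. The date format `%Y-%m-%d` and the function name `extract_month` are assumptions made for this example; the actual column format in the data set may differ.

```python
from datetime import datetime


def extract_month(date_string, fmt="%Y-%m-%d"):
    """Parse a date string and return the month as an integer (1-12)."""
    return datetime.strptime(date_string, fmt).month


month = extract_month("2021-10-26")
print(month)
```

Passing a different `fmt` adapts the sketch to other date layouts without changing the function.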

### Sentiment analysis
<a name='sentiment_analysis'></a>
### Sentiment Analysis
Using the VADER (Valence Aware Dictionary and sEntiment Reasoner) framework ([PyPI](https://pypi.org/project/vaderSentiment/)
or [homepage](https://github.com/cjhutto/vaderSentiment)), we extract the sentiment of a tweet. VADER was built
for social media and takes into account, among other factors, emojis, punctuation, and caps. The `polarity_scores()` function
@@ -84,8 +100,18 @@ unknown words, however, are simply classified as neutral.
Robust against class imbalance


## Conclusion
Different research questions:
<p style='color:red'>How does tweet metadata play into virality?</p>



<!-- Footnotes -->
[^nltk]: <https://www.nltk.org/>
[^oec]: <https://web.archive.org/web/20111226085859/http://oxforddictionaries.com/words/the-oec-facts-about-the-language>, retrieved Oct 26, 2021
[^nltk_stopwords]: <https://www.nltk.org/book/ch02.html>, retrieved Oct 26, 2021
[^gensim_stopwords]: <https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.remove_stopwords>, retrieved Oct 26, 2021
[^spacy_stopwords]: <https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py>, retrieved Oct 26, 2021
[^gensim-punctuation]: <https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_punctuation>, retrieved Oct 26, 2021

<!-- -->