Merge tweetclean to main #18

Open
wants to merge 87 commits into
base: main
Commits
b530c66
- adds main function to most scripts
TobiObeck Oct 6, 2021
7b953d6
adds multiple evaluation metrics for classifier
TobiObeck Oct 6, 2021
de3084f
Merge branch 'main' into add-evaluation-metrics
TobiObeck Oct 6, 2021
86a9eb1
separates loading of data sets into setup.sh script
TobiObeck Oct 6, 2021
929db48
Merge branch 'main' of https://github.com/lbechberger/MLinPractice
TobiObeck Oct 7, 2021
ae3c345
Update util.py
pariyashu Oct 7, 2021
0daf32a
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 7, 2021
a7a61aa
Merge commit 'a7c7fdb9a3ff5a5af9677b5808279c5f8b018662'
TobiObeck Oct 7, 2021
ee9f4c0
adds parsing of tokenized tweets example
TobiObeck Oct 7, 2021
fde0818
adds func to document tests
TobiObeck Oct 7, 2021
a9ce4de
moves tests into respective code folder
TobiObeck Oct 7, 2021
f37740f
adds documentation how to run tests
TobiObeck Oct 7, 2021
303ba92
test connection
pariyashu Oct 9, 2021
7c873ac
remove unnecessary comment
pariyashu Oct 9, 2021
d0cb537
adds counting of mentions & removal of orig. column
TobiObeck Oct 10, 2021
39b1bdc
minor cleanup
TobiObeck Oct 10, 2021
6483c10
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 10, 2021
8988ef6
adds MentionsCounter Preprocessor
TobiObeck Oct 10, 2021
8e92246
minor cleanup
TobiObeck Oct 10, 2021
cf0c340
test connection
pariyashu Oct 9, 2021
ca148df
remove unnecessary comment
pariyashu Oct 9, 2021
5e03e47
Merge pull request #1 from TobiObeck/mentions-count-col
pariyashu Oct 10, 2021
c45a142
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
pariyashu Oct 10, 2021
8777d37
filter language
pariyashu Oct 10, 2021
240dc9a
drop columns inc eng
pariyashu Oct 10, 2021
9ea4c4e
implements column remover as proper preprocessor
TobiObeck Oct 12, 2021
bebf224
Merge branch 'main' of https://github.com/lbechberger/MLinPractice in…
TobiObeck Oct 12, 2021
62b0de2
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
6e19a6d
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
fba4d6b
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 13, 2021
6fdf847
renames folder code -> src
TobiObeck Oct 19, 2021
eaf0117
adds test for counting feature
TobiObeck Oct 19, 2021
33c42bc
cleanup
TobiObeck Oct 19, 2021
d584460
minor changes
TobiObeck Oct 19, 2021
a821f94
renames folder code -> src
TobiObeck Oct 19, 2021
00c91be
stores mlflow and pickle data
TobiObeck Oct 19, 2021
39b3875
Merge branch 'temp-pull-grid-from-bech'
TobiObeck Oct 19, 2021
a07f531
separates examples into corresponding files
TobiObeck Oct 19, 2021
6c860e8
adds randomforest classifier
TobiObeck Oct 22, 2021
2227e83
disables dimensionality reduction
TobiObeck Oct 22, 2021
46d98b4
adds a classification run for all classifiers
TobiObeck Oct 22, 2021
036ec09
adds shebang line for bash scripts
TobiObeck Oct 24, 2021
e81c3f3
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
TobiObeck Oct 24, 2021
b372ca5
mention counter
pariyashu Oct 26, 2021
bf3bf1e
Update run_preprocessing.py
pariyashu Oct 26, 2021
cc0e100
Revert "mention counter"
TobiObeck Oct 27, 2021
a29b5bd
Revert "Update run_preprocessing.py"
TobiObeck Oct 27, 2021
9212938
adds visualization showing language distribution
TobiObeck Nov 1, 2021
f31982e
adds a test for evaluation metrics
TobiObeck Nov 2, 2021
070c757
minor cleanup
TobiObeck Nov 2, 2021
efef989
adds sentiment analysis feature (WIP)
TobiObeck Nov 2, 2021
a73e745
properly implements sentiment analysis
TobiObeck Nov 3, 2021
b1b5ac8
allows naming of classific. runs in mlflow logs
TobiObeck Nov 3, 2021
189fb8f
adds classific. run after implemented sentiment feature
TobiObeck Nov 3, 2021
80b9599
adds more count features
TobiObeck Nov 3, 2021
43d1cca
adds classific. run after adding more count features
TobiObeck Nov 3, 2021
f2cc2aa
adds grid search for optimal hyperparameters
TobiObeck Nov 6, 2021
303ff3f
adds folder for documentation
TobiObeck Nov 7, 2021
1fcafa8
adds docs introduction
TobiObeck Nov 7, 2021
fa5af65
docs: slight improvement of introduction
TobiObeck Nov 7, 2021
e2afa28
docs: adds evaluation (WIP)
TobiObeck Nov 7, 2021
5caf5bb
finishes evaluation
TobiObeck Nov 9, 2021
d23a285
adds preprocessing
TobiObeck Nov 10, 2021
6e93071
docs: improves preprocessing
TobiObeck Nov 11, 2021
2f9b373
docs: adds ml flow results screenshot
TobiObeck Nov 11, 2021
894f244
adds sklearn grid search for random forest classifier
TobiObeck Nov 12, 2021
56afea3
docs: adds counting feature extractor
TobiObeck Nov 12, 2021
0e68b30
docs: adds sentiment feature extractor
TobiObeck Nov 12, 2021
88a0b9b
docs: revises sentiment feature extraction
TobiObeck Nov 13, 2021
6b96d83
docs: fixes image link
TobiObeck Nov 13, 2021
30cdd50
docs: adds random forest classifier motivation
TobiObeck Nov 13, 2021
0eb2a41
docs: adds Results and more features
TobiObeck Nov 13, 2021
260fa13
docs: adds Results and for even more features
TobiObeck Nov 14, 2021
cac898d
docs: adds beginning of Hyperparameter Optimization
TobiObeck Nov 14, 2021
c12b6bf
minor cleanup
TobiObeck Nov 14, 2021
96d177f
docs: rest of hyper param optim and conclusion
TobiObeck Nov 14, 2021
7cd2735
docs: revises some sentences across whole text
TobiObeck Nov 14, 2021
027882a
fixes GridSearchCV
TobiObeck Nov 15, 2021
3c12159
Merge branch 'optimization'
TobiObeck Nov 15, 2021
83e4ab7
Merge branch 'documentation-2'
TobiObeck Nov 15, 2021
a9f3518
docs: updates readme according to the code
TobiObeck Nov 15, 2021
b4c34a3
docs: refines classification flags
TobiObeck Nov 15, 2021
507d027
docs: specifies 5 scenarios for classification.sh
TobiObeck Nov 15, 2021
da77c62
docs: fixes grammar, punctuation and typos
TobiObeck Nov 15, 2021
1cd1d8e
docs: fixes more grammar, punctuation and typos
TobiObeck Nov 15, 2021
f283e8d
moves preprocessors and feature extractors in sub-folder
TobiObeck Nov 15, 2021
5378948
tiny cleanup
TobiObeck Nov 15, 2021
adds docs introduction
TobiObeck committed Nov 7, 2021
commit 1fcafa84037a4a19c890aa915548aedf2cdce83e
6 changes: 5 additions & 1 deletion docs/Documentation.md
@@ -1,4 +1,8 @@
# Documentation Example
# Documentation - [Patoali](https://trello.com/b/3pj6SkWa)

This document presents the author's work on the 'Machine Learning in Practice' project, which took place during the summer term 2021 as a block seminar at Osnabrück University. The task was to analyze a data set of data-science-related tweets and to predict with machine learning models whether a tweet will go viral. A tweet is defined as viral if the sum of its likes and retweets exceeds an arbitrary threshold of 50. The data set _Data Science Tweets 2010-2021_ contains _data science_, _data analysis_ and _data visualization_ tweets from verified accounts on Twitter from 2010 to 2021. It was collected and [shared on kaggle.com](https://www.kaggle.com/ruchi798/data-science-tweets) by Ruchi Bhatia.
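
The virality definition can be sketched in a few lines of Python. This is only an illustration, not code from the project: the names `is_viral` and `VIRALITY_THRESHOLD` are hypothetical, and "exceeds" is read here as strictly greater than the threshold.

```python
# Threshold from the text: a tweet counts as viral if likes + retweets exceed 50.
VIRALITY_THRESHOLD = 50


def is_viral(likes: int, retweets: int, threshold: int = VIRALITY_THRESHOLD) -> bool:
    """Return True if the sum of likes and retweets exceeds the threshold.

    Hypothetical helper for illustration; "exceeds" is interpreted as a
    strict comparison (>), which is an assumption.
    """
    return likes + retweets > threshold


# Two example tweets: one just above and one exactly at the threshold.
above_threshold = is_viral(likes=30, retweets=21)
at_threshold = is_viral(likes=25, retweets=25)
```

Under this reading, a tweet with exactly 50 likes and retweets in total is not labeled viral.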

The lecturer, Lucas Bechberger, provided his students with a foundational codebase. It consists of multiple Python (`.py`) and Bash (`.sh`) scripts that form a basic pipeline of the processing steps _preprocessing_, _feature extraction_, _dimensionality reduction_ and _classification_, which is common for machine learning projects. The shell scripts can be used to run the whole pipeline or to run individual steps by invoking Python scripts with specific command-line arguments. The results of the pipeline steps are stored in `.pickle` files so that a separate application (`src\application\application.py`) can reuse them. The application offers a rudimentary read-eval-print loop (REPL) that predicts the virality of a tweet the user enters. The students' task was to understand the codebase and to extend or replace the given placeholder implementations with proper solutions in order to improve and measure the virality prediction.
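
The two mechanisms described here, storing step results as pickle files and offering a read-eval-print loop, can be sketched roughly as follows. This is a simplified illustration and not the project's actual `application.py`: the helpers `save_step`, `load_step` and `run_repl` are hypothetical, and the "model" is a toy keyword set standing in for a trained classifier.

```python
import os
import pickle
import tempfile


def save_step(result, path):
    """Store the result of one pipeline step so a later step can reuse it."""
    with open(path, "wb") as f:
        pickle.dump(result, f)


def load_step(path):
    """Load a previously stored pipeline step result."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_repl(predict, lines, write=print):
    """Read tweets one by one, evaluate them, and write the prediction.

    `lines` is any iterable of strings; an empty line ends the loop.
    """
    for line in lines:
        tweet = line.strip()
        if not tweet:
            break
        write("viral" if predict(tweet) else "not viral")


# Toy stand-in for a trained classifier: a keyword set, NOT a real model.
toy_model = {"keywords": {"data", "science"}}

# Round-trip the toy model through a pickle file, as a pipeline step would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pickle")
    save_step(toy_model, model_path)
    restored = load_step(model_path)


def toy_predict(tweet):
    """Predict 'viral' if the tweet contains any stored keyword (toy rule)."""
    return any(word in restored["keywords"] for word in tweet.lower().split())


# Feed a fixed list of inputs instead of typing them interactively.
outputs = []
run_repl(toy_predict, ["Data science is fun", "hello world", "", "ignored"],
         write=outputs.append)
```

For interactive use, the same loop could be driven by `run_repl(toy_predict, iter(input, ""))`, which keeps reading from the keyboard until an empty line is entered.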

## Evaluation