
Merge tweetclean to main #18

Open · 87 commits into base: main

Commits (87)
b530c66
- adds main function to most scripts
TobiObeck Oct 6, 2021
7b953d6
adds multiple evaluation metrics for classifier
TobiObeck Oct 6, 2021
de3084f
Merge branch 'main' into add-evaluation-metrics
TobiObeck Oct 6, 2021
86a9eb1
separates loading of data sets into setup.sh script
TobiObeck Oct 6, 2021
929db48
Merge branch 'main' of https://github.com/lbechberger/MLinPractice
TobiObeck Oct 7, 2021
ae3c345
Update util.py
pariyashu Oct 7, 2021
0daf32a
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 7, 2021
a7a61aa
Merge commit 'a7c7fdb9a3ff5a5af9677b5808279c5f8b018662'
TobiObeck Oct 7, 2021
ee9f4c0
adds parsing of tokenized tweets example
TobiObeck Oct 7, 2021
fde0818
adds func to document tests
TobiObeck Oct 7, 2021
a9ce4de
moves tests into respective code folder
TobiObeck Oct 7, 2021
f37740f
adds documentation how to run tests
TobiObeck Oct 7, 2021
303ba92
test connection
pariyashu Oct 9, 2021
7c873ac
remove unnecessary comment
pariyashu Oct 9, 2021
d0cb537
adds counting of mentions & removal of orig. column
TobiObeck Oct 10, 2021
39b1bdc
minor cleanup
TobiObeck Oct 10, 2021
6483c10
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 10, 2021
8988ef6
adds MentionsCounter Preprocessor
TobiObeck Oct 10, 2021
8e92246
minor cleanup
TobiObeck Oct 10, 2021
cf0c340
test connection
pariyashu Oct 9, 2021
ca148df
remove unnecessary comment
pariyashu Oct 9, 2021
5e03e47
Merge pull request #1 from TobiObeck/mentions-count-col
pariyashu Oct 10, 2021
c45a142
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
pariyashu Oct 10, 2021
8777d37
filter language
pariyashu Oct 10, 2021
240dc9a
drop columns inc eng
pariyashu Oct 10, 2021
9ea4c4e
implements column remover as proper preprocessor
TobiObeck Oct 12, 2021
bebf224
Merge branch 'main' of https://github.com/lbechberger/MLinPractice in…
TobiObeck Oct 12, 2021
62b0de2
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
6e19a6d
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
fba4d6b
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 13, 2021
6fdf847
renames folder code -> src
TobiObeck Oct 19, 2021
eaf0117
adds test for counting feature
TobiObeck Oct 19, 2021
33c42bc
cleanup
TobiObeck Oct 19, 2021
d584460
minor changes
TobiObeck Oct 19, 2021
a821f94
renames folder code -> src
TobiObeck Oct 19, 2021
00c91be
stores mlflow and pickle data
TobiObeck Oct 19, 2021
39b3875
Merge branch 'temp-pull-grid-from-bech'
TobiObeck Oct 19, 2021
a07f531
separates examples into corresponding files
TobiObeck Oct 19, 2021
6c860e8
adds randomforest classifier
TobiObeck Oct 22, 2021
2227e83
disables dimensionality reduction
TobiObeck Oct 22, 2021
46d98b4
adds a classification run for all classifiers
TobiObeck Oct 22, 2021
036ec09
adds shebang line for bash scripts
TobiObeck Oct 24, 2021
e81c3f3
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
TobiObeck Oct 24, 2021
b372ca5
mention counter
pariyashu Oct 26, 2021
bf3bf1e
Update run_preprocessing.py
pariyashu Oct 26, 2021
cc0e100
Revert "mention counter"
TobiObeck Oct 27, 2021
a29b5bd
Revert "Update run_preprocessing.py"
TobiObeck Oct 27, 2021
9212938
adds visualization showing language distribution
TobiObeck Nov 1, 2021
f31982e
adds a test for evaluation metrics
TobiObeck Nov 2, 2021
070c757
minor cleanup
TobiObeck Nov 2, 2021
efef989
adds sentiment analysis feature (WIP)
TobiObeck Nov 2, 2021
a73e745
properly implements sentiment analysis
TobiObeck Nov 3, 2021
b1b5ac8
allows naming of classific. runs in mlflow logs
TobiObeck Nov 3, 2021
189fb8f
adds classific. run after implemented sentiment feature
TobiObeck Nov 3, 2021
80b9599
adds more count features
TobiObeck Nov 3, 2021
43d1cca
adds classific. run after adding more count features
TobiObeck Nov 3, 2021
f2cc2aa
adds grid search for optimal hyperparameters
TobiObeck Nov 6, 2021
303ff3f
adds folder for documentation
TobiObeck Nov 7, 2021
1fcafa8
adds docs introduction
TobiObeck Nov 7, 2021
fa5af65
docs: slight improvement of introduction
TobiObeck Nov 7, 2021
e2afa28
docs: adds evaluation (WIP)
TobiObeck Nov 7, 2021
5caf5bb
finishes evaluation
TobiObeck Nov 9, 2021
d23a285
adds preprocessing
TobiObeck Nov 10, 2021
6e93071
docs: improves preprocessing
TobiObeck Nov 11, 2021
2f9b373
docs: adds ml flow results screenshot
TobiObeck Nov 11, 2021
894f244
adds sklearn grid search for random forest classifier
TobiObeck Nov 12, 2021
56afea3
docs: adds counting feature extractor
TobiObeck Nov 12, 2021
0e68b30
docs: adds sentiment feature extractor
TobiObeck Nov 12, 2021
88a0b9b
docs: revises sentiment feature extraction
TobiObeck Nov 13, 2021
6b96d83
docs: fixes image link
TobiObeck Nov 13, 2021
30cdd50
docs: adds random forest classifier motivation
TobiObeck Nov 13, 2021
0eb2a41
docs: adds Results and more features
TobiObeck Nov 13, 2021
260fa13
docs: adds Results and for even more features
TobiObeck Nov 14, 2021
cac898d
docs: adds beginning of Hyperparameter Optimization
TobiObeck Nov 14, 2021
c12b6bf
minor cleanup
TobiObeck Nov 14, 2021
96d177f
docs: rest of hyper param optim and conclusion
TobiObeck Nov 14, 2021
7cd2735
docs: revises some sentences across whole text
TobiObeck Nov 14, 2021
027882a
fixes GridSearchCV
TobiObeck Nov 15, 2021
3c12159
Merge branch 'optimization'
TobiObeck Nov 15, 2021
83e4ab7
Merge branch 'documentation-2'
TobiObeck Nov 15, 2021
a9f3518
docs: updates readme according to the code
TobiObeck Nov 15, 2021
b4c34a3
docs: refines classification flags
TobiObeck Nov 15, 2021
507d027
docs: specifies 5 scenarios for classification.sh
TobiObeck Nov 15, 2021
da77c62
docs: fixes grammar, punctuation and typos
TobiObeck Nov 15, 2021
1cd1d8e
docs: fixes more grammar, punctuation and typos
TobiObeck Nov 15, 2021
f283e8d
moves preprocessors and feature extractors in sub-folder
TobiObeck Nov 15, 2021
5378948
tiny cleanup
TobiObeck Nov 15, 2021
adds test for counting feature
TobiObeck committed Oct 19, 2021
commit eaf01174de488f7dcf818be7d0211b517b7f8c32
40 changes: 40 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,40 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python: Attach",
"type": "python",
"request": "attach",
"connect": {
"host": "localhost",
"port": 5678
}
},
{
"name": "Python: Module",
"type": "python",
"request": "launch",
"module": "code",
"cwd": "${workspaceFolder}",
},
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"cwd": "${workspaceFolder}",
// "pythonArgs": [
// "-m",
// "src.feature_extraction.test.feature_extraction_test",
// "E:\\MyPC\\code\\git\\myforkMLiP\\MLinPractice\\src\\feature_extraction\\test\\feature_extraction_test.py"
// ],
// "env": {
// "PYTHONPATH": "${workspaceFolder}/code"
// }
}
]
}
42 changes: 35 additions & 7 deletions README.md
@@ -58,7 +58,7 @@ All python scripts and classes for the preprocessing of the input data can be fo
### Creating Labels

The script `create_labels.py` assigns labels to the raw data points based on a threshold on a linear combination of the number of likes and retweets. It is executed as follows:
```python -m code.preprocessing.create_labels path/to/input_dir path/to/output.csv```
```python -m src.preprocessing.create_labels path/to/input_dir path/to/output.csv```
Here, `input_dir` is the directory containing the original raw csv files, while `output.csv` is the single csv file where the output will be written.
The script takes the following optional parameters:
- `-l` or `--likes_weight` determines the relative weight of the number of likes a tweet has received. Defaults to 1.
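The threshold rule described above can be sketched as follows. This is only an illustration of "a threshold on a linear combination of likes and retweets"; the actual logic lives in `create_labels.py`, and the `retweets_weight` and `threshold` defaults here are placeholders, not values taken from the script:

```python
def assign_label(likes, retweets, likes_weight=1, retweets_weight=1, threshold=50):
    # linear combination of likes and retweets, compared against a threshold;
    # the default threshold of 50 is a made-up placeholder
    score = likes_weight * likes + retweets_weight * retweets
    return 1 if score >= threshold else 0
```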
@@ -68,7 +68,7 @@ The script takes the following optional parameters:
### Classical Preprocessing

The script `run_preprocessing.py` is used to run various preprocessing steps on the raw data, producing additional columns in the csv file. It is executed as follows:
```python -m code.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
```python -m src.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
Here, `input.csv` is a csv file (ideally the output of `create_labels.py`), while `output.csv` is the csv file where the output will be written.
The preprocessing steps to take can be configured with the following flags:
- `-p` or `--punctuation`: A new column "tweet_no_punctuation" is created, where all punctuation is removed from the original tweet. (See `code/preprocessing/punctuation_remover.py` for more details)
@@ -80,7 +80,7 @@ Moreover, the script accepts the following optional parameters:
### Splitting the Data Set

The script `split_data.py` splits the overall preprocessed data into training, validation, and test set. It can be invoked as follows:
```python -m code.preprocessing.split_data path/to/input.csv path/to/output_dir```
```python -m src.preprocessing.split_data path/to/input.csv path/to/output_dir```
Here, `input.csv` is the input csv file to split (containing a column "label" with the label information, i.e., `create_labels.py` needs to be run beforehand) and `output_dir` is the directory where three individual csv files `training.csv`, `validation.csv`, and `test.csv` will be stored.
The script takes the following optional parameters:
- `-t` or `--test_size` determines the relative size of the test set and defaults to 0.2 (i.e., 20 % of the data).
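A dependency-free sketch of such a shuffled three-way split (illustrative only; `split_data.py` has its own implementation, and the `validation_size` parameter here is an assumption mirroring `test_size`):

```python
import random

def three_way_split(rows, test_size=0.2, validation_size=0.2, seed=42):
    # deterministic shuffle, then carve off the test and validation portions
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * validation_size)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]
    return training, validation, test
```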
@@ -93,7 +93,7 @@ The script takes the following optional parameters:
All python scripts and classes for feature extraction can be found in `code/feature_extraction/`.

The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
```python -m code.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
```python -m src.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
Here, `input.csv` is the respective training, validation, or test set file created by `split_data.py`. The file `output.pickle` will be used to store the results of the feature extraction process, namely a dictionary with the following entries:
- `"features"`: a numpy array with the raw feature values (rows are training examples, colums are features)
- `"feature_names"`: a list of feature names for the columns of the numpy array
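The pickle structure described above can be illustrated with a quick round trip. Plain lists stand in for the numpy arrays to keep the sketch dependency-free, and the feature names are made up for the example:

```python
import pickle

# illustrative payload mirroring the documented dictionary keys
payload = {
    "features": [[101, 2], [57, 0]],  # rows are examples, columns are features
    "feature_names": ["character_length", "mentions_count"],
}

# serialize and deserialize, as extract_features.py and its consumers would
restored = pickle.loads(pickle.dumps(payload))
```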
@@ -112,7 +112,7 @@ All python scripts and classes for dimensionality reduction can be found in `cod

The script `reduce_dimensionality.py` takes care of the overall dimensionality reduction procedure and can be invoked as follows:

```python -m code.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
```python -m src.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
Here, `input.pickle` is the respective training, validation, or test set file created by `extract_features.py`.
The file `output.pickle` will be used to store the results of the dimensionality reduction process, containing `"features"` (which are the selected/projected ones) and `"labels"` (same as in the input file).

@@ -132,7 +132,7 @@ All python scripts and classes for classification can be found in `code/classifi
### Train and Evaluate a Single Classifier

The script `run_classifier.py` can be used to train and/or evaluate a given classifier. It can be executed as follows:
```python -m code.classification.run_classifier path/to/input.pickle```
```python -m src.classification.run_classifier path/to/input.pickle```
Here, `input.pickle` is a pickle file of the respective data subset, produced by either `extract_features.py` or `reduce_dimensionality.py`.

By default, this data is used to train a **classifier**, which is specified by one of the following optional arguments:
@@ -166,5 +166,33 @@ All python code for the application demo can be found in `code/application/`.

The script `application.py` provides a simple command line interface, where the user is asked to type in their prospective tweet, which is then analyzed using the trained ML pipeline.
The script can be invoked as follows:
```python -m code.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
```python -m src.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.

## Debugging in Visual Studio Code

1. Run the file in debug mode configured to wait for a client, because otherwise it would just finish too quickly:

```
python -m debugpy --wait-for-client --listen 5678 .\src\feature_extraction\test\feature_extraction_test.py
```

2. `launch.json` configuration to attach the editor to the already started debug process.

```json
...
"configurations": [
{
"name": "Python: Attach",
"type": "python",
"request": "attach",
"connect": {
"host": "localhost",
"port": 5678
}
},
]
...
```

3. Start the attach debug configuration via the VS Code UI ([F5] key or `Run`/`Run and Debug` menu)
2 changes: 1 addition & 1 deletion src/feature_extraction/extract_features.py
@@ -39,7 +39,7 @@ def main():
# character length of original tweet (without any changes)
features.append(CharacterLength(COLUMN_TWEET))
features.append(CounterFE(COLUMN_MENTIONS))
# features.append(CounterFE(COLUMN_PHOTOS))
features.append(CounterFE(COLUMN_PHOTOS))

# create overall FeatureCollector
feature_collector = FeatureCollector(features)
60 changes: 60 additions & 0 deletions src/feature_extraction/test/feature_extraction_test.py
@@ -0,0 +1,60 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Tests feature extraction
"""

import csv
import logging
import unittest
import pandas as pd
import numpy as np
from src.feature_extraction.counter_fe import CounterFE

class CountFeatureTest(unittest.TestCase):

def setUp(self):
logging.basicConfig()
self.log = logging.getLogger("LOG")

self.tryout_df = pd.read_csv("data/preprocessing/split/training.csv", quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")
self.tryout_df = self.tryout_df.rename(columns={"mentions": "mockcolumn", "photos": "mockphotos"})

self.INPUT_COLUMN = "mockcolumn"
self.counter_feature = CounterFE(self.INPUT_COLUMN)
# self.df = pd.DataFrame({ self.INPUT_COLUMN: [{'screen_name': 'zeebusiness', 'name': 'zee business', 'id': '140798905'}, {'screen_name': 'amishdevgan', 'name': 'amish devgan', 'id': '163817624'}] } )

self.df = pd.DataFrame()
# self.df[self.INPUT_COLUMN] = "['[\"This\", \"row\", \"has\", \"five\", \"elements\"], [\"this\", \"only\", \"thre\"], [\"one\"], []']"
self.df[self.INPUT_COLUMN] = [
"[{'screen_name': 'zeebusiness', 'name': 'zee business', 'id': '140798905'}, {'screen_name': 'amishdevgan', 'name': 'amish devgan', 'id': '163817624'}]",
"[]",
"[{'screen_name': 'zeebusiness', 'name': 'zee business', 'id': '140798905'}]"
]
print("")


def test_input_columns(self):
self.assertEqual(self.counter_feature._input_columns, [self.INPUT_COLUMN])


def test_feature_name(self):
self.assertEqual(self.counter_feature.get_feature_name(), self.INPUT_COLUMN + "_count")


def test_counting(self):
self.counter_feature.fit(self.df)
actual_feature = self.counter_feature.transform(self.df)
# actual_feature = self.counter_feature.transform(self.tryout_df)
# EXPECTED = np.array(pd.DataFrame({"mockcolumn_count": [5,3,1,0]}))
EXPECTED = np.array(pd.DataFrame({"mockcolumn_count": [2,0,1]}))

# self.log.warning("actual_feature", actual_feature)
# self.log.warning("EXPECTED", EXPECTED)

isEqual = np.array_equal(actual_feature, EXPECTED, equal_nan=False)
self.assertTrue(isEqual)


if __name__ == '__main__':
unittest.main()
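The test above feeds `CounterFE` stringified Python lists and expects the element counts back as a column. A minimal sketch of such a counter, inferred from the test's inputs and expectations (this is an assumption about how `counter_fe.py` works, not the actual implementation):

```python
import ast

class CounterSketch:
    """Counts the number of elements in a stringified list, per row."""

    def __init__(self, input_column):
        self._input_columns = [input_column]

    def get_feature_name(self):
        return self._input_columns[0] + "_count"

    def fit(self, df):
        # counting needs no fitting; present only to mirror the extractor interface
        pass

    def transform(self, df):
        # parse each cell like "[{...}, {...}]" and count its elements
        counts = [len(ast.literal_eval(cell)) for cell in df[self._input_columns[0]]]
        return [[c] for c in counts]  # column vector, matching the test's expected shape
```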