
h #12

Open · wants to merge 145 commits into base: main

Changes from all commits · 145 commits
aa7eb11
Update .gitignore to exclude OSX specific files
dhesenkamp Oct 4, 2021
f6e4072
Merge remote-tracking branch 'upstream/main' into main
dhesenkamp Oct 5, 2021
00d49da
Merge branch 'lbechberger:main' into main
imartirosov Oct 5, 2021
beee080
Added uniform classifier
dhesenkamp Oct 5, 2021
d806664
Added F1 score evaluation metric
dhesenkamp Oct 5, 2021
8c1addb
Merge pull request #1 from dhesenkamp/classifier
dhesenkamp Oct 6, 2021
5d35082
Added tweet tokenization
dhesenkamp Oct 6, 2021
4b43523
Update preprocessing.sh
dhesenkamp Oct 6, 2021
5f186cc
Create Documentation.md
dhesenkamp Oct 6, 2021
537acf7
Merge pull request #2 from dhesenkamp/tokenizer
dhesenkamp Oct 6, 2021
ec5eeeb
Modified punctuation_remover.py
dhesenkamp Oct 7, 2021
841745f
Merge pull request #4 from dhesenkamp/tokenizer
dhesenkamp Oct 7, 2021
473af68
Revert "Merge pull request #1 from dhesenkamp/classifier"
dhesenkamp Oct 7, 2021
f7e9e15
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
7dfd46a
Resolve merge conflict
dhesenkamp Oct 7, 2021
2f6b559
Resolve merge conflict
dhesenkamp Oct 7, 2021
b92fea9
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 7, 2021
a377bcc
Update Documentation.md
dhesenkamp Oct 7, 2021
189843b
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
69fbf07
Testing of tokenize_input
dhesenkamp Oct 7, 2021
5d4f975
Added stopword remover
dhesenkamp Oct 7, 2021
d577dc3
Refined stopword remover
dhesenkamp Oct 7, 2021
50c422f
Update stopword_remover.py
dhesenkamp Oct 7, 2021
126812d
Further refining of stopword remover
dhesenkamp Oct 8, 2021
e8d5b86
StopwordRemover(), minor changes
dhesenkamp Oct 8, 2021
45e049b
Short info on Cohen's kappa
dhesenkamp Oct 8, 2021
f8f9ef3
Merge pull request #6 from dhesenkamp/stop_word_removal
dhesenkamp Oct 8, 2021
f83dc31
Added Lemmatizer() class
dhesenkamp Oct 8, 2021
ca27622
Added command line arguments etc
dhesenkamp Oct 8, 2021
2e25f66
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 8, 2021
f46090b
Merge pull request #7 from dhesenkamp/lemmatizer
dhesenkamp Oct 11, 2021
4a4f00f
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 11, 2021
0aa2c88
Update README.md
dhesenkamp Oct 11, 2021
71c4ef9
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 11, 2021
263e39d
Added feature extraction for month
dhesenkamp Oct 12, 2021
84dafa0
Command line args for month extractor
dhesenkamp Oct 12, 2021
1e849de
Trying to resolve merge conflict manually
dhesenkamp Oct 12, 2021
179d236
Merge pull request #8 from dhesenkamp/feature_month
dhesenkamp Oct 12, 2021
04c515a
Merge conflict, readme, documentation
dhesenkamp Oct 12, 2021
c794c7a
Added SentimentAnalyzer class
dhesenkamp Oct 13, 2021
f896c03
SentimentAnalyzer() command line args + script
dhesenkamp Oct 13, 2021
41e6385
readme and documentation for SentimentAnalyzer
dhesenkamp Oct 13, 2021
4fc38ec
Merge feature_sentiment into main
dhesenkamp Oct 13, 2021
aa41d1b
Update readme.md wrt SentimentAnalyser
dhesenkamp Oct 13, 2021
18fd5df
mlflow added to README.md
dhesenkamp Oct 13, 2021
7c59805
Added decision tree classifier
dhesenkamp Oct 13, 2021
5c67318
Param optimization
dhesenkamp Oct 14, 2021
98e646f
Update .gitignore
dhesenkamp Oct 14, 2021
4704f20
Removed .DS_Store
dhesenkamp Oct 14, 2021
f99d6bf
Classifier testing
dhesenkamp Oct 15, 2021
259d0dd
Update .gitignore
dhesenkamp Oct 15, 2021
edd8a82
Added SVM classifier
dhesenkamp Oct 19, 2021
02e9982
Update .gitignore to exlcude mlruns subfolder
dhesenkamp Oct 19, 2021
a8e4099
SVM classifier testing
dhesenkamp Oct 20, 2021
f670e81
Merge pull request for classifier_svm
dhesenkamp Oct 20, 2021
88ffa97
Implemented Photos feature extractor
dhesenkamp Oct 21, 2021
0c6a790
Command line arguments for Photos feature extractor
dhesenkamp Oct 21, 2021
17181d9
Troubleshooting & testing
dhesenkamp Oct 21, 2021
9e61fc4
Testing complete for Photos() feature extractor
dhesenkamp Oct 21, 2021
b1c2e6d
Merge pull request feature_photos
dhesenkamp Oct 21, 2021
065092e
Created & implemented Mentions() feature extractor
dhesenkamp Oct 21, 2021
bfc71e4
Command line args for Mentions() feature extractor
dhesenkamp Oct 21, 2021
bb41b84
Mentions() feature extractor testing
dhesenkamp Oct 21, 2021
690c621
Merge pull request feature_mention
dhesenkamp Oct 21, 2021
bd37301
Merge conflict - manual resolve
dhesenkamp Oct 21, 2021
92fca4b
Update feature_extraction.sh
dhesenkamp Oct 21, 2021
fe60d7e
Manually resolved merge conflict of previous pull request from featur…
dhesenkamp Oct 21, 2021
4aa460e
URL() feature extractor testing
dhesenkamp Oct 21, 2021
31d794c
Variable renaming
dhesenkamp Oct 21, 2021
da54a04
Feature extraction script testing
dhesenkamp Oct 21, 2021
3badd5b
Update stopword_remover.py
dhesenkamp Oct 21, 2021
1b912ba
Updated svm classifier
dhesenkamp Oct 21, 2021
3c61ef7
Pipeline testing
dhesenkamp Oct 21, 2021
2083d71
Created retweets.py, Update extract_features.py
Yannik101010 Oct 21, 2021
f984812
Update examples.py, feature_extraction.py, feature_extraction.sh
Yannik101010 Oct 21, 2021
6890fdb
Create replies.py
Yannik101010 Oct 21, 2021
900e9fb
Update util.py, feature_extraction.py feature_extraction.sh
Yannik101010 Oct 21, 2021
c7e7d1d
Update extract_features.py, replies.py
Yannik101010 Oct 21, 2021
45b5395
Created hastags.py; Update util.py, extract_feature.py, extract_featu…
Yannik101010 Oct 22, 2021
b2198bc
Update classification.sh
dhesenkamp Oct 22, 2021
db5c2f8
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 22, 2021
160f08a
Merge conflict - random forest classifier
dhesenkamp Oct 22, 2021
a132258
Merge conflict Likes() feature extractor
dhesenkamp Oct 22, 2021
35a70ae
Pipeline testing
dhesenkamp Oct 22, 2021
cac72e5
Create daytime.py
Yannik101010 Oct 23, 2021
eea0bbd
Update daytime.py, extract_feature.sh, extract_feature.py, util.py, e…
Yannik101010 Oct 23, 2021
ed23f5c
Update run_classifier.py, classification.sh
Yannik101010 Oct 23, 2021
38db958
Daytime() feature extractor (added one-hot)
dhesenkamp Oct 24, 2021
2ea07f8
Update classification.sh
dhesenkamp Oct 24, 2021
62da2ee
Merge conflict
dhesenkamp Oct 24, 2021
ff83062
manual merge
dhesenkamp Oct 24, 2021
a455a0f
Merge pull request #15 from dhesenkamp/feature_daytime
dhesenkamp Oct 24, 2021
40fe057
Update daytime.py
Yannik101010 Oct 24, 2021
ce10d4b
Update daytime.py
Yannik101010 Oct 25, 2021
cdf5305
Update .gitignore
dhesenkamp Oct 25, 2021
a6eaaf4
Update .gitignore
dhesenkamp Oct 25, 2021
d710a30
Update Documentation.md
dhesenkamp Oct 25, 2021
081795c
Update Documentation.md
dhesenkamp Oct 26, 2021
762fe55
Corrections, code examples
dhesenkamp Oct 26, 2021
cbc45cc
Added documentation for lemmatization
dhesenkamp Oct 26, 2021
3d5debc
Lemmatization
dhesenkamp Oct 27, 2021
fa5696c
Fixed StopwordRemover()
dhesenkamp Oct 27, 2021
bd44893
Added all feature extraction steps
dhesenkamp Oct 27, 2021
bc01ffc
Merge pull request #16 from dhesenkamp/documentation-visualization
dhesenkamp Oct 27, 2021
860d200
Created ner.py
Yannik101010 Oct 28, 2021
e3595a9
Update .gitignore
dhesenkamp Oct 29, 2021
3c7aa56
Update .gitignore
dhesenkamp Oct 29, 2021
6e0af72
Untrack classifier.pickle file (too big)
dhesenkamp Oct 29, 2021
ef5e7a2
Updates to .gitignore - untracking of some previously tracked files
dhesenkamp Oct 29, 2021
be6f28b
Minor cleanup, documentation
dhesenkamp Oct 29, 2021
0c96227
Added weights arg to knn classifier
dhesenkamp Oct 29, 2021
56c5d3b
Added criterion for split to decision tree
dhesenkamp Oct 29, 2021
89505da
Added additional cl args for random forest + updated documentation
dhesenkamp Oct 29, 2021
c7f95ea
Removed standardization for random forest (not needed)
dhesenkamp Oct 29, 2021
cad897d
Update examples.py, ner.py, extract_feature.py, feature_extraction.sh
Yannik101010 Oct 30, 2021
5df0545
Fine tuning for NER() feature extractor
dhesenkamp Oct 30, 2021
a29e5d1
Revert "Fine tuning for NER() feature extractor"
dhesenkamp Oct 30, 2021
d48bb55
Fine tuning NER() - manually resolving merge conflict
dhesenkamp Oct 30, 2021
3f94e33
manually resolve merge conflict
dhesenkamp Oct 30, 2021
6a76e57
Merge pull request #17 from dhesenkamp/ner
dhesenkamp Oct 30, 2021
7456d26
Documentation + minor cleanup
dhesenkamp Oct 30, 2021
857007d
Added MLP classifier
dhesenkamp Oct 30, 2021
f8970ae
Merge pull request #18 from dhesenkamp/classifier_mlp
dhesenkamp Oct 30, 2021
01c18f3
Added Gaussian NB classifier
dhesenkamp Oct 30, 2021
b4517de
Changed from Gaussian to Complement NB
dhesenkamp Oct 30, 2021
69b68b8
Updated SentimentAnalyzer to only return pos values
dhesenkamp Oct 30, 2021
5c0f26d
Merge pull request #19 from dhesenkamp/classifier_bayes
dhesenkamp Oct 30, 2021
6532381
Update: Clean Code
Yannik101010 Oct 30, 2021
45ed17b
Updated documentation
dhesenkamp Oct 30, 2021
87025a5
Update README.md
Yannik101010 Oct 30, 2021
0b69335
Merge branch 'Readme' into main1
Yannik101010 Oct 30, 2021
b150f53
Added evaluation section to documentation
dhesenkamp Oct 31, 2021
68e1101
Updated classifier to work for param optimization
dhesenkamp Oct 31, 2021
ecd08a0
Hyperparameter optimization script
dhesenkamp Oct 31, 2021
c3c6665
Update Documentation.md
dhesenkamp Oct 31, 2021
605bb61
Summary plots for evaluation metrics
dhesenkamp Oct 31, 2021
f572b02
Added plots for visualization of results to documentation
dhesenkamp Oct 31, 2021
ef63032
Added more plots with summary stats
dhesenkamp Oct 31, 2021
a325be9
Merge pull request #20 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
eb22e87
Documentation + visuals
dhesenkamp Oct 31, 2021
2a5d349
Merge pull request #21 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
f09d753
Added .py file for plots
dhesenkamp Oct 31, 2021
1ac4d25
Update Documentation.md
dhesenkamp Oct 31, 2021
8c94aba
Added tracking results from param optimization
dhesenkamp Oct 31, 2021
8cebc47
Added missing resources & citations
dhesenkamp Nov 1, 2021
22 changes: 20 additions & 2 deletions .gitignore
@@ -128,5 +128,23 @@ dmypy.json
# Pyre type checker
.pyre/

# exclude csv files from data directory
data/raw/
data/preprocessing/
data/classification/mlflow
data/feature_extraction/

# exclude OSX specific files
.DS_Store
**/.DS_Store

# exclude /mlruns subfolder
mlruns/

# classifier.pickle file gets too big after a certain number of features
data/classification/classifier.pickle
data/feature_extraction/validation.pickle
data/feature_extraction/test.pickle
data/feature_extraction/training.pickle
769 changes: 769 additions & 0 deletions Documentation.md

Large diffs are not rendered by default.

56 changes: 45 additions & 11 deletions README.md
@@ -7,7 +7,7 @@ As data source, we use the "Data Science Tweets 2010-2021" data set (version 3)

In order to install all necessary dependencies, please make sure that you have a local [Conda](https://docs.conda.io/en/latest/) distribution (e.g., Anaconda or miniconda) installed. Begin by creating a new environment called "MLinPractice" that has Python 3.6 installed:

```conda create -y -q --name MLinPractice python=3.6```

You can enter this environment with `conda activate MLinPractice` (or `source activate MLinPractice`, if the former does not work). You can leave it with `conda deactivate` (or `source deactivate`, if the former does not work). Enter the environment and execute the following commands in order to install the necessary dependencies (this may take a while):

@@ -18,6 +18,11 @@ conda install -y -q -c conda-forge nltk=3.6.3
conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
conda install -c conda-forge mlflow
conda install -c conda-forge vadersentiment
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm

```

You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -43,7 +48,7 @@ All python scripts and classes for the preprocessing of the input data can be fo
### Creating Labels

The script `create_labels.py` assigns labels to the raw data points based on a threshold on a linear combination of the number of likes and retweets. It is executed as follows:
```python -m code.preprocessing.create_labels path/to/input_dir path/to/output.csv```
Here, `input_dir` is the directory containing the original raw csv files, while `output.csv` is the single csv file where the output will be written.
The script takes the following optional parameters:
- `-l` or `--likes_weight` determines the relative weight of the number of likes a tweet has received. Defaults to 1.
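
The labeling rule itself boils down to a thresholded weighted sum. The following is a minimal sketch of that rule, not the actual `create_labels.py` code; the retweet weight and threshold defaults shown here are assumptions for illustration:

```python
# Hypothetical sketch of the labeling rule, not the actual create_labels.py
# code; the retweet weight and threshold defaults here are assumptions.
def create_label(likes: int, retweets: int,
                 likes_weight: float = 1.0,
                 retweets_weight: float = 1.0,
                 threshold: float = 50.0) -> bool:
    """A tweet is labeled as viral if the weighted sum of likes and
    retweets reaches the threshold."""
    return likes_weight * likes + retweets_weight * retweets >= threshold

print(create_label(likes=30, retweets=25))  # True: 1.0*30 + 1.0*25 >= 50
```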
@@ -53,19 +58,21 @@ The script takes the following optional parameters:
### Classical Preprocessing

The script `run_preprocessing.py` is used to run various preprocessing steps on the raw data, producing additional columns in the csv file. It is executed as follows:
```python -m code.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
Here, `input.csv` is a csv file (ideally the output of `create_labels.py`), while `output.csv` is the csv file where the output will be written.
The preprocessing steps to take can be configured with the following flags:
- `-p` or `--punctuation`: A new column "tweet_no_punctuation" is created, where all punctuation is removed from the original tweet. (See `code/preprocessing/punctuation_remover.py` for more details)
- `-t` or `--tokenize`: Tokenize the given column (can be specified by `--tokenize_input`, default = "tweet"), and create a new column with the suffix "_tokenized" containing the tokenized tweet.
- `-s` or `--stopwords`: Remove common stopwords from the given column (can be specified with `--stopwords_input`, default = "tweet"), and create a new column with the suffix "_stopwords_removed".
- `-l` or `--lemmatize`: Modify inflected or variant words into their base forms (= lemmas). The input column can be specified with `--lemmatize_input` (default = "tweet"); creates a new column with the suffix "_lemmatized".

Moreover, the script accepts the following optional parameters:
- `-e` or `--export` gives the path to a pickle file where an sklearn pipeline of the different preprocessing steps will be stored for later usage.
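
Since the exported pipeline is sklearn-based, each preprocessing step follows the usual transformer pattern. Below is a hypothetical sketch of a stopword-removal step, assuming the standard `BaseEstimator`/`TransformerMixin` pattern and NLTK's stopword list; it is not the repo's actual `StopwordRemover` code:

```python
# Hypothetical sklearn-compatible preprocessing step, NOT the repo's actual
# StopwordRemover; requires nltk.download("stopwords") once beforehand.
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin

class StopwordRemoverSketch(BaseEstimator, TransformerMixin):
    """Drops English stopwords from already tokenized tweets."""

    def __init__(self):
        self._stopwords = set(stopwords.words("english"))

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # X: iterable of token lists, one list per tweet
        return [[tok for tok in tokens if tok.lower() not in self._stopwords]
                for tokens in X]
```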

### Splitting the Data Set

The script `split_data.py` splits the overall preprocessed data into training, validation, and test set. It can be invoked as follows:
```python -m code.preprocessing.split_data path/to/input.csv path/to/output_dir```
Here, `input.csv` is the input csv file to split (containing a column "label" with the label information, i.e., `create_labels.py` needs to be run beforehand) and `output_dir` is the directory where three individual csv files `training.csv`, `validation.csv`, and `test.csv` will be stored.
The script takes the following optional parameters:
- `-t` or `--test_size` determines the relative size of the test set and defaults to 0.2 (i.e., 20 % of the data).
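
A minimal sketch of what such a split can look like, assuming sklearn's `train_test_split` (whether `split_data.py` actually uses this helper is an assumption; the 0.2 matches the documented default):

```python
# Minimal sketch of splitting off a 20 % test set (the documented -t default);
# whether split_data.py uses sklearn's helper is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"tweet": ["a", "b", "c", "d", "e"],
                   "label": [0, 1, 0, 1, 0]})
remainder, test = train_test_split(df, test_size=0.2)
# remainder would then be split once more into training and validation sets
```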
@@ -77,15 +84,27 @@ The script takes the following optional parameters:

All python scripts and classes for feature extraction can be found in `code/feature_extraction/`.

The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
```python -m code.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
Here, `input.csv` is the respective training, validation, or test set file created by `split_data.py`. The file `output.pickle` will be used to store the results of the feature extraction process, namely a dictionary with the following entries:
- `"features"`: a numpy array with the raw feature values (rows are training examples, colums are features)
- `"feature_names"`: a list of feature names for the columns of the numpy array
- `"labels"`: a numpy array containing the target labels for the feature vectors (rows are training examples, only column is the label)

The features to be extracted can be configured with the following optional parameters:
- `-c` or `--char_length`: Count the number of characters in the "tweet" column of the data frame (see `code/feature_extraction/character_length.py`).
- `-m` or `--month`: Extract the month in which the tweet was published from the "date" column of the data frame.
- `-s` or `--sentiment`: Extract a compound sentiment value from the original tweet using VADER.
- `-p` or `--photos`: Extract a binary flag for whether the tweet has photo(s) attached, from the "photo" column.
- `-@` or `--mention`: Extract a binary flag for whether the tweet author has mentioned someone, from the "mention" column.
- `-u` or `--url`: Extract a binary flag for whether a URL is attached to the tweet, from the "url" column.
- `-rt` or `--retweet`: Extract the number of retweets from the "retweet_count" column.
- `-re` or `--replies`: Extract the number of replies from the "replies_count" column.
- `-#` or `--hashtag`: Extract a binary flag for whether a hashtag is attached to the tweet, from the "hashtag" column.
- `-l` or `--likes`: Extract the number of likes of a tweet from the "likes_count" column.
- `-d` or `--daytime`: Extract the time of day at which the tweet was posted from the "time" column.
- `-n` or `--ner`: Count the number of named entities in a tweet.
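
To illustrate the `--sentiment` feature, the VADER compound score can be computed with the standard `vaderSentiment` package (installed above); the project's own wrapper class may differ in detail:

```python
# Standard vaderSentiment usage for the compound score mentioned above;
# the project's SentimentAnalyzer wrapper may differ in detail.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Data science tweets can go viral!")
print(scores["compound"])  # value in [-1, 1], here clearly positive
```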


Moreover, the script supports importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file, ignoring all parameters that configure the features to extract.
@@ -97,7 +116,7 @@ All python scripts and classes for dimensionality reduction can be found in `cod

The script `reduce_dimensionality.py` takes care of the overall dimensionality reduction procedure and can be invoked as follows:

```python -m code.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
Here, `input.pickle` is the respective training, validation, or test set file created by `extract_features.py`.
The file `output.pickle` will be used to store the results of the dimensionality reduction process, containing `"features"` (which are the selected/projected ones) and `"labels"` (same as in the input file).
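
As a sketch of what a selection step can look like (the concrete technique used by `reduce_dimensionality.py` is not shown in this excerpt, so `SelectKBest` with mutual information is an assumption):

```python
# Assumed illustration of a feature selection step with dummy data; the
# technique actually used by reduce_dimensionality.py is not shown here.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

features = np.random.rand(50, 5)            # stand-in for extracted features
labels = np.random.randint(0, 2, size=50)   # stand-in for the labels

selector = SelectKBest(mutual_info_classif, k=2)  # keep 2 most informative
selected = selector.fit_transform(features, labels)
```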

@@ -117,16 +136,31 @@ All python scripts and classes for classification can be found in `code/classifi
### Train and Evaluate a Single Classifier

The script `run_classifier.py` can be used to train and/or evaluate a given classifier. It can be executed as follows:
```python -m code.classification.run_classifier path/to/input.pickle```
Here, `input.pickle` is a pickle file of the respective data subset, produced by either `extract_features.py` or `reduce_dimensionality.py`.

By default, this data is used to train a classifier, which is specified by one of the following optional arguments:
- `-m` or `--majority`: Majority vote classifier that always predicts the majority class.
- `-f` or `--frequency`: Dummy classifier that makes predictions based on the label frequency in the training data.
- `-u` or `--uniform`: Uniform (random) classifier that predicts each class with equal probability.
- `--knn`: k-nearest-neighbor classifier with the specified value of k, default = None.
- `--knn_weights`: Weight function for knn, optionally chosen as "uniform" or "distance", default = "uniform".
- `--tree`: Decision tree classifier with the specified value as max_depth, default = None.
- `--tree_criterion`: Criterion to measure split quality, "gini" or "entropy", default = "gini".
- `--svm`: Support vector machine with the specified kernel: "linear", "polynomial", "rbf", or "sigmoid", default = None.
- `--randforest`: Random forest classifier with the specified value as the number of trees in the forest, default = None.
- `--forest_criterion`: Criterion to measure split quality, "gini" or "entropy", default = "gini".
- `--forest_max_depth`: max_depth of the trees in the forest, default = None.
- `--mlp`: Multilayer perceptron classifier; the values give the hidden layer sizes (one value per layer), default = None.
- `--bayes`: Complement Naive Bayes classifier.
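
As a rough illustration of how such a flag maps onto an sklearn model (an assumed mapping with dummy data, not the repo's verbatim code), `--knn 5 --knn_weights distance` corresponds to something like:

```python
# Assumed sklearn equivalent of "--knn 5 --knn_weights distance"; the actual
# wiring in run_classifier.py may differ. Dummy data for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

features = np.random.rand(20, 3)
labels = np.random.randint(0, 2, size=20)

classifier = KNeighborsClassifier(n_neighbors=5, weights="distance")
classifier.fit(features, labels)
predictions = classifier.predict(features)
```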



The classifier is then evaluated using the evaluation metrics specified through the following optional arguments:
- `-a` or `--accuracy`: Classification accuracy (i.e., the percentage of correctly classified examples).
- `-k` or `--kappa`: Cohen's kappa (i.e., accuracy adjusted for the probability of random agreement).
- `-f1` or `--f1_score`: F1-score, calculated from precision and recall.
- `-ba` or `--balanced_accuracy`: Balanced classification accuracy.
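
These flags correspond to standard sklearn metric functions, roughly as sketched below (whether `run_classifier.py` calls exactly these is an assumption):

```python
# Standard sklearn metrics matching the flags above; dummy labels for illustration.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(accuracy_score(y_true, y_pred))           # -a / --accuracy
print(cohen_kappa_score(y_true, y_pred))        # -k / --kappa
print(f1_score(y_true, y_pred))                 # -f1 / --f1_score
print(balanced_accuracy_score(y_true, y_pred))  # -ba / --balanced_accuracy
```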


Moreover, the script supports importing and exporting trained classifiers with the following optional arguments:
Expand All @@ -142,5 +176,5 @@ All python code for the application demo can be found in `code/application/`.

The script `application.py` provides a simple command line interface, where the user is asked to type in their prospective tweet, which is then analyzed using the trained ML pipeline.
The script can be invoked as follows:
```python -m code.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.
6 changes: 3 additions & 3 deletions code/application/application.py
@@ -13,6 +13,7 @@
from sklearn.pipeline import make_pipeline
from code.util import COLUMN_TWEET


# setting up CLI
parser = argparse.ArgumentParser(description = "Application")
parser.add_argument("preprocessing_file", help = "path to the pickle file containing the preprocessing")
@@ -29,7 +30,7 @@
with open(args.dim_red_file, 'rb') as f_in:
    dimensionality_reduction = pickle.load(f_in)
with open(args.classifier_file, 'rb') as f_in:
    classifier = pickle.load(f_in)["classifier"]

# chain them together into a single pipeline
pipeline = make_pipeline(preprocessing, feature_extraction, dimensionality_reduction, classifier)
@@ -56,5 +57,4 @@
confidence = pipeline.predict_proba(df)

print("Prediction: {0}, Confidence: {1}".format(prediction, confidence))
print("")

print("")
12 changes: 9 additions & 3 deletions code/classification.sh
@@ -5,10 +5,16 @@ mkdir -p data/classification/

# train and evaluate the classifier on the training set (exports the fitted classifier)
echo " training set"
#python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm 'rbf' -s 42 -a -k -f1 -ba
python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --bayes -s 42 -a -k -f1 -ba


# evaluate the classifier on the validation set (with the pre-trained classifier)
echo " validation set"
#python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba


# finally, run the generalization experiment on the test set
echo " test set"
python -m code.classification.run_classifier data/feature_extraction/test.pickle -i data/classification/classifier.pickle -a -k -f1 -ba