
h #12

Open · wants to merge 145 commits into base: main

Changes from all commits · 145 commits
aa7eb11
Update .gitignore to exclude OSX specific files
dhesenkamp Oct 4, 2021
f6e4072
Merge remote-tracking branch 'upstream/main' into main
dhesenkamp Oct 5, 2021
00d49da
Merge branch 'lbechberger:main' into main
imartirosov Oct 5, 2021
beee080
Added uniform classifier
dhesenkamp Oct 5, 2021
d806664
Added F1 score evaluation metric
dhesenkamp Oct 5, 2021
8c1addb
Merge pull request #1 from dhesenkamp/classifier
dhesenkamp Oct 6, 2021
5d35082
Added tweet tokenization
dhesenkamp Oct 6, 2021
4b43523
Update preprocessing.sh
dhesenkamp Oct 6, 2021
5f186cc
Create Documentation.md
dhesenkamp Oct 6, 2021
537acf7
Merge pull request #2 from dhesenkamp/tokenizer
dhesenkamp Oct 6, 2021
ec5eeeb
Modified punctuation_remover.py
dhesenkamp Oct 7, 2021
841745f
Merge pull request #4 from dhesenkamp/tokenizer
dhesenkamp Oct 7, 2021
473af68
Revert "Merge pull request #1 from dhesenkamp/classifier"
dhesenkamp Oct 7, 2021
f7e9e15
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
7dfd46a
Resolve merge conflict
dhesenkamp Oct 7, 2021
2f6b559
Resolve merge conflict
dhesenkamp Oct 7, 2021
b92fea9
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 7, 2021
a377bcc
Update Documentation.md
dhesenkamp Oct 7, 2021
189843b
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 7, 2021
69fbf07
Testing of tokenize_input
dhesenkamp Oct 7, 2021
5d4f975
Added stopword remover
dhesenkamp Oct 7, 2021
d577dc3
Refined stopword remover
dhesenkamp Oct 7, 2021
50c422f
Update stopword_remover.py
dhesenkamp Oct 7, 2021
126812d
Further refining of stopword remover
dhesenkamp Oct 8, 2021
e8d5b86
StopwordRemover(), minor changes
dhesenkamp Oct 8, 2021
45e049b
Short info on Cohen's kappa
dhesenkamp Oct 8, 2021
f8f9ef3
Merge pull request #6 from dhesenkamp/stop_word_removal
dhesenkamp Oct 8, 2021
f83dc31
Added Lemmatizer() class
dhesenkamp Oct 8, 2021
ca27622
Added command line arguments etc
dhesenkamp Oct 8, 2021
2e25f66
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 8, 2021
f46090b
Merge pull request #7 from dhesenkamp/lemmatizer
dhesenkamp Oct 11, 2021
4a4f00f
Merge branch 'lbechberger:main' into main
dhesenkamp Oct 11, 2021
0aa2c88
Update README.md
dhesenkamp Oct 11, 2021
71c4ef9
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 11, 2021
263e39d
Added feature extraction for month
dhesenkamp Oct 12, 2021
84dafa0
Command line args for month extractor
dhesenkamp Oct 12, 2021
1e849de
Trying to resolve merge conflict manually
dhesenkamp Oct 12, 2021
179d236
Merge pull request #8 from dhesenkamp/feature_month
dhesenkamp Oct 12, 2021
04c515a
Merge conflict, readme, documentation
dhesenkamp Oct 12, 2021
c794c7a
Added SentimentAnalyzer class
dhesenkamp Oct 13, 2021
f896c03
SentimentAnalyzer() command line args + script
dhesenkamp Oct 13, 2021
41e6385
readme and documentation for SentimentAnalyzer
dhesenkamp Oct 13, 2021
4fc38ec
Merge feature_sentiment into main
dhesenkamp Oct 13, 2021
aa41d1b
Update readme.md wrt SentimentAnalyser
dhesenkamp Oct 13, 2021
18fd5df
mlflow added to README.md
dhesenkamp Oct 13, 2021
7c59805
Added decision tree classifier
dhesenkamp Oct 13, 2021
5c67318
Param optimization
dhesenkamp Oct 14, 2021
98e646f
Update .gitignore
dhesenkamp Oct 14, 2021
4704f20
Removed .DS_Store
dhesenkamp Oct 14, 2021
f99d6bf
Classifier testing
dhesenkamp Oct 15, 2021
259d0dd
Update .gitignore
dhesenkamp Oct 15, 2021
edd8a82
Added SVM classifier
dhesenkamp Oct 19, 2021
02e9982
Update .gitignore to exlcude mlruns subfolder
dhesenkamp Oct 19, 2021
a8e4099
SVM classifier testing
dhesenkamp Oct 20, 2021
f670e81
Merge pull request for classifier_svm
dhesenkamp Oct 20, 2021
88ffa97
Implemented Photos feature extractor
dhesenkamp Oct 21, 2021
0c6a790
Command line arguments for Photos feature extractor
dhesenkamp Oct 21, 2021
17181d9
Troubleshooting & testing
dhesenkamp Oct 21, 2021
9e61fc4
Testing complete for Photos() feature extractor
dhesenkamp Oct 21, 2021
b1c2e6d
Merge pull request feature_photos
dhesenkamp Oct 21, 2021
065092e
Created & implemented Mentions() feature extractor
dhesenkamp Oct 21, 2021
bfc71e4
Command line args for Mentions() feature extractor
dhesenkamp Oct 21, 2021
bb41b84
Mentions() feature extractor testing
dhesenkamp Oct 21, 2021
690c621
Merge pull request feature_mention
dhesenkamp Oct 21, 2021
bd37301
Merge conflict - manual resolve
dhesenkamp Oct 21, 2021
92fca4b
Update feature_extraction.sh
dhesenkamp Oct 21, 2021
fe60d7e
Manually resolved merge conflict of previous pull request from featur…
dhesenkamp Oct 21, 2021
4aa460e
URL() feature extractor testing
dhesenkamp Oct 21, 2021
31d794c
Variable renaming
dhesenkamp Oct 21, 2021
da54a04
Feature extraction script testing
dhesenkamp Oct 21, 2021
3badd5b
Update stopword_remover.py
dhesenkamp Oct 21, 2021
1b912ba
Updated svm classifier
dhesenkamp Oct 21, 2021
3c61ef7
Pipeline testing
dhesenkamp Oct 21, 2021
2083d71
Created retweets.py, Update extract_features.py
Yannik101010 Oct 21, 2021
f984812
Update examples.py, feature_extraction.py, feature_extraction.sh
Yannik101010 Oct 21, 2021
6890fdb
Create replies.py
Yannik101010 Oct 21, 2021
900e9fb
Update util.py, feature_extraction.py feature_extraction.sh
Yannik101010 Oct 21, 2021
c7e7d1d
Update extract_features.py, replies.py
Yannik101010 Oct 21, 2021
45b5395
Created hastags.py; Update util.py, extract_feature.py, extract_featu…
Yannik101010 Oct 22, 2021
b2198bc
Update classification.sh
dhesenkamp Oct 22, 2021
db5c2f8
Merge branch 'main' of https://github.com/dhesenkamp/MLinPractice
dhesenkamp Oct 22, 2021
160f08a
Merge conflict - random forest classifier
dhesenkamp Oct 22, 2021
a132258
Merge conflict Likes() feature extractor
dhesenkamp Oct 22, 2021
35a70ae
Pipeline testing
dhesenkamp Oct 22, 2021
cac72e5
Create daytime.py
Yannik101010 Oct 23, 2021
eea0bbd
Update daytime.py, extract_feature.sh, extract_feature.py, util.py, e…
Yannik101010 Oct 23, 2021
ed23f5c
Update run_classifier.py, classification.sh
Yannik101010 Oct 23, 2021
38db958
Daytime() feature extractor (added one-hot)
dhesenkamp Oct 24, 2021
2ea07f8
Update classification.sh
dhesenkamp Oct 24, 2021
62da2ee
Merge conflict
dhesenkamp Oct 24, 2021
ff83062
manual merge
dhesenkamp Oct 24, 2021
a455a0f
Merge pull request #15 from dhesenkamp/feature_daytime
dhesenkamp Oct 24, 2021
40fe057
Update daytime.py
Yannik101010 Oct 24, 2021
ce10d4b
Update daytime.py
Yannik101010 Oct 25, 2021
cdf5305
Update .gitignore
dhesenkamp Oct 25, 2021
a6eaaf4
Update .gitignore
dhesenkamp Oct 25, 2021
d710a30
Update Documentation.md
dhesenkamp Oct 25, 2021
081795c
Update Documentation.md
dhesenkamp Oct 26, 2021
762fe55
Corrections, code examples
dhesenkamp Oct 26, 2021
cbc45cc
Added documentation for lemmatization
dhesenkamp Oct 26, 2021
3d5debc
Lemmatization
dhesenkamp Oct 27, 2021
fa5696c
Fixed StopwordRemover()
dhesenkamp Oct 27, 2021
bd44893
Added all feature extraction steps
dhesenkamp Oct 27, 2021
bc01ffc
Merge pull request #16 from dhesenkamp/documentation-visualization
dhesenkamp Oct 27, 2021
860d200
Created ner.py
Yannik101010 Oct 28, 2021
e3595a9
Update .gitignore
dhesenkamp Oct 29, 2021
3c7aa56
Update .gitignore
dhesenkamp Oct 29, 2021
6e0af72
Untrack classifier.pickle file (too big)
dhesenkamp Oct 29, 2021
ef5e7a2
Updates to .gitignore - untracking of some previously tracked files
dhesenkamp Oct 29, 2021
be6f28b
Minor cleanup, documentation
dhesenkamp Oct 29, 2021
0c96227
Added weights arg to knn classifier
dhesenkamp Oct 29, 2021
56c5d3b
Added criterion for split to decision tree
dhesenkamp Oct 29, 2021
89505da
Added additional cl args for random forest + updated documentation
dhesenkamp Oct 29, 2021
c7f95ea
Removed standardization for random forest (not needed)
dhesenkamp Oct 29, 2021
cad897d
Update examples.py, ner.py, extract_feature.py, feature_extraction.sh
Yannik101010 Oct 30, 2021
5df0545
Fine tuning for NER() feature extractor
dhesenkamp Oct 30, 2021
a29e5d1
Revert "Fine tuning for NER() feature extractor"
dhesenkamp Oct 30, 2021
d48bb55
Fine tuning NER() - manually resolving merge conflict
dhesenkamp Oct 30, 2021
3f94e33
manually resolve merge conflict
dhesenkamp Oct 30, 2021
6a76e57
Merge pull request #17 from dhesenkamp/ner
dhesenkamp Oct 30, 2021
7456d26
Documentation + minor cleanup
dhesenkamp Oct 30, 2021
857007d
Added MLP classifier
dhesenkamp Oct 30, 2021
f8970ae
Merge pull request #18 from dhesenkamp/classifier_mlp
dhesenkamp Oct 30, 2021
01c18f3
Added Gaussian NB classifier
dhesenkamp Oct 30, 2021
b4517de
Changed from Gaussian to Complement NB
dhesenkamp Oct 30, 2021
69b68b8
Updated SentimentAnalyzer to only return pos values
dhesenkamp Oct 30, 2021
5c0f26d
Merge pull request #19 from dhesenkamp/classifier_bayes
dhesenkamp Oct 30, 2021
6532381
Update: Clean Code
Yannik101010 Oct 30, 2021
45ed17b
Updated documentation
dhesenkamp Oct 30, 2021
87025a5
Update README.md
Yannik101010 Oct 30, 2021
0b69335
Merge branch 'Readme' into main1
Yannik101010 Oct 30, 2021
b150f53
Added evaluation section to documentation
dhesenkamp Oct 31, 2021
68e1101
Updated classifier to work for param optimization
dhesenkamp Oct 31, 2021
ecd08a0
Hyperparameter optimization script
dhesenkamp Oct 31, 2021
c3c6665
Update Documentation.md
dhesenkamp Oct 31, 2021
605bb61
Summary plots for evaluation metrics
dhesenkamp Oct 31, 2021
f572b02
Added plots for visualization of results to documentation
dhesenkamp Oct 31, 2021
ef63032
Added more plots with summary stats
dhesenkamp Oct 31, 2021
a325be9
Merge pull request #20 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
eb22e87
Documentation + visuals
dhesenkamp Oct 31, 2021
2a5d349
Merge pull request #21 from dhesenkamp/param_optimization
dhesenkamp Oct 31, 2021
f09d753
Added .py file for plots
dhesenkamp Oct 31, 2021
1ac4d25
Update Documentation.md
dhesenkamp Oct 31, 2021
8c94aba
Added tracking results from param optimization
dhesenkamp Oct 31, 2021
8cebc47
Added missing resources & citations
dhesenkamp Nov 1, 2021
22 changes: 20 additions & 2 deletions .gitignore
@@ -128,5 +128,23 @@ dmypy.json
# Pyre type checker
.pyre/

# exclude csv files from data directory
data/raw/
data/preprocessing/
data/classification/mlflow
data/feature_extraction/

# exclude OSX specific files
.DS_Store
**/.DS_Store

# exclude /mlruns subfolder
mlruns/

# classifier.pickle file gets too big after a certain number of features
data/classification/classifier.pickle
data/feature_extraction/validation.pickle
data/feature_extraction/test.pickle
data/feature_extraction/training.pickle
769 changes: 769 additions & 0 deletions Documentation.md

Large diffs are not rendered by default.

56 changes: 45 additions & 11 deletions README.md
@@ -7,7 +7,7 @@ As data source, we use the "Data Science Tweets 2010-2021" data set (version 3)

In order to install all necessary dependencies, please make sure that you have a local [Conda](https://docs.conda.io/en/latest/) distribution (e.g., Anaconda or miniconda) installed. Begin by creating a new environment called "MLinPractice" that has Python 3.6 installed:

```conda create -y -q --name MLinPractice python=3.6```

You can enter this environment with `conda activate MLinPractice` (or `source activate MLinPractice`, if the former does not work). You can leave it with `conda deactivate` (or `source deactivate`, if the former does not work). Enter the environment and execute the following commands in order to install the necessary dependencies (this may take a while):

@@ -18,6 +18,11 @@ conda install -y -q -c conda-forge nltk=3.6.3
conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
conda install -c conda-forge mlflow
conda install -c conda-forge vadersentiment
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm

```

You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -43,7 +48,7 @@ All python scripts and classes for the preprocessing of the input data can be fo
### Creating Labels

The script `create_labels.py` assigns labels to the raw data points based on a threshold on a linear combination of the number of likes and retweets. It is executed as follows:
```python -m code.preprocessing.create_labels path/to/input_dir path/to/output.csv```
Here, `input_dir` is the directory containing the original raw csv files, while `output.csv` is the single csv file where the output will be written.
The script takes the following optional parameters:
- `-l` or `--likes_weight` determines the relative weight of the number of likes a tweet has received. Defaults to 1.
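
The labeling rule itself boils down to a thresholded weighted sum. The following is a minimal sketch of that rule, not the actual `create_labels.py` code; the retweet weight and threshold defaults shown here are assumptions for illustration:

```python
# Hypothetical sketch of the labeling rule, not the actual create_labels.py
# code; the retweet weight and threshold defaults here are assumptions.
def create_label(likes: int, retweets: int,
                 likes_weight: float = 1.0,
                 retweets_weight: float = 1.0,
                 threshold: float = 50.0) -> bool:
    """A tweet is labeled as viral if the weighted sum of likes and
    retweets reaches the threshold."""
    return likes_weight * likes + retweets_weight * retweets >= threshold

print(create_label(likes=30, retweets=25))  # True: 1.0*30 + 1.0*25 >= 50
```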
@@ -53,19 +58,21 @@ The script takes the following optional parameters:
### Classical Preprocessing

The script `run_preprocessing.py` is used to run various preprocessing steps on the raw data, producing additional columns in the csv file. It is executed as follows:
```python -m code.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
Here, `input.csv` is a csv file (ideally the output of `create_labels.py`), while `output.csv` is the csv file where the output will be written.
The preprocessing steps to take can be configured with the following flags:
- `-p` or `--punctuation`: A new column "tweet_no_punctuation" is created, where all punctuation is removed from the original tweet. (See `code/preprocessing/punctuation_remover.py` for more details)
- `-t` or `--tokenize`: Tokenize the given column (can be specified by `--tokenize_input`, default = "tweet"), and create a new column with the suffix "_tokenized" containing the tokenized tweet.
- `-s` or `--stopwords`: Remove common stopwords from the given column (can be specified with `--stopwords_input`, default = "tweet"), and create a new column with the suffix "_stopwords_removed".
- `-l` or `--lemmatize`: Modify inflected or variant words into their base forms (= lemmas). The input column can be specified with `--lemmatize_input` (default = "tweet"); creates a new column with the suffix "_lemmatized".

Moreover, the script accepts the following optional parameters:
- `-e` or `--export` gives the path to a pickle file where an sklearn pipeline of the different preprocessing steps will be stored for later usage.
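
Since the exported pipeline is sklearn-based, each preprocessing step follows the usual transformer pattern. Below is a hypothetical sketch of a stopword-removal step, assuming the standard `BaseEstimator`/`TransformerMixin` pattern and NLTK's stopword list; it is not the repo's actual `StopwordRemover` code:

```python
# Hypothetical sklearn-compatible preprocessing step, NOT the repo's actual
# StopwordRemover; requires nltk.download("stopwords") once beforehand.
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin

class StopwordRemoverSketch(BaseEstimator, TransformerMixin):
    """Drops English stopwords from already tokenized tweets."""

    def __init__(self):
        self._stopwords = set(stopwords.words("english"))

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # X: iterable of token lists, one list per tweet
        return [[tok for tok in tokens if tok.lower() not in self._stopwords]
                for tokens in X]
```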

### Splitting the Data Set

The script `split_data.py` splits the overall preprocessed data into training, validation, and test set. It can be invoked as follows:
```python -m code.preprocessing.split_data path/to/input.csv path/to/output_dir```
Here, `input.csv` is the input csv file to split (containing a column "label" with the label information, i.e., `create_labels.py` needs to be run beforehand) and `output_dir` is the directory where three individual csv files `training.csv`, `validation.csv`, and `test.csv` will be stored.
The script takes the following optional parameters:
- `-t` or `--test_size` determines the relative size of the test set and defaults to 0.2 (i.e., 20 % of the data).
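
A minimal sketch of what such a split can look like, assuming sklearn's `train_test_split` (whether `split_data.py` actually uses this helper is an assumption; the 0.2 matches the documented default):

```python
# Minimal sketch of splitting off a 20 % test set (the documented -t default);
# whether split_data.py uses sklearn's helper is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"tweet": ["a", "b", "c", "d", "e"],
                   "label": [0, 1, 0, 1, 0]})
remainder, test = train_test_split(df, test_size=0.2)
# remainder would then be split once more into training and validation sets
```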
@@ -77,15 +84,27 @@ The script takes the following optional parameters:

All python scripts and classes for feature extraction can be found in `code/feature_extraction/`.

The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
```python -m code.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
Here, `input.csv` is the respective training, validation, or test set file created by `split_data.py`. The file `output.pickle` will be used to store the results of the feature extraction process, namely a dictionary with the following entries:
- `"features"`: a numpy array with the raw feature values (rows are training examples, colums are features)
- `"feature_names"`: a list of feature names for the columns of the numpy array
- `"labels"`: a numpy array containing the target labels for the feature vectors (rows are training examples, only column is the label)

The features to be extracted can be configured with the following optional parameters:
- `-c` or `--char_length`: Count the number of characters in the "tweet" column of the data frame (see `code/feature_extraction/character_length.py`).
- `-m` or `--month`: Extract the month in which the tweet was published from the "date" column of the data frame.
- `-s` or `--sentiment`: Extract a compound sentiment value from the original tweet using VADER.
- `-p` or `--photos`: Extract a binary flag for whether the tweet has photo(s) attached, from the "photo" column.
- `-@` or `--mention`: Extract a binary flag for whether the tweet author has mentioned someone, from the "mention" column.
- `-u` or `--url`: Extract a binary flag for whether a URL is attached to the tweet, from the "url" column.
- `-rt` or `--retweet`: Extract the number of retweets from the "retweet_count" column.
- `-re` or `--replies`: Extract the number of replies from the "replies_count" column.
- `-#` or `--hashtag`: Extract a binary flag for whether a hashtag is attached to the tweet, from the "hashtag" column.
- `-l` or `--likes`: Extract the number of likes of a tweet from the "likes_count" column.
- `-d` or `--daytime`: Extract the time of day at which the tweet was posted from the "time" column.
- `-n` or `--ner`: Count the number of named entities in a tweet.
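
To illustrate the `--sentiment` feature, the VADER compound score can be computed with the standard `vaderSentiment` package (installed above); the project's own wrapper class may differ in detail:

```python
# Standard vaderSentiment usage for the compound score mentioned above;
# the project's SentimentAnalyzer wrapper may differ in detail.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Data science tweets can go viral!")
print(scores["compound"])  # value in [-1, 1], here clearly positive
```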


Moreover, the script supports importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file, ignoring all parameters that configure the features to extract.
@@ -97,7 +116,7 @@ All python scripts and classes for dimensionality reduction can be found in `cod

The script `reduce_dimensionality.py` takes care of the overall dimensionality reduction procedure and can be invoked as follows:

```python -m code.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
Here, `input.pickle` is the respective training, validation, or test set file created by `extract_features.py`.
The file `output.pickle` will be used to store the results of the dimensionality reduction process, containing `"features"` (which are the selected/projected ones) and `"labels"` (same as in the input file).
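
As a sketch of what a selection step can look like (the concrete technique used by `reduce_dimensionality.py` is not shown in this excerpt, so `SelectKBest` with mutual information is an assumption):

```python
# Assumed illustration of a feature selection step with dummy data; the
# technique actually used by reduce_dimensionality.py is not shown here.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

features = np.random.rand(50, 5)            # stand-in for extracted features
labels = np.random.randint(0, 2, size=50)   # stand-in for the labels

selector = SelectKBest(mutual_info_classif, k=2)  # keep 2 most informative
selected = selector.fit_transform(features, labels)
```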

@@ -117,16 +136,31 @@ All python scripts and classes for classification can be found in `code/classifi
### Train and Evaluate a Single Classifier

The script `run_classifier.py` can be used to train and/or evaluate a given classifier. It can be executed as follows:
```python -m code.classification.run_classifier path/to/input.pickle```
Here, `input.pickle` is a pickle file of the respective data subset, produced by either `extract_features.py` or `reduce_dimensionality.py`.

By default, this data is used to train a classifier, which is specified by one of the following optional arguments:
- `-m` or `--majority`: Majority vote classifier that always predicts the majority class.
- `-f` or `--frequency`: Dummy classifier that makes predictions based on the label frequency in the training data.
- `-u` or `--uniform`: Uniform (random) classifier that predicts each class with equal probability.
- `--knn`: k-nearest-neighbor classifier with the specified value of k, default = None.
- `--knn_weights`: Weight function for knn, optionally chosen as "uniform" or "distance", default = "uniform".
- `--tree`: Decision tree classifier with the specified value as max_depth, default = None.
- `--tree_criterion`: Criterion to measure split quality, "gini" or "entropy", default = "gini".
- `--svm`: Support vector machine with the specified kernel: "linear", "polynomial", "rbf", or "sigmoid", default = None.
- `--randforest`: Random forest classifier with the specified value as the number of trees in the forest, default = None.
- `--forest_criterion`: Criterion to measure split quality, "gini" or "entropy", default = "gini".
- `--forest_max_depth`: max_depth of the trees in the forest, default = None.
- `--mlp`: Multilayer perceptron classifier; the values give the hidden layer sizes (one value per layer), default = None.
- `--bayes`: Complement Naive Bayes classifier.
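
As a rough illustration of how such a flag maps onto an sklearn model (an assumed mapping with dummy data, not the repo's verbatim code), `--knn 5 --knn_weights distance` corresponds to something like:

```python
# Assumed sklearn equivalent of "--knn 5 --knn_weights distance"; the actual
# wiring in run_classifier.py may differ. Dummy data for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

features = np.random.rand(20, 3)
labels = np.random.randint(0, 2, size=20)

classifier = KNeighborsClassifier(n_neighbors=5, weights="distance")
classifier.fit(features, labels)
predictions = classifier.predict(features)
```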



The classifier is then evaluated using the evaluation metrics specified through the following optional arguments:
- `-a` or `--accuracy`: Classification accuracy (i.e., the percentage of correctly classified examples).
- `-k` or `--kappa`: Cohen's kappa (i.e., accuracy adjusted for the probability of random agreement).
- `-f1` or `--f1_score`: F1-score, calculated from precision and recall.
- `-ba` or `--balanced_accuracy`: Balanced classification accuracy.
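
These flags correspond to standard sklearn metric functions, roughly as sketched below (whether `run_classifier.py` calls exactly these is an assumption):

```python
# Standard sklearn metrics matching the flags above; dummy labels for illustration.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(accuracy_score(y_true, y_pred))           # -a / --accuracy
print(cohen_kappa_score(y_true, y_pred))        # -k / --kappa
print(f1_score(y_true, y_pred))                 # -f1 / --f1_score
print(balanced_accuracy_score(y_true, y_pred))  # -ba / --balanced_accuracy
```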


Moreover, the script supports importing and exporting trained classifiers with the following optional arguments:
Expand All @@ -142,5 +176,5 @@ All python code for the application demo can be found in `code/application/`.

The script `application.py` provides a simple command line interface, where the user is asked to type in their prospective tweet, which is then analyzed using the trained ML pipeline.
The script can be invoked as follows:
```python -m code.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.
6 changes: 3 additions & 3 deletions code/application/application.py
@@ -13,6 +13,7 @@
from sklearn.pipeline import make_pipeline
from code.util import COLUMN_TWEET


# setting up CLI
parser = argparse.ArgumentParser(description = "Application")
parser.add_argument("preprocessing_file", help = "path to the pickle file containing the preprocessing")
@@ -29,7 +30,7 @@
with open(args.dim_red_file, 'rb') as f_in:
    dimensionality_reduction = pickle.load(f_in)
with open(args.classifier_file, 'rb') as f_in:
    classifier = pickle.load(f_in)["classifier"]

# chain them together into a single pipeline
pipeline = make_pipeline(preprocessing, feature_extraction, dimensionality_reduction, classifier)
@@ -56,5 +57,4 @@
confidence = pipeline.predict_proba(df)

print("Prediction: {0}, Confidence: {1}".format(prediction, confidence))
print("")

print("")
12 changes: 9 additions & 3 deletions code/classification.sh
@@ -5,10 +5,16 @@ mkdir -p data/classification/

# train and evaluate the classifier on the training set (exports the fitted classifier)
echo " training set"
#python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm 'rbf' -s 42 -a -k -f1 -ba
python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --bayes -s 42 -a -k -f1 -ba


# evaluate the classifier on the validation set (with the pre-trained classifier)
echo " validation set"
#python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba


# finally, run the generalization experiment on the test set
echo " test set"
python -m code.classification.run_classifier data/feature_extraction/test.pickle -i data/classification/classifier.pickle -a -k -f1 -ba