diff --git a/Documentation.md b/Documentation.md
index de0a5b0f..f1db8209 100644
--- a/Documentation.md
+++ b/Documentation.md
@@ -1,59 +1,89 @@
 # Documentation Example
-Some introductory sentence(s). Data set and task are relatively fixed, so
+This is the forked repository of Magnus Müller, Maximilian Kalcher and Samuel Hagemann.
+
+Our task was to build and document a real-life machine learning application.
+We were given a dataset of N tweets from the years X until Y and had to build a classifier that detects whether a tweet will go viral.
+A tweet counts as viral when the sum of its likes and retweets is greater than 50.
+
+The dataset is quite varied and offers many features to work with, which gave us the freedom to choose and experiment with them freely.
+
+In the end, our classifier is wrapped in an 'application', callable from the terminal, which reports how likely an input tweet is to go viral, using the dataset for training.
+
+//Some introductory sentence(s). Data set and task are relatively fixed, so
 probably you don't have much to say about them (unless you modifed them).
 If you haven't changed the application much, there's also not much to say about
 that. The following structure thus only covers preprocessing, feature extraction,
 dimensionality reduction, classification, and evaluation.
 
-## Evaluation
+## Preprocessing
+
+Before using the data, or some aspects of it, it is important to preprocess it so that our chosen features can be extracted smoothly.
+Many tweets contained various kinds of punctuation, ..., emojis, and some were even written in different languages.
 
 ### Design Decisions
 
-Which evaluation metrics did you use and why? 
-Which baselines did you use and why?
+After looking at the dataset closely, we chose to keep only the core words of each sentence, ...
+- remove stopwords like 'a' or 'is'
+- remove punctuation
+- use only English tweets
+- tokenize
 
 ### Results
 
-How do the baselines perform with respect to the evaluation metrics?
-
+Maybe show a short example of what your preprocessing does.
+Language summary:
+({'en': 282035, 'it': 4116, 'es': 3272, 'fr': 2781, 'de': 714, 'id': 523, 'nl': 480, 'pt': 364, 'ca': 275, 'ru': 204, 'th': 157, 'ar': 126, 'tl': 108, 'tr': 84, 'hr': 68, 'da': 66, 'ro': 60, 'ja': 58, 'sv': 42, 'et': 29, 'pl': 25, 'bg': 24, 'af': 23, 'no': 21, 'fi': 20, 'so': 16, 'ta': 16, 'hi': 11, 'mk': 11, 'he': 9, 'sw': 9, 'lt': 7, 'uk': 6, 'sl': 6, 'te': 5, 'zh-cn': 5, 'lv': 5, 'ko': 5, 'bn': 4, 'el': 4, 'fa': 3, 'vi': 2, 'mr': 2, 'ml': 2, 'hu': 2, 'kn': 1, 'cs': 1, 'gu': 1, 'sk': 1, 'ur': 1, 'sq': 1})
+Total:
+295811
+English tweets make up 95% of the data, so we can drop (and maybe later translate) the remaining 5%, which would otherwise add noise.
+
+Total length of all tweets:
+- before preprocessing: 52686072
+- after preprocessing (English only, punctuation and stopwords removed): 39666607
+39666607/52686072 = 0.75
 
 ### Interpretation
 
-Is there anything we can learn from these results?
-
-## Preprocessing
+Probably, no real interpretation possible, so feel free to leave this section out.
 
-I'm following the "Design Decisions - Results - Interpretation" structure here,
-but you can also just use one subheading per preprocessing step to organize
-things (depending on what you do, that may be better structured).
+## Evaluation
 
 ### Design Decisions
 
-Which kind of preprocessing steps did you implement? Why are they necessary
-and/or useful down the road?
+Which evaluation metrics did you use and why? 
+Which baselines did you use and why?
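For context on these two questions: the classification script in this diff evaluates with sklearn's accuracy and Cohen's kappa and sets up `DummyClassifier` baselines. A minimal, self-contained sketch (illustrative random labels, not actual project results) of why kappa is worth reporting next to accuracy on imbalanced viral/non-viral labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# illustrative, imbalanced labels: roughly 10% "viral" (1), 90% "not viral" (0)
rng = np.random.default_rng(42)
y = (rng.random(1000) < 0.1).astype(int)
X = rng.random((1000, 5))  # dummy features, only needed to satisfy the sklearn API

# majority-vote baseline: always predicts the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

print("accuracy:", accuracy_score(y, pred))          # ~0.9, looks deceptively good
print("Cohen's kappa:", cohen_kappa_score(y, pred))  # 0.0, no skill beyond chance
```

A majority-vote baseline already reaches high accuracy on such skewed labels while its kappa stays at 0.0, which is why the documentation above tracks improvements in kappa (0.0 to 0.1 to 0.3) rather than accuracy alone.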
 ### Results
 
-Maybe show a short example what your preprocessing does.
+How do the baselines perform with respect to the evaluation metrics?
 
 ### Interpretation
 
-Probably, no real interpretation possible, so feel free to leave this section out.
+Is there anything we can learn from these results?
 
 ## Feature Extraction
 
-Again, either structure among decision-result-interpretation or based on feature,
-up to you.
+Again, structure this either by decision-result-interpretation or by feature;
+it's up to you.
+
+
 ### Design Decisions
 
 Which features did you implement? What's their motivation and how are they computed?
+We wanted to try something that was not covered in the lecture, so we used the HashingVectorizer from sklearn to create an individual hash vector for each tweet. For a sentence like 'I love Machine Learning', the output can look like [0.4, 0.3, 0.9, 0, 0.21], where the length n is the number of features. It is not very intuitive why this works, but after a long time of version conflicts and other problems, we enjoyed the simplicity of using sklearn.
+
+Usage: `--hash_vec`
+To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.
 
 ### Results
 
 Can you say something about how the feature values are distributed? Maybe show some plots?
+When we finally ran it successfully with 25 features, we first tried the SVM classifier, but that took far too long (nearly endless). We therefore switched to KNN with k=4 on a subset of 20000 samples; for the first time our Cohen's kappa rose from 0.0 to 0.1, and after some tuning (using more data) to 0.3.
+
+
 ### Interpretation
 
 Can we already guess which features may be more useful than others?
@@ -78,12 +108,13 @@ Can we somehow make sense of the dimensionality reduction results?
 Which features are the most important ones and why may that be the case?
 
 ## Classification
-
+First of all, we added a new argument, `--small 1000`, which uses only the first 1000 tweets.
 ### Design Decisions
 
 Which classifier(s) did you use? Which hyperparameter(s) (with their respective
 candidate values) did you look at? What were your reasons for this?
+- SVM
 
 ### Results
 
 The big finale begins: What are the evaluation results you obtained with your
@@ -94,4 +125,4 @@ selected setup: How well does it generalize to the test set?
 Which hyperparameter settings are how important for the results?
 How good are we? Can this be used in practice or are we still too bad?
-Anything else we may have learned?
\ No newline at end of file
+Anything else we may have learned?
diff --git a/README.md b/README.md
index f1c12d81..24964e9a 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,8 @@ conda install -y -q -c conda-forge gensim=4.1.2
 conda install -y -q -c conda-forge spyder=5.1.5
 conda install -y -q -c conda-forge pandas=1.1.5
 conda install -y -q -c conda-forge mlflow=1.20.2
+conda install -y -q -c conda-forge spacy
+conda install -c conda-forge langdetect
 ```
 
 You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
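Since `langdetect` is added here without a version pin, a quick hedged sanity check that the package behaves the way the preprocessing code assumes (returning language codes such as 'en'; the example sentences are made up):

```python
from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; seeding it gives stable results
DetectorFactory.seed = 0

print(detect("This tweet is written in English and should be kept."))  # expected: 'en'
print(detect("Dieser Tweet ist auf Deutsch und wuerde verworfen."))    # expected: 'de'
```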
@@ -91,6 +93,8 @@ The features to be extracted can be configured with the following optional param
 Moreover, the script support importing and exporting fitted feature extractors with the following optional arguments:
 - `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract.
 - `-e` or `--export_file`: Export the configured and fitted feature extraction into the given pickle file.
+- `--hash_vec`: use the HashingVectorizer from sklearn. To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.
 
 ## Dimensionality Reduction
 
@@ -128,7 +132,7 @@ By default, this data is used to train a classifier, which is specified by one o
 The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments:
 - `-a`or `--accuracy`: Classification accurracy (i.e., percentage of correctly classified examples).
 - `-k`or `--kappa`: Cohen's kappa (i.e., adjusting accuracy for probability of random agreement).
-
+- `--small 1000`: use only the first 1000 tweets instead of the full data set.
 Moreover, the script support importing and exporting trained classifiers with the following optional arguments:
 - `-i` or `--import_file`: Load a trained classifier from the given pickle file. Ignore all parameters that configure the classifier to use and don't retrain the classifier.
diff --git a/code/classification.sh b/code/classification.sh
index ceb7ac18..b2880091 100755
--- a/code/classification.sh
+++ b/code/classification.sh
@@ -5,10 +5,9 @@ mkdir -p data/classification/
 
 # run feature extraction on training set (may need to fit extractors)
 echo " training set"
-python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --knn 5 -s 42 --accuracy --kappa
-
+python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm --knn 4 --accuracy --kappa
 # run feature extraction on validation set (with pre-fit extractors)
 echo " validation set"
 python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa
-# don't touch the test set, yet, because that would ruin the final generalization experiment!
\ No newline at end of file
+# don't touch the test set, yet, because that would ruin the final generalization experiment!
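Note that the training call above now passes both `--svm` and `--knn 4`; in `run_classifier.py` (next diff) the `elif` chain checks `--svm` before `--knn`, so the model actually trained is the scaled SVC pipeline. A minimal sketch of that setup with placeholder arrays (`X_train`/`y_train` are random stand-ins, not the pickled project data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# placeholder data standing in for the pickled feature/label arrays
X_train = np.random.rand(200, 25)
y_train = np.random.randint(0, 2, size=200)

# same construction as the --svm branch: scale the features, then fit an SVC
classifier = make_pipeline(StandardScaler(), SVC(probability=True))
classifier.fit(X_train, y_train.ravel())
print(classifier.predict(X_train[:5]))
```

This also matches the runtime observation in the documentation: SVC training cost grows faster than linearly with the number of samples, so running it on the full training set was impractical compared to KNN.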
diff --git a/code/classification/run_classifier.py b/code/classification/run_classifier.py index b9d55245..823fd37c 100644 --- a/code/classification/run_classifier.py +++ b/code/classification/run_classifier.py @@ -11,9 +11,11 @@ import argparse, pickle from sklearn.dummy import DummyClassifier from sklearn.metrics import accuracy_score, cohen_kappa_score -from sklearn.preprocessing import StandardScaler +from sklearn.svm import SVC from sklearn.neighbors import KNeighborsClassifier from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler + # setting up CLI parser = argparse.ArgumentParser(description = "Classifier") @@ -23,11 +25,14 @@ parser.add_argument("-i", "--import_file", help = "import a trained classifier from the given location", default = None) parser.add_argument("-m", "--majority", action = "store_true", help = "majority class classifier") parser.add_argument("-f", "--frequency", action = "store_true", help = "label frequency classifier") +parser.add_argument("-v", "--svm", action = "store_true", help = "SVM classifier") parser.add_argument("--knn", type = int, help = "k nearest neighbor classifier with the specified value of k", default = None) parser.add_argument("-a", "--accuracy", action = "store_true", help = "evaluate using accuracy") parser.add_argument("-k", "--kappa", action = "store_true", help = "evaluate using Cohen's kappa") -args = parser.parse_args() +parser.add_argument("--small", type = int, help = "not use all data but just subset", default = None) +args = parser.parse_args() +#args, unk = parser.parse_known_args() # load data with open(args.input_file, 'rb') as f_in: data = pickle.load(f_in) @@ -43,24 +48,37 @@ # majority vote classifier print(" majority vote classifier") classifier = DummyClassifier(strategy = "most_frequent", random_state = args.seed) - elif args.frequency: # label frequency classifier print(" label frequency classifier") classifier = DummyClassifier(strategy = "stratified", random_state = args.seed) - - + elif args.svm: + print(" SVM classifier") + classifier = make_pipeline(StandardScaler(), SVC(probability=True)) elif args.knn is not None: print(" {0} nearest neighbor classifier".format(args.knn)) standardizer = StandardScaler() knn_classifier = KNeighborsClassifier(args.knn) classifier = make_pipeline(standardizer, knn_classifier) - - classifier.fit(data["features"], data["labels"].ravel()) + + + +if args.small is not None: + # if limit is given + max_length = len(data['features']) + limit = min(args.small, max_length) + # go through data and limit it + for key, value in data.items(): + data[key] = value[:limit] + + +classifier.fit(data["features"], data["labels"].ravel()) # now classify the given data prediction = classifier.predict(data["features"]) + + # collect all evaluation metrics evaluation_metrics = [] if args.accuracy: @@ -75,4 +93,4 @@ # export the trained classifier if the user wants us to do so if args.export_file is not None: with open(args.export_file, 'wb') as f_out: - pickle.dump(classifier, f_out) \ No newline at end of file + pickle.dump(classifier, f_out) diff --git a/code/dimensionality_reduction/reduce_dimensionality.py b/code/dimensionality_reduction/reduce_dimensionality.py index d2b27419..7d4da260 100644 --- a/code/dimensionality_reduction/reduce_dimensionality.py +++ b/code/dimensionality_reduction/reduce_dimensionality.py @@ -40,6 +40,7 @@ if args.mutual_information is not None: # select K best based on Mutual Information dim_red = SelectKBest(mutual_info_classif, k = 
args.mutual_information) + dim_red.fit(features, labels.ravel()) # resulting feature names based on support given by SelectKBest @@ -64,6 +65,7 @@ def get_feature_names(kbest, names): # store the results output_data = {"features": reduced_features, "labels": labels} + with open(args.output_file, 'wb') as f_out: pickle.dump(output_data, f_out) diff --git a/code/feature_extraction/extract_features.py b/code/feature_extraction/extract_features.py index a3527acf..fae6d04d 100644 --- a/code/feature_extraction/extract_features.py +++ b/code/feature_extraction/extract_features.py @@ -12,8 +12,9 @@ import pandas as pd import numpy as np from code.feature_extraction.character_length import CharacterLength +from code.feature_extraction.hash_vector import HashVector from code.feature_extraction.feature_collector import FeatureCollector -from code.util import COLUMN_TWEET, COLUMN_LABEL +from code.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_PREPROCESS # setting up CLI @@ -23,6 +24,7 @@ parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None) parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None) parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet") +parser.add_argument("--hash_vec", action = "store_true", help = "compute the hash vector of the tweet") args = parser.parse_args() # load data @@ -40,13 +42,18 @@ if args.char_length: # character length of original tweet (without any changes) features.append(CharacterLength(COLUMN_TWEET)) - + if args.hash_vec: + # hash of original tweet (without any changes) + features.append(HashVector(COLUMN_TWEET)) + + # create overall FeatureCollector feature_collector = FeatureCollector(features) # fit it on the given data set (assumed to be training data) feature_collector.fit(df) + # apply the given FeatureCollector on the current data set # maps the pandas DataFrame to an numpy array @@ -59,6 +66,7 @@ # store the results results = {"features": feature_array, "labels": label_array, "feature_names": feature_collector.get_feature_names()} + with open(args.output_file, 'wb') as f_out: pickle.dump(results, f_out) diff --git a/code/feature_extraction/hash_vector.py b/code/feature_extraction/hash_vector.py new file mode 100644 index 00000000..852140be --- /dev/null +++ b/code/feature_extraction/hash_vector.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Simple feature that counts the number of characters in the given column. 
+ +Created on Wed Sep 29 12:29:25 2021 + +@author: lbechberger +""" + +import numpy as np +from code.feature_extraction.feature_extractor import FeatureExtractor +from sklearn.feature_extraction.text import HashingVectorizer + +from code.util import HASH_VECTOR_N_FEATURES + +# class for extracting the character-based length as a feature + + +class HashVector(FeatureExtractor): + + # constructor + def __init__(self, input_column): + super().__init__([input_column], "{0}_hashvector".format(input_column)) + + # don't need to fit, so don't overwrite _set_variables() + + # compute the word length based on the inputs + def _get_values(self, inputs): + # inputs is list of text documents + # create the transform + # pdb.set_trace() + vectorizer = HashingVectorizer(n_features=HASH_VECTOR_N_FEATURES, + strip_accents='ascii', stop_words='english', ngram_range=(2, 2)) + # encode document + vector = vectorizer.fit_transform(inputs[0]) + return vector.toarray() diff --git a/code/preprocessing.sh b/code/preprocessing.sh index 61f83ea6..b381f36e 100755 --- a/code/preprocessing.sh +++ b/code/preprocessing.sh @@ -1,19 +1,19 @@ #!/bin/bash # create directory if not yet existing -mkdir -p data/preprocessing/split/ +#mkdir -p data/preprocessing/split/ # install all NLTK models -python -m nltk.downloader all +#python -m nltk.downloader all # add labels -echo " creating labels" +echo -e "\n -> creating labels\n" python -m code.preprocessing.create_labels data/raw/ data/preprocessing/labeled.csv # other preprocessing (removing punctuation etc.) -echo " general preprocessing" -python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --tokenize -e data/preprocessing/pipeline.pickle +echo -e "\n -> general preprocessing\n" +python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --strings --tokenize --language en -e data/preprocessing/pipeline.pickle # split the data set -echo " splitting the data set" +echo -e "\n -> splitting the data set\n" python -m code.preprocessing.split_data data/preprocessing/preprocessed.csv data/preprocessing/split/ -s 42 \ No newline at end of file diff --git a/code/preprocessing/create_labels.py b/code/preprocessing/create_labels.py index 21b1748d..860a5fe0 100644 --- a/code/preprocessing/create_labels.py +++ b/code/preprocessing/create_labels.py @@ -28,7 +28,7 @@ # load all csv files dfs = [] for file_path in file_paths: - dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")) + dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False)) # join all data into a single DataFrame df = pd.concat(dfs) diff --git a/code/preprocessing/language_remover.py b/code/preprocessing/language_remover.py new file mode 100644 index 00000000..5466c1c1 --- /dev/null +++ b/code/preprocessing/language_remover.py @@ -0,0 +1,29 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + + +import string +from code.preprocessing.preprocessor import Preprocessor +from langdetect import detect +from code.util import COLUMN_TWEET, COLUMN_LANGUAGE + +class LanguageRemover(Preprocessor): + + # constructor + def __init__(self, input_column = COLUMN_TWEET, output_column = COLUMN_LANGUAGE): #, language_to_keep = 'en' + # input column "tweet", new output column + super().__init__([input_column], output_column) + #self.language_to_keep = language_to_keep + + # set internal variables based on input 
columns + #def _set_variables(self, inputs): + # store punctuation for later reference + #self._punctuation = "[{}]".format(string.punctuation) + #self.nlp = spacy.load('en') # 1 + #self.nlp.add_pipe(LanguageDetector(), name='language_detector', last=True) #2 + + + # get preprocessed column based on data frame and internal variables + def _get_values(self, inputs): + column = [detect(tweet) for tweet in inputs[0]] + return column \ No newline at end of file diff --git a/code/preprocessing/punctuation_remover.py b/code/preprocessing/punctuation_remover.py index 0f026b0e..b8e0258c 100644 --- a/code/preprocessing/punctuation_remover.py +++ b/code/preprocessing/punctuation_remover.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -Preprocessor that removes punctuation from the original tweet text. +Preprocessor that removes punctuation & digits from the original tweet text. Created on Wed Sep 29 09:45:56 2021 @@ -11,23 +11,29 @@ import string from code.preprocessing.preprocessor import Preprocessor from code.util import COLUMN_TWEET, COLUMN_PUNCTUATION +import pdb + +punct = set(string.punctuation).union(string.digits).union('—') +#print(str(''.join(punct))) # removes punctuation from the original tweet # inspired by https://stackoverflow.com/a/45600350 class PunctuationRemover(Preprocessor): # constructor - def __init__(self): - # input column "tweet", new output column - super().__init__([COLUMN_TWEET], COLUMN_PUNCTUATION) + def __init__(self, inputcol, outputcol): + # input column, new output column + super().__init__([inputcol], outputcol) # set internal variables based on input columns def _set_variables(self, inputs): # store punctuation for later reference - self._punctuation = "[{}]".format(string.punctuation) + self._punctuation = "[{}]".format(string.punctuation+string.digits+'’'+'—'+'”'+'➡️') # get preprocessed column based on data frame and internal variables def _get_values(self, inputs): # replace punctuation with empty string column = inputs[0].str.replace(self._punctuation, "") + #import pdb + #pdb.set_trace() return column \ No newline at end of file diff --git a/code/preprocessing/run_preprocessing.py b/code/preprocessing/run_preprocessing.py index 72130a30..78181775 100644 --- a/code/preprocessing/run_preprocessing.py +++ b/code/preprocessing/run_preprocessing.py @@ -10,35 +10,58 @@ import argparse, csv, pickle import pandas as pd +from tqdm import tqdm from sklearn.pipeline import make_pipeline from code.preprocessing.punctuation_remover import PunctuationRemover +from code.preprocessing.string_remover import StringRemover +from code.preprocessing.language_remover import LanguageRemover from code.preprocessing.tokenizer import Tokenizer -from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED +from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED, COLUMN_LANGUAGE # setting up CLI parser = argparse.ArgumentParser(description = "Various preprocessing steps") parser.add_argument("input_file", help = "path to the input csv file") parser.add_argument("output_file", help = "path to the output csv file") -parser.add_argument("-p", "--punctuation", action = "store_true", help = "remove punctuation") +parser.add_argument("-p", "--punctuation", action = "store_true", help = "remove punctuation and special characters") +parser.add_argument("-s", "--strings", action = "store_true", help = "remove stopwords, links and emojis") parser.add_argument("-t", "--tokenize", action = "store_true", help = "tokenize given column into individual words") 
-parser.add_argument("--tokenize_input", help = "input column to tokenize", default = COLUMN_TWEET) +#parser.add_argument("--tokenize_input", help = "input column to tokenize", default = 'output') parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None) +parser.add_argument("--language", help = "just use tweets with this language ", default = None) args = parser.parse_args() # load data -df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n") +df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n",low_memory=False) + +preprocess_col = 'preprocess_col' # collect all preprocessors preprocessors = [] if args.punctuation: - preprocessors.append(PunctuationRemover()) + preprocessors.append(PunctuationRemover("tweet", preprocess_col)) +if args.strings: + preprocessors.append(StringRemover(preprocess_col, preprocess_col)) if args.tokenize: - preprocessors.append(Tokenizer(args.tokenize_input, args.tokenize_input + SUFFIX_TOKENIZED)) + preprocessors.append(Tokenizer(preprocess_col, preprocess_col + SUFFIX_TOKENIZED)) + +# no need to detect languages, because it is already given +# if args.language is not None: +# preprocessors.append(LanguageRemover()) + +if args.language is not None: + # filter out one language + before = len(df) + df = df[df['language']==args.language] + after = len(df) + print("Filtered out: {0} (not 'en')".format(before-after)) + df.reset_index(drop=True, inplace=True) # call all preprocessing steps -for preprocessor in preprocessors: +for preprocessor in tqdm(preprocessors): df = preprocessor.fit_transform(df) +# drop useless line which makes problems with csv +del df['trans_dest\r'] # store the results df.to_csv(args.output_file, index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n") @@ -46,4 +69,7 @@ if args.export_file is not None: pipeline = make_pipeline(*preprocessors) with open(args.export_file, 'wb') as f_out: - pickle.dump(pipeline, f_out) \ No newline at end of file + pickle.dump(pipeline, f_out) + + + diff --git a/code/preprocessing/split_data.py b/code/preprocessing/split_data.py index 57bad668..88f0ff63 100644 --- a/code/preprocessing/split_data.py +++ b/code/preprocessing/split_data.py @@ -23,7 +23,7 @@ args = parser.parse_args() # load the data -df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n") +df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False) # split into (training & validation) and test set X, X_test = train_test_split(df, test_size = args.test_size, random_state = args.seed, shuffle = True, stratify = df[COLUMN_LABEL]) diff --git a/code/preprocessing/string_remover.py b/code/preprocessing/string_remover.py new file mode 100644 index 00000000..7d11d045 --- /dev/null +++ b/code/preprocessing/string_remover.py @@ -0,0 +1,45 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Preprocessor that removes punctuation from the original tweet text. 
+Created on Wed Sep 29 09:45:56 2021 +@author: lbechberger +""" +import string +from code.preprocessing.preprocessor import Preprocessor +from code.util import COLUMN_TWEET, COLUMN_PUNCTUATION +from nltk.corpus import stopwords +import pandas as pd + +STOPWORDS = set(stopwords.words('english')) + +# removes punctuation from the original tweet +# inspired by https://stackoverflow.com/a/45600350 +class StringRemover(Preprocessor): + + # constructor + def __init__(self, inputcol, outputcol): + # input column "tweet", new output column + super().__init__([inputcol], outputcol) + + # set internal variables based on input columns + #def _set_variables(self, inputs): + # store punctuation for later reference + # self._punctuation = "[{}]".format(string.punctuation) + + + # get preprocessed column based on data frame and internal variables + def _get_values(self, inputs): + column = inputs[0].str #.replace(self._punctuation, "") + + # replace stopwords with empty string + column = [' '.join([word for word in tweet if word.lower() not in STOPWORDS]) for tweet in column.split()] + column = pd.Series(column) + # replace links with empty string + column = [' '.join([word for word in tweet if word.startswith('https') is False]) for tweet in column.str.split()] + column = pd.Series(column) + # replace emojis with empty string + column = [' '.join([word for word in tweet if str(word.encode('unicode-escape').decode('ASCII')).__contains__('\\') is False]) for tweet in column.str.split()] + + column = pd.Series(column) + return column \ No newline at end of file diff --git a/code/preprocessing/tokenizer.py b/code/preprocessing/tokenizer.py index 94191502..85420b2d 100644 --- a/code/preprocessing/tokenizer.py +++ b/code/preprocessing/tokenizer.py @@ -24,14 +24,20 @@ def _get_values(self, inputs): """Tokenize the tweet.""" tokenized = [] - + import pdb for tweet in inputs[0]: - sentences = nltk.sent_tokenize(tweet) + #pdb.set_trace() + if type(tweet) is float: + # if tweet is nan, maybe because of stopword remove + sentences = nltk.sent_tokenize('') + else: + sentences = nltk.sent_tokenize(tweet) tokenized_tweet = [] for sentence in sentences: words = nltk.word_tokenize(sentence) tokenized_tweet += words tokenized.append(str(tokenized_tweet)) - + + #pdb.set_trace() return tokenized \ No newline at end of file diff --git a/code/util.py b/code/util.py index 7d8794c7..37fe5bd7 100644 --- a/code/util.py +++ b/code/util.py @@ -16,5 +16,9 @@ # column names of novel columns for preprocessing COLUMN_LABEL = "label" COLUMN_PUNCTUATION = "tweet_no_punctuation" +COLUMN_LANGUAGE = "language" +COLUMN_PREPROCESS = 'preprocess_col' +SUFFIX_TOKENIZED = "_tokenized" -SUFFIX_TOKENIZED = "_tokenized" \ No newline at end of file +# number of features for hash vector +HASH_VECTOR_N_FEATURES = 2**10 \ No newline at end of file diff --git a/codes/feature_extraction.sh b/codes/feature_extraction.sh new file mode 100755 index 00000000..e6b7ea3c --- /dev/null +++ b/codes/feature_extraction.sh @@ -0,0 +1,14 @@ +#!/bin/bash + +# create directory if not yet existing +mkdir -p data/feature_extraction/ + +# run feature extraction on training set (may need to fit extractors) +echo " training set" +python -m codes.feature_extraction.extract_features data/preprocessing/split/training.csv data/feature_extraction/training.pickle -e data/feature_extraction/pipeline.pickle --char_length --photo_bool --replies_count + +# run feature extraction on validation set and test set (with pre-fit extractors) +echo " validation set" +python -m 
codes.feature_extraction.extract_features data/preprocessing/split/validation.csv data/feature_extraction/validation.pickle -i data/feature_extraction/pipeline.pickle +echo " test set" +python -m codes.feature_extraction.extract_features data/preprocessing/split/test.csv data/feature_extraction/test.pickle -i data/feature_extraction/pipeline.pickle diff --git a/codes/feature_extraction/extract_features.py b/codes/feature_extraction/extract_features.py new file mode 100644 index 00000000..f3526564 --- /dev/null +++ b/codes/feature_extraction/extract_features.py @@ -0,0 +1,85 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Runs the specified collection of feature extractors. + +Created on Wed Sep 29 11:00:24 2021 + +@author: lbechberger +""" + +import argparse, csv, pickle +import pandas as pd +import numpy as np +from codes.feature_extraction.character_length import CharacterLength +from codes.feature_extraction.hash_vector import HashVector +from codes.feature_extraction.feature_collector import FeatureCollector +from codes.feature_extraction.photo_bool import PhotoBool +from codes.feature_extraction.replies_count import RepliesCount +from codes.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_PREPROCESS, COLUMN_PHOTOS, COLUMN_REPLIES + + +# setting up CLI +parser = argparse.ArgumentParser(description = "Feature Extraction") +parser.add_argument("input_file", help = "path to the input csv file") +parser.add_argument("output_file", help = "path to the output pickle file") +parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None) +parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None) +parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet") +parser.add_argument("--hash_vec", action = "store_true", help = "compute the hash vector of the tweet") +parser.add_argument("--photo_bool", action= "store_true", help= "tells whether the tweet contains photos or not") +parser.add_argument("--replies_count", action="store_true", help="compute the amount of replies of the tweet") +args = parser.parse_args() + +# load data +df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n") + +if args.import_file is not None: + # simply import an exisiting FeatureCollector + with open(args.import_file, "rb") as f_in: + feature_collector = pickle.load(f_in) + +else: # need to create FeatureCollector manually + + # collect all feature extractors + features = [] + if args.char_length: + # character length of original tweet (without any changes) + features.append(CharacterLength(COLUMN_TWEET)) + if args.hash_vec: + # hash of original tweet (without any changes) + features.append(HashVector(COLUMN_TWEET)) + if args.photo_bool: + # do photos exist or not + features.append(PhotoBool(COLUMN_PHOTOS)) + if args.replies_count: + # how many replies does the tweet have + features.append(RepliesCount(COLUMN_REPLIES)) + + # create overall FeatureCollector + feature_collector = FeatureCollector(features) + + # fit it on the given data set (assumed to be training data) + feature_collector.fit(df) + + + +# apply the given FeatureCollector on the current data set +# maps the pandas DataFrame to an numpy array +feature_array = feature_collector.transform(df) + +# get label array +label_array = np.array(df[COLUMN_LABEL]) +label_array = label_array.reshape(-1, 1) + +# store the results +results = 
{"features": feature_array, "labels": label_array, + "feature_names": feature_collector.get_feature_names()} + +with open(args.output_file, 'wb') as f_out: + pickle.dump(results, f_out) + +# export the FeatureCollector as pickle file if desired by user +if args.export_file is not None: + with open(args.export_file, 'wb') as f_out: + pickle.dump(feature_collector, f_out) diff --git a/codes/feature_extraction/photo_bool.py b/codes/feature_extraction/photo_bool.py new file mode 100644 index 00000000..93942b13 --- /dev/null +++ b/codes/feature_extraction/photo_bool.py @@ -0,0 +1,33 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Simple feature that tells whether photos are present or not. + +Created on Wed Sep 29 12:29:25 2021 + +@author: shagemann +""" + +import numpy as np +from codes.feature_extraction.feature_extractor import FeatureExtractor + +# class for extracting the photo-bool as a feature +class PhotoBool(FeatureExtractor): + + # constructor + def __init__(self, input_column): + super().__init__([input_column], "{0}_photo_bool".format(input_column)) + + # don't need to fit, so don't overwrite _set_variables() + + # 0 if no photos, return 1 else + def _get_values(self, inputs): + values = [] + for index, row in inputs[0].iteritems(): + if len(row) > 2: + values.append(1) + else: + values.append(0) + result = np.array(values) + result = result.reshape(-1,1) + return result diff --git a/codes/feature_extraction/replies_count.py b/codes/feature_extraction/replies_count.py new file mode 100644 index 00000000..ada34fd4 --- /dev/null +++ b/codes/feature_extraction/replies_count.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Simple feature that tells how many replies a tweet has. + +Created on Wed Sep 29 12:29:25 2021 + +@author: shagemann +""" + +import numpy as np +from codes.feature_extraction.feature_extractor import FeatureExtractor + +# class for extracting the photo-bool as a feature +class RepliesCount(FeatureExtractor): + + # constructor + def __init__(self, input_column): + super().__init__([input_column], "{0}_replies_count".format(input_column)) + + # don't need to fit, so don't overwrite _set_variables() + + # use the replies count column as a feature + def _get_values(self, inputs): + values = [] + for index, row in inputs[0].iteritems(): + values.append(int(row)) + result = np.array(values) + result = result.reshape(-1,1) + return result diff --git a/codes/util.py b/codes/util.py new file mode 100644 index 00000000..d845bded --- /dev/null +++ b/codes/util.py @@ -0,0 +1,26 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Utility file for collecting frequently used constants and helper functions. 
+ +Created on Wed Sep 29 10:50:36 2021 + +@author: lbechberger +""" + +# column names for the original data frame +COLUMN_TWEET = "tweet" +COLUMN_LIKES = "likes_count" +COLUMN_RETWEETS = "retweets_count" +COLUMN_PHOTOS = "photos" +COLUMN_REPLIES = "replies_count" + +# column names of novel columns for preprocessing +COLUMN_LABEL = "label" +COLUMN_PUNCTUATION = "tweet_no_punctuation" +COLUMN_LANGUAGE = "language" +COLUMN_PREPROCESS = 'preprocess_col' +SUFFIX_TOKENIZED = "_tokenized" + +# number of features for hash vector +HASH_VECTOR_N_FEATURES = 2**3 diff --git a/test/feature_extraction/hash_vector_test.py b/test/feature_extraction/hash_vector_test.py new file mode 100644 index 00000000..1adb0d69 --- /dev/null +++ b/test/feature_extraction/hash_vector_test.py @@ -0,0 +1,33 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Thu Oct 7 14:51:00 2021 + +@author: ml +""" + +import unittest +import pandas as pd +import nltk +from code.feature_extraction.hash_vector import HashVector + +class HashVectorTest(unittest.TestCase): + + def setUp(self): + self.INPUT_COLUMN = "input" + self.hash_vector_feature = HashVector(self.INPUT_COLUMN) + self.df = pd.DataFrame() + self.df[self.INPUT_COLUMN] = ['["This", "is", "a", "tweet", "This", "is", "also", "a", "test"]', '["This", "is", "a", "tweet", "This", "is", "also", "a", "test"]'] + + def test_input_columns(self): + self.assertEqual(self.hash_vector_feature._input_columns, [self.INPUT_COLUMN]) + + def test_feature_name(self): + self.assertEqual(self.hash_vector_feature.get_feature_name(), self.INPUT_COLUMN + "_hashvector") + + + + + +if __name__ == '__main__': + unittest.main() \ No newline at end of file
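The test above only checks the input column and the feature name. A possible additional test, sketched under the assumption that calling the protected `_get_values` with a list containing the input Series is acceptable in this test suite (it mirrors how the feature extractors receive their inputs), would pin down the output dimensionality:

```python
import unittest
import pandas as pd
from code.feature_extraction.hash_vector import HashVector
from code.util import HASH_VECTOR_N_FEATURES

class HashVectorShapeTest(unittest.TestCase):

    def setUp(self):
        self.INPUT_COLUMN = "input"
        self.hash_vector_feature = HashVector(self.INPUT_COLUMN)
        self.df = pd.DataFrame()
        self.df[self.INPUT_COLUMN] = ["This is a tweet", "This is also a test"]

    def test_output_shape(self):
        # _get_values expects a list of input columns; index 0 holds the tweet texts
        values = self.hash_vector_feature._get_values([self.df[self.INPUT_COLUMN]])
        # one row per tweet, one column per hash bucket
        self.assertEqual(values.shape, (2, HASH_VECTOR_N_FEATURES))

if __name__ == '__main__':
    unittest.main()
```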