# Documentation

## Preprocessing
**Lower case**
- turns every word in the tweet into lower case. Otherwise our classifier would treat e.g. "Dog" and "dog" as two completely different words.

## Feature Extraction
**PhotoAdded**
- checks whether a tweet has at least one photo attached
- we assume that a tweet with photos attached is more likely to go viral

**Sentiment Analysis**
- added the compound sentiment score as a feature. We argue that the sentiment of a tweet has an influence on its virality.
- we used NLTK's VADER, which is specifically attuned to sentiments expressed in social media
- our function returns a score between -1 (very negative) and +1 (very positive)
- our hypothesis is that tweets in the outer ranges (either very negative or very positive) will be more popular than neutral tweets with scores around 0

**HashtagCounter**
- counts the number of hashtags
- we assume that tweets with many hashtags attract more attention than tweets without hashtags

A sketch of these three feature extractors appears under *Code sketches* at the end of this document.

## Dimensionality Reduction

## Classification

### Evaluation metrics
**Accuracy**
- total number of correct predictions divided by the total number of predictions in the data set
- default implementation, but inappropriate due to our imbalanced data set

**Balanced Accuracy**
- arithmetic mean of sensitivity (recall on the positive class) and specificity (recall on the negative class)
- variant of the accuracy metric that is more appropriate for imbalanced data sets

**F1 Score**
- harmonic mean of precision (the fraction of positive predictions that actually belong to the positive class) and recall (the fraction of positive examples that are predicted as positive), balancing both in a single number
- F1 considers false positives and false negatives as equally important
- appropriate for imbalanced data sets
- generally a better scoring metric than *balanced accuracy* for imbalanced data when more attention on the positives is needed (in our case: tweets predicted as viral)

**Cohen's Kappa**
- measures interrater reliability (agreement between two raters rating the same items, corrected for how often the raters may agree by chance); here the two "raters" are the classifier's predictions and the true labels
- appropriate for imbalanced data sets

We chose these evaluation metrics because they remain informative on our highly imbalanced data set (roughly 5% viral, 95% not viral). A sketch computing all four with scikit-learn appears under *Code sketches* below.

### Evaluation baseline
**Majority vote classifier**
- always predicts the majority class (in our case "not viral")

**Uniform distribution classifier**
- generates predictions uniformly at random

**Frequency classifier**
- generates random predictions respecting the training set class distribution (i.e. the label frequency)

These baselines are also sketched under *Code sketches* below.

## Application
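## Code sketches

To make the feature extractors under *Feature Extraction* concrete, here is a minimal sketch. The tweet schema (a dict with `text` and `media` fields) is a hypothetical stand-in for however the project actually stores tweets; the sentiment part uses NLTK's `SentimentIntensityAnalyzer`, which needs the `vader_lexicon` resource downloaded once.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def preprocess(text: str) -> str:
    """Lowercase so that e.g. "Dog" and "dog" map to the same token."""
    return text.lower()

def sentiment_score(text: str) -> float:
    """Compound VADER score in [-1, 1]: -1 very negative, +1 very positive."""
    return sia.polarity_scores(text)["compound"]

def hashtag_count(text: str) -> int:
    """Number of hashtags in the tweet."""
    return sum(1 for token in text.split() if token.startswith("#"))

def photo_added(tweet: dict) -> bool:
    """Whether the tweet has at least one photo attached (hypothetical schema)."""
    return any(m.get("type") == "photo" for m in tweet.get("media", []))

tweet = {"text": "I LOVE my dog! #dogsoftwitter #cute", "media": [{"type": "photo"}]}
print(sentiment_score(tweet["text"]), hashtag_count(tweet["text"]), photo_added(tweet))
```

Note that VADER is applied to the raw text here: it uses capitalization ("LOVE" vs. "love") as an intensity signal, so lowercasing is better reserved for the word features.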
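All four evaluation metrics are available in scikit-learn. A small sketch on toy labels (1 = viral) that also illustrates why plain accuracy is misleading on imbalanced data:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score)

# Toy labels mimicking an imbalanced set (1 = viral, 0 = not viral).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("accuracy:         ", accuracy_score(y_true, y_pred))           # 0.8 -- looks fine
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # ~0.69
print("F1 (viral class): ", f1_score(y_true, y_pred))                 # 0.5 -- the honest picture
print("Cohen's kappa:    ", cohen_kappa_score(y_true, y_pred))        # ~0.38
```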
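The three baselines map directly onto `sklearn.dummy.DummyClassifier` strategies; a sketch, assuming a feature matrix `X` and labels `y` (the placeholder data below is illustrative only):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 3))                  # placeholder features
y = (rng.random(100) < 0.05).astype(int)  # ~5% viral, mimicking our imbalance

baselines = {
    "majority vote": DummyClassifier(strategy="most_frequent"),
    "uniform":       DummyClassifier(strategy="uniform", random_state=0),
    "frequency":     DummyClassifier(strategy="stratified", random_state=0),
}
for name, clf in baselines.items():
    clf.fit(X, y)  # DummyClassifier ignores X and only uses the label distribution
    print(name, clf.predict(X[:5]))
```

Any real classifier should beat all three on the metrics above. Note that the majority-vote baseline already achieves roughly 95% plain accuracy on our 5/95 split, which is exactly why accuracy alone is uninformative here.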