Machine Learning in Practice block seminar, winter term 2021/22 @ UOS.
Held by Lucas Bechberger, M.Sc.
Group members: Dennis Hesenkamp, Yannik Ullrich
- Introduction
- Preprocessing
- Feature Extraction
- Dimensionality Reduction
- Classification
- Evaluation Metrics
- Hyperparameter Optimization
- Results
- Conclusion
- Next Steps
- Resources
This file contains the documentation for our project, which aims to classify tweets as viral/non-viral based on multiple features derived from
- the metadata of the tweet and
- the natural language features of the tweet.
The data set used is Ruchi Bhatia's Data Science Tweets 2010-2021 from Kaggle. The code base on which we built our machine learning pipeline was provided by Lucas Bechberger (lecturer) and can be found here.
We can see that, over the years, the interest in data science and related topics has grown very fast:
Fig 1: Tweets per year.
Also, most tweets in our data set are written in English, as can be seen here:
Fig 2: Language distribution of the tweets.
The data set provides the raw tweet as it was posted, as well as multiple features related to the tweet, for instance the account that published it, the time it was published, whether it contained any media (photo, video, URL, etc.), and many more. We employed multiple preprocessing steps to transform the input data into a more usable format for the feature extraction steps later on.
As a very first step, we split the data set into a training (60%), a validation (20%), and a test set (20%). We implement and tune everything on the training and validation sets and only evaluate our final classifier on the test set.
In the lecture, Lucas implemented a tokenizer that disassembles tweets into individual words using the nltk library1. This splits the raw tweet into its constituents, i.e. the single words and punctuation marks it contains. Further processing and feature extraction can then operate on the single components of a sentence/tweet instead of on one long string.
Example:
import nltk
sent = 'There is great genius behind all this.'
nltk.word_tokenize(sent)
# ['There', 'is', 'great', 'genius', 'behind', 'all', 'this', '.']
To extract meaningful natural language features from a string, it makes sense to first remove any stopwords occurring in that string. Say, for example, one would like to look at the most frequently occurring words in a large corpus. Usually, that means looking at words which actually carry meaning in the given context. According to the OEC2, the largest 21st-century English text corpus, the most common word in English is the - from which we cannot derive any meaning. Hence, it makes sense to remove the and other non-meaning-carrying words (= stopwords) from a corpus (the set of tweets in our case) before doing anything like keyword or occurrence frequency analysis.
There is not one universal stopword list, nor are there universal rules on how stopwords should be defined. For the sake of convenience, we decided to use gensim's gensim.parsing.preprocessing.remove_stopwords function3, which uses gensim's built-in stopword list containing high-frequency words with little lexical content.
Example:
import gensim
sent = 'There is great genius behind all this.'
gensim.parsing.preprocessing.remove_stopwords(sent)
# 'There great genius this.'
Other options would have been nltk's stopword corpus4, an annotated corpus with 2,400 stopwords from 11 languages, or spaCy's stopword list5, but we faced problems implementing the former, while gensim's corpus apparently contains more words and leads to better results compared to the latter.
Punctuation removal follows the same rationale as stopword removal: a dot, hyphen, or exclamation mark will probably occur often in the corpus without carrying much meaning at first sight (we can actually also infer features from punctuation, more about that in Sentiment Analysis). A feature for removing punctuation from the raw tweet had already been implemented by Lucas during the lecture using the string package. Again, alternatives exist - for example gensim, which offers a function for punctuation removal6. We had to rebuild this class: it was initially meant to be the first step in the preprocessing pipeline, but we now run it in second place, so we had to change how the class handles input and output and add an additional command line argument. We did not change the method of removing punctuation itself, as there is little benefit in comparing different ways of punctuation removal, in contrast to stopword removal, where lists can vary a lot based on the corpus.
Example:
import re
import string
sent = "O, that my tongue were in the thunder's mouth!"
punctuation = '[{}]'.format(string.punctuation)  # regex character class matching any punctuation mark
re.sub(punctuation, '', sent)
# "O that my tongue were in the thunders mouth"
Caveat: in our implementation, the input is a whole dataframe column (dtype object) rather than a single Python string, so the call looks slightly different there. The snippet above (using re.sub) is only meant to illustrate how our code and punctuation removal in general work.
Lemmatization modifies an inflected or variant form of a word into its lemma or dictionary form. Through lemmatization, we can make sure that words - on a semantic level - get interpreted in the same way, even when inflected: walk and walking, for example, derive from the same word and ultimately carry the same meaning. We decided to use lemmatization as opposed to stemming, although it is computationally more expensive. This is because lemmatization takes context into account, as it depends on part-of-speech (PoS) tagging.
To implement this, we used nltk's pos_tag to assign PoS tags and WordNet's WordNetLemmatizer() class, as well as a manually defined PoS dictionary to reduce the (rather detailed) tags from pos_tag to only four different ones, namely noun, verb, adjective, and adverb:
from nltk.corpus import wordnet
tag_dict = {"J": wordnet.ADJ,
            "N": wordnet.NOUN,
            "V": wordnet.VERB,
            "R": wordnet.ADV}
This simplified PoS assignment is important because pos_tag returns a tuple, which has to be converted to a format the WordNet lemmatizer can work with; furthermore, WordNet lemmatizes differently for different PoS classes and only distinguishes between the four classes mentioned above. Credit to this blog entry by Selva Prabhakaran for the idea and the code.
Example:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
sent = ['These', 'newer', 'data', 'help', 'scientists', 'accurately', 'project', 'how', 'quickly', 'glaciers', 'are', 'retreating', '.']
lem = WordNetLemmatizer()
lemmatized = []
for word in sent:
    # get the part-of-speech tag
    tag = pos_tag([word])[0][1][0].upper()
    lemmatized.append(lem.lemmatize(word.lower(), tag_dict.get(tag, wordnet.NOUN)))
# ['these', 'newer', 'data', 'help', 'scientist', 'accurately', 'project', 'how', 'quickly', 'glacier', 'be', 'retreat', '.']
Whenever the PoS tagging encounters an unknown tag or a tag which the lemmatizer cannot handle, the default tag to be used is wordnet.NOUN.
As mentioned in the beginning, alternatively to lemmatization we could use the computationally cheaper stemming, which only reduces an inflected word to its stem (e.g. accurately becomes accur). This could be done with gensim.parsing.preprocessing.stem7.
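For illustration, a minimal sketch of this alternative (the output shown relies on the Porter stemmer that gensim uses internally):
from gensim.parsing.preprocessing import stem
stem("accurately")
# 'accur'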
The above preprocessing steps have all been tested and work fine. Some of them can be performed independently, but we built the pipeline such that they stack. To ensure proper functionality, the order of steps has to be as follows:
- Stopword removal
- Punctuation removal
- Tokenization
- Lemmatization
Input columns have to be specified accordingly with the provided command line arguments (see readme for more info).
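For illustration, here is a hedged, stand-alone sketch that chains the four steps in this order on a single example sentence. Our actual pipeline operates on dataframe columns through the command line scripts, so the structure below is simplified and the intermediate variable names are made up for this example:
import string
import nltk
from gensim.parsing.preprocessing import remove_stopwords
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# requires the nltk data packages 'punkt', 'averaged_perceptron_tagger', and 'wordnet'
tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
lem = WordNetLemmatizer()
tweet = "These newer data help scientists accurately project how quickly glaciers are retreating."
# 1. stopword removal
no_stopwords = remove_stopwords(tweet)
# 2. punctuation removal
no_punctuation = "".join(c for c in no_stopwords if c not in string.punctuation)
# 3. tokenization
tokens = nltk.word_tokenize(no_punctuation)
# 4. lemmatization, defaulting to noun for tags WordNet does not know
lemmas = [lem.lemmatize(t.lower(), tag_dict.get(pos_tag([t])[0][1][0].upper(), wordnet.NOUN)) for t in tokens]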
After the preprocessing of the data is done, we can move on to extracting features from the dataset.
The length of a tweet might influence its chance of going viral, as people might prefer shorter texts on social media (or longer, more complex ones). This feature was already implemented by Lucas as an example, using len().
Example:
sent = 'There is great genius behind all this.'
len(sent)
# 38
This is, however simple it may be, a difficult feature to interpret: for most of its existence, Twitter has had a limit of 140 characters per tweet. In 2017, the maximum length was raised to 280 characters8, which led to an almost immediate drop in the prevalence of tweets with around 140 characters, while tweets approaching 280 characters appear to be syntactically and semantically similar to the roughly 140-character tweets from before the change (Gligorić et al., 2020).
We thought that the month in which a tweet was published could have (some minor?) influence on its virality. Maybe during holiday season or the darker time of the year, i.e. from October to March, people spend more time on the internet; tweets might then get more interaction, which leads to a higher potential of going viral.
We extracted the month from the date column of the dataframe using the datetime package as follows:
import datetime
date = "2021-04-14"
datetime.datetime.strptime(date, "%Y-%m-%d").month
# 4
The result we return is the respective month. We have not yet implemented one-hot encoding for the result because we decided rather quickly that we do not want to use this feature: we could not find evidence or research supporting our assumption that screen time/time spent on the internet is higher during certain months or periods of the year. How one-hot encoding is done can be seen in Time of Day.
Using the VADER (Valence Aware Dictionary and sEntiment Reasoner)9 10 framework, we extract the compound sentiment of a tweet. VADER was built for social media and takes into account, among other factors, emojis, punctuation, and capitalization - which is why we let it work on the unmodified tweet column of the dataframe, ensuring that we do not artificially modify the sentiment. The polarity_scores() function returns a value for positive, negative, and neutral polarity, as well as an additional compound value in the range [-1, 1] (from most negative to most positive).
Example:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sentences = ["The service here is good.",
             "The service here is extremely good.",
             "The service here is extremely good!!!",
             "The service here is extremely good!!! ;)"]
for s in sentences:
    print(sia.polarity_scores(s)['compound'])
# 0.4404
# 0.4927
# 0.6211
# 0.7389
We can see how the compound sentiment changes with the addition of words, punctuation, and emojis. We decided to only use the compound sentiment as a measure because we felt that this is the most informative one. A tweet might have a certain negativity score (indicating, e.g., that parts of it are negatively phrased) because of a few words, while the rest of the tweet is phrased very positively, resulting in a positive compound sentiment. However, compared to a tweet with only neutral phrasing (i.e. a negative score of 0), such individual polarity scores can be misleading on their own, which is why we rely on the compound value.
Nota bene: We added
As opposed to the Month feature, which we ended up not using, we felt that the time of day during which a tweet was posted might very well have an influence on its virality. For example, we suppose that fewer people are online during the night, decreasing a tweet's virality potential. We decided to split the day into time ranges with hard boundaries:
- Morning hours from 5am to 9am (5 hours)
- Midday from 10am to 2pm (5 hours)
- Afternoon from 3pm to 6pm (4 hours)
- Evening from 7pm to 11pm (5 hours)
- Night from 12am to 4am (5 hours)
The time sections are roughly equally sized, with afternoon being the only exception. We decided to split the day like this based on our own experience and have not tested different splits. Alternatively, one could test
- Different time ranges
- A finer split, i.e. more categories
- A coarser split, i.e. fewer categories
We extracted the time from the time column of the dataframe and simply used the split() method to extract the hour from the string it is stored in, then checked which predefined range the extracted value falls into and appended the respective category to a new column. We then one-hot encoded the result to retrieve a binary encoding for every entry with pandas' get_dummies() function.
import pandas as pd
times = ["05:15:01", "07:11:31", "16:04:59", "23:12:00"]
hours = [int(t.split(":")[0]) for t in times]
# [5, 7, 16, 23]
result = []
for hour in hours:
    if hour in range(0, 6):
        result.append(0)
    elif hour in range(6, 11):
        result.append(1)
    elif hour in range(11, 15):
        result.append(2)
    elif hour in range(15, 19):
        result.append(3)
    elif hour in range(19, 24):
        result.append(4)
pd.get_dummies(result)
#    0  1  3  4
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
# only yields four one-hot columns in this case because the fifth category does not occur
Named entity recognition (NER) aims to identify so-called named entities in unstructured texts. We implemented this feature using spaCy's pre-trained en_core_web_sm pipeline11, which has been trained on the OntoNotes 5.012 and WordNet 3.013 databases. The following entity types are supported by this model:
import spacy
spacy.info("en_core_web_sm")['labels']['ner']
# ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
The entities can be accessed as follows:
ner = spacy.load("en_core_web_sm")
sent1 = "The current Dalai Lama was born in 1935 in Tibet"
sent2 = "Big Data and AI are a hot trend in 2021 on Twitter"
ner(sent1).ents
# (1935, Tibet)
ner(sent2).ents
# (Big Data, AI, 2021, Twitter)
len(ner(sent2).ents)
# 4
As can be seen in the above example, the NER does not work perfectly: "Dalai Lama" in the first sentence is a named entity but not recognized as such. However, we still decided to make use of this feature as it classifies most named entities correctly. We count the number of named entities per tweet and store the occurrences as an integer. The pipeline we employed was specifically designed for web language. We went with the English version - although an abundance of other languages is available - because the number of tweets in other languages in our dataset is rather low and the model might still capture named entities, even if in another language:
sent3 = "Kanzlerin Merkel hat 16 Jahre der Bundesregierung vorgesessen."
ner(sent3).ents
# (Kanzlerin Merkel, 16, Jahre, Bundesregierung)
Other models are offered as well; en_core_web_sm is a small one designed for efficiency. Alternatively, models trained on a larger corpus or tuned towards higher accuracy are available. Even with the efficient version, this feature extraction step is still computationally quite expensive.
NER with nltk is also possible when utilizing the pos_tag() function; it requires much more effort, though, as noun phrase chunking and regular expressions have to be used for the classification.
In this section, we evaluate whether any media - a URL, photos, mentions of other accounts, or hashtags - have been attached to a tweet, as a binary feature (1 if attached, 0 otherwise). Our thinking here was that such additional content might influence the potential virality of a tweet. We accessed the respective columns of the dataframe (url, photos, mentions, hashtags), in which the entries are stored as lists. Hence, we could simply evaluate the length of the entries: if it exceeds 2, the column contains more than just the empty brackets and the tweet contains the respective feature.
Example with URL:
urls = ["[]", "['https://www.sciencenews.org/article/climate-thwaites-glacier-under-ice-shelf-risks-warm-water']", "[]"]
result = [0 if len(url) <= 2 else 1 for url in urls]
# [0, 1, 0]
Important: although representing lists, the column entries are still evaluated as strings, which is why checking for a length of less than or equal to 2 works here. The evaluation procedure (checking the length) is the same for all of the above features.
We also figured that the number of replies has an influence on virality: the more people engage with a tweet and reply to it, the more people see it in their news feed, which again increases reach and interactions. The number of replies is stored as a float in the replies_count column of the dataframe, so we just have to access that column, make a copy, transform it to a numpy.array, and reshape it so the classifier can work with the data later on:
import numpy as np
replies = [0.0, 7.0, 2.0, 49.0]
np.array(replies).reshape(-1, 1)
# array([[ 0.],
# [ 7.],
# [ 2.],
# [49.]])
Retweets and likes follow the same rationale as replies. These are the most obvious features to consider when measuring virality and we implemented them only for the purpose of testing; we did not use them for training our model, since that easily results in a misleadingly high accuracy. Technically, they are handled exactly like the replies: we access the respective column, transform it to a numpy.array, and reshape it.
When considering a large number of features, we ultimately also have to think about whether they are all useful. Some features might contribute a lot to the classification task at hand, while others contribute little or nothing at all. When performing classification, large, high-dimensional feature spaces - which can grow very quickly and then suffer from the curse of dimensionality - can become computationally very costly, so it makes sense to distinguish between important and less important features.
We decided neither to implement new dimensionality reduction methods nor to use the already provided sklearn.feature_selection.SelectKBest procedure, since our feature vector comprised fewer than 20 features.
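For completeness, a rough sketch of how such a SelectKBest step could look if the feature space grew larger. The toy data, the choice of mutual information as score function, and k=3 are illustrative assumptions, not part of our pipeline:
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# toy data: 100 samples with 6 features and a binary label
rng = np.random.default_rng(42)
features = rng.random((100, 6))
labels = rng.integers(0, 2, size=100)
# keep only the k features that score highest with respect to the label
selector = SelectKBest(score_func=mutual_info_classif, k=3)
reduced = selector.fit_transform(features, labels)
reduced.shape
# (100, 3)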
The data vectors in our dataset have one of two possible labels - true if the tweet went viral, false if not. We are thus faced with a binary classification task and need classifiers suited for this kind of task.
sklearn.dummy.DummyClassifier
Dummy classifiers make predictions without any knowledge about patterns in the data. We implemented three of them with different rules to explore possible baselines to which we could compare our real classifiers later on:
- Majority vote: always assigns the most frequently occurring label in the data set, in our case false.
- Stratified classification: uses the data set's class distribution to assign labels.
- Uniform classification: makes random uniform predictions.
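A minimal sketch of how these three baselines can be set up with sklearn (the strategy names are sklearn's; the data is a placeholder):
from sklearn.dummy import DummyClassifier
X_train, y_train = [[0], [1], [2], [3]], [False, False, False, True]  # placeholder data
baselines = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "stratified": DummyClassifier(strategy="stratified"),
    "uniform": DummyClassifier(strategy="uniform"),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)  # ignores the features, only looks at the label statistics
    print(name, clf.predict([[4], [5]]))  # e.g. the majority baseline always predicts False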
from sklearn.neighbors import KNeighborsClassifier
The k-NN classifier was implemented by Lucas during the lecture. We use it with only one hyperparameter - k - for our binary classification task. This algorithm is an example of instance-based learning. It relies on the distance between data points for classification and hence requires standardization of the feature vectors.
We additionally implemented a way to change the weight function. By default, the KNeighborsClassifier works with uniform weights, i.e. all k neighbors contribute equally to the vote. Having an additional option for distance-weighted classification, where nearer neighbors count more than those further away, made sense to us (and it also improved our results, as can be seen later).
Other than that, though, we left the classifier at its default settings. A notable alternative could have been the choice of the algorithm for computing the nearest neighbors, the options being brute-force search, k-dimensional tree search, and ball tree search. The default option is auto, where the classifier picks the method it deems fittest for the task at hand.
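A sketch of the corresponding setup; the toy data, the value k=5, and the scaling step stand in for our actual feature vectors and command line configuration:
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# toy data standing in for our feature vectors
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # k-NN relies on distances, so we standardize
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # weights="uniform" is the default
knn.fit(X, y)
knn.predict(X[:3])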
from sklearn.tree import DecisionTreeClassifier
Further, we implemented a decision tree classifier. Due to its nature of learning decision rules from the dataset, it neither requires standardization of the data nor makes assumptions about the data distribution.
We added the option to define a maximum depth of the tree, which is extremely important to cope with overfitting. Further, the criterion for measuring split quality can be chosen between Gini impurity and entropy/information gain. The former is usually preferred for classification and regression trees (CART), while the latter is used for the ID3 variant of decision trees. Although sklearn employs a version of the CART algorithm, it can nonetheless use entropy as its split measure.
Decision trees generally have difficulties working with continuous data, and the compound sentiment (see Sentiment Analysis) is such a feature, continuous in the range [-1, 1].
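A minimal sketch of the corresponding constructor call; the values are illustrative, not our final configuration:
from sklearn.tree import DecisionTreeClassifier
# split criterion and maximum depth are the two options we expose
tree = DecisionTreeClassifier(criterion="entropy", max_depth=20)
# criterion="gini" and max_depth=None (no depth restriction) are sklearn's defaults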
from sklearn.ensemble import RandomForestClassifier
Random forest classifiers represent an ensemble of multiple decision trees. They are often more robust and accurate than single decision trees and less prone to overfitting and can, therefore, better generalize on new, unseen data (Breiman, 2001). Random forests are able to deal with unbalanced datasets by down-sampling the majority class such that each subtree works on more balanced data (Biau and Scornet, 2016).
We implemented it such that we can modify the number of trees per forest, the maximum depth per tree, and the criterion based on which a split occurs; the options for the latter are the same as for single decision trees - Gini impurity and entropy. The first two are the main parameters to look at when constructing a forest, according to sklearn's user guide on ensemble classifiers14. Usually, the classification is obtained via a majority vote of the trees in a forest, but sklearn's implementation averages over the probabilistic class predictions of the single trees.
Being able to manipulate both the maximum depth as well as the split criterion further allows us to compare the forest to our single decision tree classifier, since we can use the same parametrization for both.
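Again a minimal sketch of the constructor with illustrative values for the three options we expose:
from sklearn.ensemble import RandomForestClassifier
# number of trees, split criterion, and maximum depth; max_depth=None lets each tree grow to full depth
forest = RandomForestClassifier(n_estimators=25, criterion="gini", max_depth=None)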
from sklearn.svm import SVC
We also added a support vector machine (SVM). This classifier seeks to find a hyperplane in the data space which maximizes the margin between the classes. It can also deal with data that is not linearly separable by implicitly mapping it into a higher-dimensional space via the kernel (the so-called kernel trick). sklearn offers a choice between a linear, polynomial (default degree: 3), radial basis function, and sigmoid kernel. We decided to implement a way to change the kernel, as this can strongly affect the outcome of the classifier.
SVMs are sensitive to unscaled data and require standardization of the input, which we carried out using the StandardScaler() from sklearn.preprocessing. Class weights can be set via the class_weight parameter to deal with unbalanced data sets; we did not implement this parameter.
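A minimal sketch of this combination; the kernel choice is illustrative:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# scaling and the SVM chained together; the kernel is the option we expose
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # other kernels: "linear", "poly", "sigmoid"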
from sklearn.neural_network import MLPClassifier
The multilayer perceptron (MLP) usually consists of at least three layers: one input layer, one hidden layer, and one output layer. We implemented it such that we can define the number of hidden layers as well as the number of neurons per layer. It should be noted that sklearn's implementation of the MLP stops after 200 iterations (the default) if the network has not converged to a solution by then.
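A minimal sketch with an illustrative hidden layer structure:
from sklearn.neural_network import MLPClassifier
# two hidden layers with 50 and 25 neurons; max_iter=200 is sklearn's default stopping point
mlp = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=200)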
from sklearn.naive_bayes import ComplementNB
As a sixth (and last) classifier, we implemented one of the naive Bayes variants that sklearn offers. The two classic variants are Gaussian and multinomial naive Bayes, yet we chose the complement naive Bayes (CNB) algorithm, as it was specifically designed to deal with unbalanced data and addresses some of the shortcomings of the multinomial variant (Rennie et al., 2003).
No additional parameters to adapt were implemented. Alternatively, we could have added a command line argument for the smoothing parameter alpha, which applies Laplace smoothing and takes care of the zero probability problem; its default value is 1.0.
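A minimal sketch; note that, like the multinomial variant, sklearn's ComplementNB expects non-negative feature values:
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB(alpha=1.0)  # alpha is the smoothing parameter mentioned above; 1.0 is the default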
We implemented multiple evaluation metrics to see how well our classification works and to compare the different classifiers described above.
The accuracy - or fraction of correctly classified samples - might just be the simplest statistic for evaluating a classification task. Based on the confusion matrix entries (true/false positives and negatives), it can be calculated as follows:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
The best value is 1 (every sample classified correctly), the worst is 0.
The balanced accuracy is better suited for unbalanced data sets. It is based on two other commonly used metrics, the sensitivity (true positive rate) and the specificity (true negative rate); see the section F1-Score for more details. Its calculation works as follows:
$$ \text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} $$
Again, values range from 0 (worst) to 1 (best).
Cohen's kappa is another metric which is said to be very robust against class imbalance and, therefore, well suited for our task.
Calculation: $$ \text{Cohen's kappa} = \frac{\text{Accuracy} - p_e}{1 - p_e} $$
with
$$ p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2} $$
where $p_e$ is the probability of agreeing on a label by chance, estimated from the marginal distributions of the predicted and the true labels.
The Fβ-score is a measure which combines precision and recall into a single value. The relative contribution of precision and recall can be adjusted with the β-value:
$$ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} $$
with
$$ \text{Precision} = \frac{TP}{TP + FP} $$
and
$$ \text{Recall} = \frac{TP}{TP + FN} $$
Values range from 0 (worst) to 1 (best). With β = 1 we obtain the F1-score, which weights precision and recall equally.
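All four metrics are available in sklearn.metrics; a small sketch with made-up labels and predictions:
from sklearn.metrics import accuracy_score, balanced_accuracy_score, cohen_kappa_score, fbeta_score
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]  # made-up ground truth (1 = viral)
y_pred = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]  # made-up predictions
accuracy_score(y_true, y_pred)
# 0.8
balanced_accuracy_score(y_true, y_pred)
# 0.7916...
cohen_kappa_score(y_true, y_pred)
# 0.5833...
fbeta_score(y_true, y_pred, beta=1)
# 0.75 (beta=1 gives the F1-score)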
mlflow ui --backend-store-uri data/classification/mlflow
After having done preprocessing and feature extraction, chosen evaluation metrics, and decided on the classifiers and their parameters, we ran different configurations on the training and validation set to find the most promising classifier and hyperparameter set. Listing the results of every possible combination would go beyond the scope of this documentation, which is why we only provide an overview of all tested combinations and the most notable results. We tracked all results using the mlflow package, which allows for very convenient logging of the used parameters and metrics.
For the k-NN classifier, we tested the following parameter combinations:
| Weight | k values tested |
| --- | --- |
| Uniform | 1, 3, 5, 7, 9 |
| Distance | 1, 3, 5, 7, 9 |
We used only odd values for k: since our task is binary classification, an odd number of neighbors can never produce a tied vote.
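Conceptually, the exploration boils down to a loop over this grid with every run logged to mlflow. The sketch below is a simplification with toy data, not our actual run script:
import mlflow
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# toy data standing in for our (already standardized) feature vectors
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
for weights in ["uniform", "distance"]:
    for k in [1, 3, 5, 7, 9]:
        with mlflow.start_run():
            knn = KNeighborsClassifier(n_neighbors=k, weights=weights).fit(X_train, y_train)
            mlflow.log_params({"k": k, "weights": weights})
            mlflow.log_metric("kappa", cohen_kappa_score(y_val, knn.predict(X_val)))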
For decision trees, we explored the following hyperparameter space:
| Criterion | Max depth values tested |
| --- | --- |
| Gini impurity | 16, 18, 20, 22, 24, 26, 28, 30, 32 |
| Entropy | 16, 18, 20, 22, 24, 26, 28, 30, 32 |
Additionally, we built the tree without depth restriction for both split criteria.
The random forest classifier comes with the added parameter of a set number of trees per forest:
| Parameter | Values tested |
| --- | --- |
| Trees per forest | 10, 25, 50, 100 |
For each possible number of trees per forest, we explored the same space as with the single decision tree. Further, we also built one forest without depth restriction for each combination of tree number and split criterion. A higher number of trees per forest usually yields better and more stable results, especially in terms of avoiding overfitting.
We tested the SVM classifier with four different kernels:
| Parameter | Values tested |
| --- | --- |
| Kernel | linear, polynomial, radial basis function, sigmoid |
The computational cost of the SVM classifier turned out to be very high, and training and validation took extremely long.
We tried many different combinations for the MLP classifier: we built it first with only one hidden layer (i.e. three layers in total, counting the input and output layers), then with two and three hidden layers. We tried every possible combination of neurons per hidden layer from the set (10, 25, 50), which yields a total of 39 combinations (3 + 9 + 27). The hyperparameter space for a network with three hidden layers and 10 neurons in hidden layer 1 would, e.g., look like this:
| Layer 1 | Layer 2 | Layer 3 |
| --- | --- | --- |
| 10 | 10 | 10, 25, 50 |
| 10 | 25 | 10, 25, 50 |
| 10 | 50 | 10, 25, 50 |
Additionally, we followed a promising lead and also trained a network with the hidden layer structure (100, 100, 100). After 200 iterations, training was abandoned because the network had still not converged - nonetheless delivering the best result we observed with the MLP thus far. We decided not to explore combinations with even higher numbers of neurons because of the computational cost.
Since we did not implement any hyperparameters to adjust, we only ran the CNB classifier once.
An important note right away: we did not use the grid of our institute for the hyperparameter optimization but only ran the classifier on a local machine. The results we obtained are from a naive exploration of the search space. We tried to narrow down interesting and promising configurations and ranges for every classifier by manual testing. Hence, we might have obtained results that are only local optima.
The results per classifier for our evaluation metrics on the validation set can be seen in the figures below:
Figure 3 shows that the majority of classifiers and configurations achieve a high accuracy, the CNB being an exception while the uniform classifier achieves an accuracy of 0.5, as expected. As already discussed earlier in the section Evaluation Metrics, this measurement does, unfortunately, not tell us much about the actual quality of the classifier.
In Figure 4, we can see that none of our classifiers performs much better than the balanced accuracy baseline of 0.5, a value that can easily be obtained by classifying most or all of the tweets as false. Only the CNB pulls ahead, achieving a score of 0.622, which is still rather low.
Looking further at Cohen's kappa in Figure 5, we can now see more of a difference in the performance of the configurations. The random forest classifier with 25 trees, Gini impurity, and no specified maximum depth performs best with a value of 0.134. We can also see that especially the MLP and SVM configurations are not useful. Decision tree, random forest, and k-NN have a similar performance; CNB scores equally well, given that only one configuration was run.
Figure 6 displays the F1-score, showing a similar picture to Figure 5: SVM and MLP do not perform well, while decision tree, random forest, and k-NN are again quite level in terms of mean score. We can see that even the uniform dummy classifier performs on par with our other classifiers, since it probably assigns the correct label to about half of the positive samples. Again, CNB leads the field with a score of 0.233.
Given the above results, we decided that the CNB classifier is the best choice to run on our test set. It performed best on both the balanced accuracy and F1-score, while being average on the Cohen's kappa metric.
Fig 7: Final result on test set.
Figure 7 shows the result of our CNB classifier on the test set. The scores resemble those achieved on the validation set quite closely, which confirms our choice.
In the end, we based our classifier decision not on one single metric as we had initially planned, but looked at a combination of values. That CNB worked best overall caught us by surprise, although we might just have been biased by the many configurations tested with the other classifiers compared to the single one from CNB. A random forest with Gini impurity split criterion and unlimited depth would be our second choice, the number of trees should be at least 25 (higher numbers did not yield better results, but they are possibly more robust to overfitting).
We decided to drop accuracy as a decision-influencing metric because of its drawbacks when working with unbalanced data. The scores on the other three metrics are still not very satisfactory and leave much room for improvement; our pipeline cannot be considered production-ready at this point due to its substandard performance. There may be several reasons for this.
First of all, the features we extracted are largely metadata-based. We have only implemented two natural language based features, namely the sentiment score and the named entity recognition. But even these two are not without error, as our examples have shown: sometimes they fail to label even simple examples correctly (see the sections Sentiment Analysis and Named Entity Recognition for details). It could also be that the features we extracted simply do not capture what makes a tweet go viral. There has been research into virality in the past, but it is not easy to pin down what exactly helps a tweet (or any piece of media, for that matter) become popular. Marketing agency OutGrow has put together an infographic with some aspects that seem to play a role in making content shareworthy15.
Further, we only did a naive hyperparameter optimization. It is possible that we found solutions only in a local optimum, while there are much more suitable classifier setups. In Figures 4-6, we can, for example, see that one MLP configuration, namely the one with hidden layer structure (100, 100, 100), outperforms the other MLPs. In this particular case, our exploration of the hyperparameter space was limited by computational capability; training more complex networks was simply not feasible for us.
On the other hand, when testing the decision tree and random forest classifiers, we also let one tree or forest grow to full depth per parameter combination. We figured this might lead to overfitting on the validation set, but the unrestricted-depth configurations actually achieved the best (in one case second-best) performance, i.e. for each evaluation metric the best performing trees and random forests were those which could grow to full depth.
After having discussed the results and possible shortcomings of our pipeline, we would like to point out directions for further research.
As already mentioned, we only implemented two natural language grounded features. It is likely that a greater focus on this kind of feature will lead to better classification results. One could, e.g., consider n-grams, word importance measures like tf-idf, or constituency and dependency parsing to measure syntactic complexity. There might also be more interesting (and more obvious) metadata features, such as the number of followers of a Twitter account or the history of viral posts of an account. Such features, though, seem less interesting to us compared to actual language based features.
A more thorough and thought-through implementation of classifiers based on our first results is another feasible direction. This work can be understood as laying some groundwork to possibly build on.
Lastly, it should not be forgotten that we worked only with a data set containing tweets about data science. While narrowing the domain probably makes it easier to work with features such as keywords, it also makes it much harder to obtain large data sets and to find general patterns in the data that can be applied to new data.
https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222
Footnotes
1. https://www.nltk.org/, retrieved Oct 26, 2021
2. https://web.archive.org/web/20111226085859/http://oxforddictionaries.com/words/the-oec-facts-about-the-language, retrieved Oct 26, 2021
3. https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.remove_stopwords, retrieved Oct 26, 2021
4. https://www.nltk.org/book/ch02.html, retrieved Oct 26, 2021
5. https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py, retrieved Oct 26, 2021
6. https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_punctuation, retrieved Oct 26, 2021
7. https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.stem, retrieved Oct 26, 2021
8. https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html, retrieved Oct 29, 2021
9. https://pypi.org/project/vaderSentiment/, retrieved Oct 15, 2021
10. https://github.com/cjhutto/vaderSentiment, retrieved Oct 15, 2021
11. https://spacy.io/usage/models, retrieved Oct 31, 2021
12. https://catalog.ldc.upenn.edu/LDC2013T19, retrieved Oct 31, 2021
13. https://wordnet.princeton.edu/, retrieved Oct 31, 2021
14. https://scikit-learn.org/stable/modules/ensemble.html#forest, retrieved Oct 31, 2021
15. https://outgrow.co/blog/infographic-science-behind-virality, retrieved Oct 31, 2021