
\chapter{Message-Level Sentiment Analysis}\label{chap:cgsa}

Having familiarized ourselves with the peculiarities of the creation
of a sentiment corpus, the different ways to automatically induce new
polarity lists, and the difficulties of fine-grained opinion mining,
we now move on to the presumably most popular sentiment analysis
task---message-level sentiment analysis or MLSA, in which we need to
determine the overall polarity of a message.

Traditionally, this objective is addressed with one of three popular
method groups:
\begin{itemize}
  \item lexicon-based approaches,
  \item machine-learning--based (ML) techniques,
  \item and deep-learning--based (DL) systems.
\end{itemize}
In this chapter, we are going to scrutinize the most successful
representatives of each of these paradigms, propose our own solution,
and also analyze errors, the utility of single components, and the
effect of additional training factors on the net results of these
methods.

%% and also tackle a much more ambitious goal, namely to check whether
%% we can achieve results comparable with the scores of these methods
%% when the language of the domain we train on is completely different
%% from the language of the test data.

We begin our comparison by first presenting two metrics that we will
use in our subsequent evaluation.  After briefly describing the data
preparation step, we turn to popular lexicon-, ML-, and DL-based
approaches, explaining and evaluating them
in Sections~\ref{sec:cgsa:lexicon-based},~\ref{sec:cgsa:ml-based},
and~\ref{sec:cgsa:dl-based}.  Finally, we conclude with an extensive
evaluation of different hyperparameters and settings (including the
impact of additional noisily labeled training data, various types of
sentiment lexicons, and text normalization), summarizing our results
and recapping our findings at the end of this part.

\section{Evaluation Metrics}\label{sec:cgsa:eval-metrics}

To assess the quality of the compared systems, we will rely on two
established evaluation metrics that are commonly used to measure MLSA
results.  The first of these metrics is the \emph{macro-averaged
  \F-score} over the two main polarity classes (positive and negative):
\begin{equation*}
  F_1 = \frac{F_{pos} + F_{neg}}{2}.
\end{equation*}
This measure was first introduced by the organizers of the SemEval
competition~\cite{Nakov:13,Rosenthal:14,Rosenthal:15} and has become a
de facto standard not only for the SemEval dataset but virtually for
all related message-level sentiment corpora and tasks.  This score is
supposed to emphasize the ability of a classifier to distinguish
between opposite semantic orientations.  Although it seemingly ignores
the neutral class, misclassifications that involve this class are
indirectly taken into account as well, because confusing the neutral
label with another polarity will automatically pull down the values of
$F_{pos}$ or $F_{neg}$.
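To illustrate this effect with purely hypothetical numbers, suppose
that a classifier reaches a precision and recall of 0.8 on positive
tweets (so that $F_{pos} = 0.8$) and an \F-score of 0.6 on negative
ones, which gives a macro-average of 0.7.  If additional neutral
messages are now mistaken for positive ones, lowering the positive
precision to 0.6 while leaving the recall unchanged, we get:
\begin{align*}
  F_{pos} &= \frac{2 \times 0.6 \times 0.8}{0.6 + 0.8} \approx 0.686, &
  F_1 &= \frac{0.686 + 0.6}{2} \approx 0.643,
\end{align*}
\ie{} the macro-average drops even though the neutral class does not
appear in the formula itself.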

The second metric, \emph{micro-averaged \F-score}, explicitly
considers all three semantic orientations (positive, negative, and
neutral) and essentially corresponds to the prediction accuracy on the
complete dataset~\cite[see][p.~577]{Manning:99}.  This measure both
predates and supersedes the SemEval evaluation, as it had already been
used in the very first works on sentence-level opinion
mining~\cite{Wiebe:99,Das:01,Read:05,Kennedy:06,Go:09} and was
reintroduced at the GermEval shared task
in~2017~\cite{Wojatzki:17}.
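The correspondence to accuracy follows from the single-label setting:
Assuming that every message receives exactly one predicted and one
gold label, each misclassified tweet counts once as a false positive
(for the predicted class) and once as a false negative (for the gold
class).  With $TP_c$, $FP_c$, and $FN_c$ denoting the true positives,
false positives, and false negatives of class $c$, and $N$ being the
total number of messages (a notation that we introduce here only for
illustration), we get:
\begin{equation*}
  P_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}
  = \frac{\sum_c TP_c}{N}
  = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)} = R_{micro},
\end{equation*}
so that the harmonic mean of these two equal values is again
$\sum_c TP_c / N$, \ie{} the accuracy.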

Besides these two metrics, we will also report precision, recall, and
\F-scores for each particular polarity class.

\section{Data Preparation}\label{sec:cgsa:data}

As in the previous experiments, we preprocessed all tweets labeled by
the second annotator with the text normalization system
of~\citet{Sidarenka:13}, tokenized them using the same adjusted
version of Potts'
tokenizer,\footnote{\url{http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py}}
lemmatized and assigned part-of-speech tags to these tokens with the
\textsc{TreeTagger} of \citet{Schmid:95}, and obtained morphological
features and syntactic analyses with the \texttt{Mate} dependency
parser~\cite{Bohnet:13}.
% Apart from the PotTS dataset, we also applied this procedure to the
% microblogs of the German Twitter snapshot~\cite{Scheffler:14}, which
% will be used in our subsequent experiments on noisy supervision.

We again divided our corpus into training, development, and test sets,
using 70\% of the tweets for learning, 10\% for tuning and picking
optimal model parameters, and the remaining 20\% for evaluating the
results.  Drawing on the work of~\citet{Wiebe:05a}, we inferred the
polarity of these microblogs, which we will consider as gold labels in
our experiments, using a simple heuristic rule in which we assigned
the positive (negative) class to the messages that had exclusively
positive (negative) annotated \markable{sentiment}s, skipping all
microblogs that simultaneously contained multiple labeled opinions
with different semantic orientations (178 tweets).  In the cases when
there was no \markable{sentiment}, we resorted to a fallback strategy
by considering all tweets that contained exclusively positive
(negative) annotated \markable{polar term}s as positive (negative),
and ignoring all messages that featured polar elements from both
polarity classes (335 messages).\footnote{Note that we inferred all
  message-level labels based on \emph{annotated} \markable{sentiment}s
  and \markable{polar term}s and did not rely on the mere occurrence
  of positive or negative smileys, which did not necessarily imply an
  expression of polarity.}  Finally, all microblogs without any
\markable{sentiment}s or \markable{polar term}s were regarded as
neutral.
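Writing $S^{+}$ and $S^{-}$ for the numbers of positive and negative
annotated \markable{sentiment}s in a message, and $T^{+}$ and $T^{-}$
for the corresponding numbers of annotated \markable{polar term}s (a
notation that we introduce here only for illustration), this rule can
be summarized as:
\begin{equation*}
  \text{label} =
  \begin{cases}
    \text{positive}, & \text{if } S^{+} > 0 \text{ and } S^{-} = 0,\\
    \text{negative}, & \text{if } S^{-} > 0 \text{ and } S^{+} = 0,\\
    \text{skipped}, & \text{if } S^{+} > 0 \text{ and } S^{-} > 0,\\
    \text{positive}, & \text{if } S^{+} = S^{-} = 0,\ T^{+} > 0, \text{ and } T^{-} = 0,\\
    \text{negative}, & \text{if } S^{+} = S^{-} = 0,\ T^{-} > 0, \text{ and } T^{+} = 0,\\
    \text{skipped}, & \text{if } S^{+} = S^{-} = 0,\ T^{+} > 0, \text{ and } T^{-} > 0,\\
    \text{neutral}, & \text{if } S^{+} = S^{-} = T^{+} = T^{-} = 0.
  \end{cases}
\end{equation*}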

A few examples of such heuristically inferred labels are provided
below:
\begin{example}[Message-Level Sentiment Annotations]\label{snt:cgsa:exmp:anno1}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    \sentiment[polarity=positive]{Ich finde den Papst
      \emoexpression[polarity=positive]{putzig}\\
      \emoexpression[polarity=positive]{\smiley{}}}}\\
  \noindent \sentiment[polarity=positive]{I find the Pope \emoexpression[polarity=positive]{cute}\\
    \emoexpression[polarity=positive]{\smiley{}}.}\\
  \noindent\textup{\bfseries\textcolor{darkred}{Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\[1.5em]
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    \emoexpression[polarity=negative]{typisch} Bayern kaum ist der
    neue Papst da und schon haben sie ihn
    \emoexpression[polarity=negative]{in der Tasche} \ldots}\\
  \noindent \emoexpression[polarity=negative]{Typical} Bavaria The new
  Pope is hardly there, as they already have him
  \emoexpression[polarity=negative]{in their pocket}\\
  \noindent\textup{\bfseries\textcolor{darkred}{Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}
\end{example}
As we can see from the examples, our simple rule makes fairly
reasonable decisions, assigning the positive class to the first tweet,
which also expresses a positive sentiment, and labeling the second
message as negative, since it contains two negative polar terms
(``typisch'' [\emph{typical}] and ``in der Tasche haben'' [\emph{to
    have sb.\ in one's pocket}]).

However, because our approach is still an approximation and
consequently prone to errors (especially in the cases where the
polarity of the whole microblog differs from the semantic orientation
of its polar terms, as in the first tweet in
Example~\ref{snt:cgsa:exmp:anno2}, or where it is expressed without
any explicit polar terms at all, as in the second microblog of this
example), we decided to also evaluate all MLSA methods on another
German Twitter corpus, SB10k~\cite{Cieliebak:17}, which was introduced
after we had already started working on this chapter and which had
been explicitly annotated with message-level polarities of the tweets.

\begin{example}[Erroneous Sentiment
  Annotations]\label{snt:cgsa:exmp:anno2}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    Unser Park, unser Geld, unsere Stadt! -NICHT unser Finanzminister!
    \emoexpression[polarity=positive]{\smiley{}} \#schmid \#spd \#s21
    \#btw13}\\
  \noindent Our park, our money, our city! -NOT our Finance Minister!\\
  \emoexpression[polarity=positive]{\smiley{}} \#schmid \#spd \#s21
  \#btw13\\
  \noindent\textup{\bfseries\textcolor{darkred}{Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}\\[1.5em]
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape Auf
    die Lobby-FDP von heute kann Deutschland verzichten \ldots}\\
  \noindent Germany can go without today's lobby FDP\\
  \noindent\textup{\bfseries\textcolor{darkred}{Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

The SB10k dataset comprises a total of 9,738 microblogs, which were
sampled from a larger snapshot of 5M German tweets gathered between
August and November~2013.  To ensure lexical diversity and
proportional polarity distribution in this corpus, the authors first
grouped all posts of this snapshot into 2,500 clusters using the
$k$-means algorithm with unigram features.  Afterwards, from each of
these groups, they selected tweets that contained at least one
positive or one negative term from the German Polarity Clues
lexicon~\cite{Waltinger:10}.  Each message was subsequently annotated
by at least three human experts from a pool of 34 different
annotators.  The resulting inter-rater reliability (IRR) of this
annotation amounted to 0.39 Krippendorff's
$\alpha$~\cite{Krippendorff:07}.  Unfortunately, due to the
restrictions of Twitter's terms of use, which only allow distributing
the ids of the microblogs and their labels, we could only retrieve
7,476 tweets of this collection, which, however, still represent a
substantial part of the original dataset.

In addition to the aforementioned two corpora (PotTS and SB10k), we
also automatically annotated all microblogs of the German Twitter
Snapshot~\cite{Scheffler:14} by following the procedure
of~\citet{Read:05} and~\citet{Go:09} and assigning the positive
(negative) class to the tweets that contained respective emoticons,
regarding the rest of the microblogs as neutral.  In contrast to the
previous two datasets, whose labels were inferred or directly obtained
from manual annotations, we will not use this automatically tagged
corpus for evaluation, but will only harness it for training in our
later weak-supervision experiments.

The resulting statistics on the number of messages and polarity class
distribution in these data are shown in
Table~\ref{snt-cgsa:tbl:corp-dist}.
\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.163\columnwidth} % first column
        *{6}{>{\centering\arraybackslash}p{0.135\columnwidth}}} % remaining six columns
      \toprule
      \textbf{Dataset} & \multicolumn{4}{c}{\bfseries Polarity Class}%
      & \multicolumn{2}{c}{\bfseries Label Agreement}\\\cmidrule(lr){2-5}\cmidrule(lr){6-7}
                       & \textbf{Positive} & \textbf{Negative} %
                                           & \textbf{Neutral} & \textbf{Mixed*} %
                                                              & $\alpha$ & $\kappa$\\\midrule

      \textbf{PotTS} & 3,380 & 1,541 & 2,558 & 513 & 0.66 & 0.4\\
      \textbf{SB10k} & 1,717 & 1,130 & 4,629 & 0 & 0.39 & \NA{}\\
      \textbf{GTS} & 3,326,829 & 350,775 & 19,453,669 & 73,776 & \NA{} & \NA{}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Polarity class distribution in PotTS, SB10k, and the
    German Twitter Snapshot]{Polarity class distribution in PotTS,
      SB10k, and the German
      Twitter Snapshot (GTS)\\
      \emph{(* --- the \emph{mixed} polarity was excluded from our
        experiments)}}\label{snt-cgsa:tbl:corp-dist}
  \end{center}
\end{table}

As we can see, each dataset has its own unique composition of polar
tweets: The PotTS corpus, for example, shows a conspicuous bias
towards the positive class, with 42\% of its microblogs belonging to
this polarity.  We can partially explain this skewness by the
selection criteria that we used to compile the initial data for this
collection: Because a big part of this dataset was composed of
tweets that contained smileys, and most of these emoticons were
positive, as is evident from the statistics of the German Twitter
Snapshot, the selected microblogs also became biased towards this
semantic orientation.

The second most frequent group in the PotTS corpus are neutral tweets,
which account for 32\% of the data.  Negative messages, in contrast,
represent a clear minority in this collection (only 19\%), which,
however, is less surprising as the same tendency can be observed for
SB10k and the German Twitter Snapshot too.

Regarding the last two corpora, we can observe a more uniform (though
not identical) behavior, where both datasets are dominated by neutral
posts, which constitute 62\% of SB10k and 84\% of all snapshot tweets.
The positive class, again, makes up a big part of these data (23\% of
the former corpus and 14\% of the latter dataset), but its influence
this time is much less pronounced than in the PotTS case.  Finally,
negative tweets are again the least represented semantic orientation.
The only group that has even fewer instances than this class is the
\textsc{Mixed} polarity.  We, however, will skip the mixed orientation
in our experiments for the sake of simplicity and uniformity of
evaluation.

%% Last but not least, the results of the inter-rater reliability test
%% confirm the superior quality of the PotTS corpus, which, even despite
%% its approximate labels, still has an $\alpha$
%% agreement~\cite{Krippendorff:07} that is almost 1.7 times as high as
%% the respective score of the SB10k set (0.66 versus 0.39).

%%   However, Cohen's $\kappa$ of these data (0.4), which is only
%% available for our corpus, is merely on the verge between fair and
%% moderate agreement.  Nevertheless, since labels used in our
%% experiments are ordinal rather than nominal in their nature (\ie{}
%% we can compute the \emph{distance} between distinct labels, which,
%% for example, would be greater for the pair \textsc{positive}
%% vs. \textsc{negative} than for the pair \textsc{positive}
%% vs. \textsc{neutral}), we think that the Krippendorff's metric is
%% more appropriate for assessing the quality of the annotation for
%% this task.

\section{Lexicon-Based Methods}\label{sec:cgsa:lexicon-based}

The first group of approaches that we are going to explore in this
chapter using the aforementioned data are lexicon-based (LB) systems.
Just like sentiment lexicons themselves, LB methods for message-level
opinion mining have attracted a lot of attention from the very
inception of the sentiment analysis field.  Starting with the work
of~\citet{Hatzivassi:00}, who showed statistically that the mere
occurrence of a subjective adjective from an automatically compiled
polarity list was a sufficiently reliable indicator that the whole
sentence was subjective, more and more researchers began to use
lexicons in order to estimate the overall polarity of a text.

One of the first notable steps in this direction was made
by~\citet{Das:01}, who proposed an ensemble of five classifiers (two
of which were purely lexicon-based and the other three heavily relied
on lexicon features) to predict the polarity of stock messages,
achieving an accuracy of 62\% on a corpus of several hundred stock
board messages.  A much simpler method for a related task was
suggested by~\citet{Turney:02}, who determined the \emph{semantic
  orientation} (SO) of reviews by averaging the PMI scores of their
terms, getting these scores from an automatically generated sentiment
lexicon.  With this approach, the author could reach an accuracy of
74\% on a corpus of 410 manually labeled Epinions comments.  In the
same vein, \citet{Hu:04} computed the overall polarity of a sentence
by comparing the numbers of its positive and negative terms, reversing
their orientation if they appeared in a negated context.
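In its simplest form, and with $o(w) \in \{-1, +1\}$ denoting the
lexicon orientation of a polar term $w$ and $n(w)$ being 1 if $w$
appears in a negated context and 0 otherwise (both symbols are our own
shorthand and do not stem from the original work), this counting
scheme amounts to taking the sign of the following sum over all polar
terms of a sentence $s$:
\begin{equation*}
  \text{SO}(s) = \sum_{w \in s} (-1)^{n(w)}\, o(w),
\end{equation*}
so that a sentence with two positive terms and one negated positive
term, for instance, would receive a score of $1 + 1 - 1 = 1$ and be
classified as positive.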

%% Finally, \citet{Kim:04} compared three different approaches to
%% determining the polarity of a
%% sentence: \begin{inparaenum}[(i)] \item by multiplying the signs of
%% its polar terms, \item by taking the sum of their scores, and \item
%% by computing the geometric mean of these values; \end{inparaenum}
%% finding the first and the last option working best on the Document
%% Understanding Corpus.\footnote{\url{http://duc.nist.gov/}}

% % Hu and Liu, 2004
% Similarly, \citet{Hu:04} determined the semantic orientation of
% sentences in customer reviews by simply comparing the number of
% positive and negative terms found in these passages. Since the
% authors, however, were primarily interested in estimating the polarity
% towards particular product features mentioned in the clauses, they
% additionally applied a fallback strategy in case of a tie by checking
% which of the polar lexicon terms appeared closer to the features, and
% assuming the polarity of the preceding sentence if these numbers were
% also equal.

% % Taboada et al., 2004
% Largely inspired by the Appraisal theory of~\citet{Martin:00},
% \citet{Taboada:04} enhanced the original method of~\citet{Turney:02}
% by increasing the weights of polar adjectives which occurred in the
% middle and at the end of a document, and also augmenting these values
% with the affect, judgement, and appreciation scores.  Similarly to
% polarity, the appraisal scores were calculated automatically by
% computing the PMI of their cooccurrence with different pronouns using
% a web search engine.

% Polanyi and Zaenen, 2006; Kennedy and Inkpen, 2006
In~\citeyear{Polanyi:06}, \citeauthor{Polanyi:06} presented an
extensive overview and analysis of common lexicon-based sentiment
methods that existed at that time, arguing that besides considering
the lexical valence (\ie{} semantic orientation) of polar expressions,
it was also important to incorporate syntactic, discourse-level, and
extra-linguistic factors such as negations, intensifiers, modal
operators (\eg{} \emph{could} or \emph{might}), presuppositional items
(\eg{} \emph{barely} or \emph{failure}), irony, reported speech,
discourse connectors, genre, attitude assessment, and
multi-entity evaluation.  This theoretical hypothesis was also confirmed
empirically by \citet{Kennedy:06}, who investigated two ways to
determine the polarity of a customer review: In the first approach,
the authors simply compared the numbers of positive and negative terms
in the text, assigning the review to the class with the greater number
of items.  In the second attempt, they enhanced the original system
with additional information about contextual valence shifters,
increasing or decreasing the sentiment score of a term if it was
preceded by an intensifier or downtoner, and changing the polarity
sign of this score to the opposite in case of a negation.  %% With this
%% adjustment, \citeauthor{Kennedy:06} achieved a statistically
%% significant improvement, boosting the accuracy of the two-class
%% prediction from 67.9 to 69.3\%.

% Taboada et al., 2011
Finally, a seminal work on lexicon-based techniques was presented
by~\citet{Taboada:11}, who introduced a manually compiled polarity
list\footnote{The authors hand-annotated all occurrences of
  adjectives, nouns, and verbs found in a corpus of 400 Epinions
  reviews with ordinal categories ranging from -5 to 5 that reflected
  the semantic orientation of a term (positive vs.\ negative) and its
  polar strength (weak vs.\ strong).} and used this resource to
estimate the overall semantic orientation of texts.  Drawing on the
ideas of~\citet{Polanyi:06}, the authors incorporated a set of
additional heuristic rules into their computation by changing the
prior SO values of negated, intensified, and downtoned terms, ignoring
irrealis and interrogative sentences, and adjusting the weights of
specific document sections.  An extensive evaluation of this approach
showed that the manual lexicon performed much better than
automatically generated polarity lists, such as Subjectivity
Dictionary~\cite{Wilson:05}, Maryland Polarity Set~\cite{Mohammad:09},
and \textsc{SentiWordNet} of~\citet{Esuli:06c}.  Moreover, the authors
also demonstrated that their method could be successfully applied to
other topics and genres, hypothesizing that lexicon-based approaches
were in general more amenable to domain shifts than traditional
supervised machine-learning techniques.

% % Taboada et al., 2006
% Another important contribution to the development of lexicon-based
% approaches was made by~\citet{Taboada:06}, who compared three popular
% polarity lists---a PMI lexicon computed with the original method
% of~\citet{Turney:02} using the AltaVista's NEAR operator; a similar
% polarity list obtained with the help of Google's AND queries; and,
% finally, the manually compiled General Inquirer lexicon
% of~\citet{Stone:66}.  The authors evaluated these resources both
% intrinsically (by comparing them with GI entries) and extrinsically
% (by computing the polarity of 400 manually annotated Epinions
% reviews).  To estimate the overall polarity of a review for the second
% task, \citeauthor{Taboada:06} calculated the average SO value of all
% polar terms found in the review, obtaining these scores from the
% mean-normalized lexicons, and flipping the polarity sign to the
% opposite in case of the negation.

% Musto et al., 2014
It is therefore not surprising that lexicon-based systems have also
quickly found their way into the sentiment analysis of social media:
For example, one such approach, explicitly tailored to Twitter
specifics, was proposed by~\citet{Musto:14}, who examined four
different ways to compute the overall polarity scores of microblogs:
\emph{basic}, \emph{normalized}, \emph{emphasized}, and
\emph{normalized-emphasized}.  %% ; evaluating these strategies with
%% four distinct lexicons:
%% \textsc{Sen\-ti\-Word\-Net}~\cite{Esuli:06c},
%% \textsc{Word\-Net-\-Affect}~\cite{Strapparava:04},
%% \textsc{MPQA}~\cite{Wiebe:05}, and
%% \textsc{SenticNet}~\cite{Cambria:14}.
In each of these methods, the authors first split the input message
into a list of \emph{micro-phrases} based on the occurrence of
punctuation marks and conjunctions.  Afterwards, they calculated the
polarity score for each of these segments and finally estimated the
overall polarity of the whole tweet by uniting the scores of its
micro-phrases.  \citeauthor{Musto:14} obtained their best results
(58.99\% accuracy on the SemEval-2013 dataset) with the
normalized-emphasized approach, in which they averaged the polarity
scores of segments' tokens, boosting these values by 50\% for
adjectives, adverbs, nouns, and verbs; and computed the final overall
polarity of the microblog by taking the sum of all micro-phrase
scores.
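Under our reading of this description, and with $m_1, \ldots, m_k$
denoting the micro-phrases of a tweet, $s(t)$ the lexicon score of
token $t$ (assumed to be 0 for tokens that are missing from the
lexicon), and $\omega(t)$ a weight that equals 1.5 for adjectives,
adverbs, nouns, and verbs and 1 for all other tokens (all symbols are
ours), the normalized-emphasized score can be sketched as:
\begin{equation*}
  \text{pol}(\text{tweet}) = \sum_{i=1}^{k} \frac{1}{|m_i|} \sum_{t \in m_i} \omega(t)\, s(t).
\end{equation*}
For a hypothetical tweet with two micro-phrases, the first consisting
of four tokens, one of which is an adjective with the score $0.7$, and
the second comprising two tokens, one of which is a verb with the
score $-1$, this would yield $\frac{1.5 \times 0.7}{4} + \frac{1.5
  \times (-1)}{2} = 0.2625 - 0.75 = -0.4875$.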

% the authors obtained their best results using the
% \textsc{SentiWordNet} lexicon of~\citet{Esuli:06c}

% Jurek et al., 2015
Another Twitter-aware system was presented by~\citet{Jurek:15}, who
computed the positive and negative polarity of a message ($F_P$ and
$F_N$, respectively) as:
\begin{align}
  \small
  \begin{split}
  F_P &= \min\left(\frac{A_P}{2 - \log(3.5\times W_P + I_P)}, 100\right),\\
  F_N &= \max\left(\frac{A_N}{2 - \log(3.5\times W_N + I_N)}, -100\right);\label{cgsa:eq:jurek}
  \end{split}
\end{align}%
where $A_P$ and $A_N$ represent the average scores of positive and
negative lexicon terms found in the tweet; $W_P$ and $W_N$ stand for
the raw counts of polar tokens; and $I_P$ and $I_N$ denote the number
of intensifiers preceding these words.  In addition to that, before
estimating the average values, the authors modified the polarity
scores $s_w$ of all negated words $w$ using the following rule:
\begin{align*}
  \small%
\operatorname{neg}(s_w) =
    \begin{cases}
        \min\left(\frac{s_w - 100}{2}, -10\right), & \text{if } s_w > 0,\\
        \max\left(\frac{s_w + 100}{2}, 10\right), & \text{if } s_w < 0.
    \end{cases}
\end{align*}%
Furthermore, besides computing the polarity scores $F_P$ and $F_N$,
\citeauthor{Jurek:15} also determined the subjectivity degree of the
message by replacing the $A_P$ and $A_N$ terms in
Equation~\ref{cgsa:eq:jurek} with the average of conditional
probabilities of the tweet being subjective given the occurrences of
the respective polar terms.\footnote{These probabilities were
  calculated automatically on the noisily labeled data set
  of~\citet{Go:09}.}  The authors considered a microblog as neutral if
its absolute polarity was less than 25, and the subjectivity value was
not greater than 0.5.  Otherwise, they assigned a positive or negative
label to this message depending on the sign of the polarity score.
With this approach, \citeauthor{Jurek:15} achieved an accuracy
of~77.3\% on the manually annotated subset of \citeauthor{Go:09}'s
corpus and reached 74.2\% on the IMDB review dataset~\cite{Maas:11}.
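
To make Equation~\ref{cgsa:eq:jurek} more concrete, consider a
hypothetical tweet with two positive lexicon terms ($W_P = 2$) whose
average score is $A_P = 60$ and one of which is preceded by an
intensifier ($I_P = 1$).  Assuming a decimal logarithm (the base is
not specified above) and lexicon scores on a scale from $-100$ to
$100$, we obtain:
\begin{equation*}
  F_P = \min\left(\frac{60}{2 - \log_{10}(3.5 \times 2 + 1)}, 100\right)
  \approx \min\left(\frac{60}{1.097}, 100\right) \approx 54.7.
\end{equation*}
Similarly, negating a positive term with the score $s_w = 60$ turns
this value into $\min\left(\frac{60 - 100}{2}, -10\right) = -20$,
whereas a strongly positive term with $s_w = 90$ is mapped to
$\min(-5, -10) = -10$.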

% Kolchyna et al., 2015
Finally, \citet{Kolchyna:15} also explored two different ways of
computing the overall polarity of a microblog:
\begin{inparaenum}[(i)]
\item by simply averaging the scores of all lexicon terms found in the
  message and
\item by taking a signed logarithm of this average:
\end{inparaenum}
\begin{equation*}
  \text{Score}_{\log} =
  \begin{cases}
    \text{sign}(\text{Score}_{\text{AVG}})\log_{10}(|\text{Score}_{\text{AVG}}|), & %
    \text{if } |\text{Score}_{\text{AVG}}| > 0.1,\\
    0, & \text{otherwise}.
  \end{cases}
\end{equation*}%
The authors determined the final polarity of a tweet by using
$k$-means clustering, which utilized both of the above polarity values
as features.  They showed that the logarithmic strategy performed
better than the simple average solution, yielding an accuracy of
61.74\% on the SemEval-2013 corpus~\cite{Nakov:13}.
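For example, for hypothetical averages of $25$, $-25$, and $0.05$, the
logarithmic score becomes:
\begin{align*}
  \text{Score}_{\log}(25) &= \log_{10}(25) \approx 1.398,\\
  \text{Score}_{\log}(-25) &= -\log_{10}(25) \approx -1.398,\\
  \text{Score}_{\log}(0.05) &= 0.
\end{align*}
Note that, as written, this transformation preserves the sign of the
average only for $|\text{Score}_{\text{AVG}}| > 1$, since the
logarithm itself is negative for smaller absolute values.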

%% In addition to that,
% \citeauthor{Kolchyna:15} also checked whether plain lexicon scores
% could serve as useful attributes for an ML-based method.  For this
% purpose, they retrained a cost-sensitive SVM
% classifier~\cite{Masnadi:12} after extending its $n$-gram feature
% set with lexicon features, getting almost five percent accuracy
% improvement (from 86.62 to 91.17) on the IMDB movie review
% dataset~\cite{Pang:02}.

As it was unclear how each of these methods would perform on PotTS and
SB10k, we reimplemented the approaches of~\citet{Hu:04} (as a
relatively simple baseline), \citet{Taboada:11}, \citet{Musto:14},
\citet{Jurek:15}, and \citet{Kolchyna:15}, and applied these systems
to the test sets of these corpora.

Based on our comparison in Chapter~\ref{chap:snt:lex}, we chose the
Zurich Polarity List~\cite{Clematide:10} as the primary sentiment
lexicon for the tested methods.  However, a significant drawback of
this resource is that most of its entries have uniform weights, with
their polarity scores being either 0.7 or 1.  We decided to keep the
original values as is, and only multiplied the scores of negative
terms by -1, since all of the tested approaches presupposed different
signs for the terms with opposite semantic orientations.\footnote{We
  will investigate the impact of other lexicons with presumably better
  scoring later in Section~\ref{cgsa:subsec:eval:lexicons}.}
Moreover, because some analyzers (\eg{} \citeauthor{Taboada:11}
[\citeyear{Taboada:11}] and \citeauthor{Musto:14}
[\citeyear{Musto:14}]) required part-of-speech tags of lexicon
entries, we automatically tagged all terms in this polarity list with
the \textsc{TreeTagger}~\cite{Schmid:95}, choosing the most probable
part-of-speech tag for each entry and also using the tag sequences
whose probabilities were at least two times lower than the likelihood
of the best assignment, duplicating the lexicon entries in the second
case.

Furthermore, since all of the systems except for that
of~\citet{Kolchyna:15} by default returned continuous real values, but
our evaluation required discrete polarity labels (\emph{positive},
\emph{negative}, or \emph{neutral}), we discretized the results of
these approaches using the following simple procedure: We first
determined the optimal threshold values for each particular polar
class on the training and development sets,\footnote{Since none of the
  methods required training or involved any sophisticated
  hyper-parameters, we used both training and development data to
  optimize the threshold scores.} and then derived polarity labels for
the test messages by comparing their predicted SO scores with these
thresholds.  To achieve the former goal (\ie{} to find the optimal
thresholds), we exhaustively searched through all unique polarity
values assigned to the training and development instances and checked
whether using these values as a boundary between two adjacent polarity
classes (sorted in ascending order of their positivity) would increase
the overall macro-\F{} on the training and development sets.
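
One way to formalize this search, under the assumption that the two
boundaries are applied as half-open intervals (the notation below is
introduced only for illustration), is to map an SO score $s$ to a
label as
\begin{equation*}
  \hat{y}(s; \theta_1, \theta_2) =
  \begin{cases}
    \text{negative}, & \text{if } s < \theta_1,\\
    \text{neutral}, & \text{if } \theta_1 \leq s < \theta_2,\\
    \text{positive}, & \text{if } s \geq \theta_2,
  \end{cases}
\end{equation*}
and to pick the thresholds from the set $V$ of unique scores observed
on the training and development data so that they maximize the
macro-averaged \F{}:
\begin{equation*}
  (\theta_1^{*}, \theta_2^{*}) =
  \operatorname*{arg\,max}_{\theta_1, \theta_2 \in V,\ \theta_1 \leq \theta_2}
  \frac{F_{pos}(\theta_1, \theta_2) + F_{neg}(\theta_1, \theta_2)}{2}.
\end{equation*}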

The final results of this evaluation are shown in
Table~\ref{snt-cgsa:tbl:lex-res}.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first column
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\

      % Hu-Liu Commands:
      % -----------------
      % cgsa_sentiment train -t hu-liu -l cgsa/data/lexicons/zrch.manual.txt \
      % data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv
      %
      % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      % > data/PotTS/preprocessed/predicted/hu-liu/hu-liu.test
      %
      % cgsa_evaluate data/PotTS/preprocessed/test/ \
      % data/PotTS/preprocessed/predicted/hu-liu/hu-liu.test
      %
      % Hu-Liu Results:
      % ----------------
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.75      0.76      0.76       680
      % negative       0.53      0.43      0.47       287
      % neutral       0.67      0.73      0.69       558
      % avg / total       0.68      0.69      0.68      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 61.51%
      % Micro-Averaged F1-Score (All Classes): 68.5246%

      HL & 0.75 & \textbf{0.76} & \textbf{0.76} & %
       0.53 & 0.43 & 0.47 & %
       0.67 & 0.73 & 0.69 & %
       \textbf{0.615} & \textbf{0.685}\\

       % Taboada Commands:
       % -----------------
       % cgsa_sentiment train -t taboada -l cgsa/data/lexicons/zrch.manual.txt \
       % data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv
       %
       % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
       % > data/PotTS/preprocessed/predicted/taboada/taboada.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/taboada/taboada.test
       %
       % Taboada Results:
       % ----------------
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.77      0.71      0.74       680
       % negative       0.54      0.39      0.45       287
       % neutral       0.63      0.77      0.69       558
       % avg / total       0.67      0.67      0.67      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 59.66%
       % Micro-Averaged F1-Score (All Classes): 67.4098%
      TBD & \textbf{0.77} & 0.71 & 0.74 & %
        \textbf{0.54} & 0.39 & 0.45 & %
        0.63 & 0.77 & 0.69 & %
        0.597 & 0.674\\

       % Musto Commands:
       % -----------------
       % cgsa_sentiment train -t musto -l cgsa/data/lexicons/zrch.manual.txt \
       % data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv
       %
       % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
       % > data/PotTS/preprocessed/predicted/musto/musto.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/musto/musto.test
       %
       % Musto Results:
       % ----------------
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.75      0.72      0.74       680
       % negative       0.48      0.47      0.48       287
       % neutral       0.68      0.72      0.70       558
       % avg / total       0.68      0.68      0.67      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 60.56%
       % Micro-Averaged F1-Score (All Classes): 67.5410%

       MST & 0.75 & 0.72 & 0.74 & %
        0.48 & \textbf{0.47} & \textbf{0.48} & %
        \textbf{0.68} & 0.72 & 0.7 & %
        0.606 & 0.675\\

       % Jurek Commands:
       % ---------------
       % cgsa_sentiment train -t jurek -l cgsa/data/lexicons/zrch.manual.txt \
       % data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv
       %
       % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
       % > data/PotTS/preprocessed/predicted/jurek/jurek.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/jurek/jurek.test
       %
       % Jurek Results:
       % ----------------
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.60      0.31      0.41       680
       % negative       0.42      0.20      0.27       287
       % neutral       0.43      0.80      0.56       558
       % avg / total       0.50      0.47      0.44      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 33.94%
       % Micro-Averaged F1-Score (All Classes): 46.6885%

      JRK & 0.6 & 0.31 & 0.41 & %
       0.42 & 0.2 & 0.27 & %
       0.43 & 0.8 & 0.56 & %
       0.339 & 0.467\\

       % Kolchyna Commands:
       % ------------------
       % cgsa_sentiment train -t kolchyna -l cgsa/data/lexicons/zrch.manual.txt \
       % data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv
       %
       % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
       % > data/PotTS/preprocessed/predicted/kolchyna/kolchyna.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/kolchyna/kolchyna.test
       %
       % Kolchyna Results:
       % -----------------
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.71      0.72      0.71       680
       % negative       0.34      0.17      0.22       287
       % neutral       0.66      0.82      0.73       558
       % avg / total       0.62      0.65      0.63      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 46.76%
       % Micro-Averaged F1-Score (All Classes): 65.1148%

      KLCH & 0.71 & 0.72 & 0.71 & %
       0.34 & 0.17 & 0.22 & %
       0.66 & \textbf{0.82} & \textbf{0.73} & %
       0.468 & 0.651\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\

      % Training hu-liu
      % Testing hu-liu
      % Evaluating hu-liu
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.49      0.62      0.55       354
      % negative       0.27      0.33      0.30       212
      % neutral       0.73      0.62      0.67       930
      % avg / total       0.61      0.58      0.59      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 42.09%
      % Micro-Averaged F1-Score (All Classes): 57.6872%
      HL & \textbf{0.49} & \textbf{0.62} & \textbf{0.55} & %
        0.27 & 0.33 & 0.3 & %
        \textbf{0.73} & 0.62 & 0.67 & %
        \textbf{0.421} & 0.577\\

        % Training taboada
        % Testing taboada
        % Evaluating taboada
        % General Statistics:
        % precision    recall  f1-score   support
        % positive       0.48      0.60      0.53       354
        % negative       0.24      0.27      0.25       212
        % neutral       0.72      0.63      0.67       930
        % avg / total       0.59      0.57      0.58      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 39.33%
        % Micro-Averaged F1-Score (All Classes): 57.0187%

      TBD & 0.48 & 0.6 & 0.53 & %
        0.24 & 0.27 & 0.25 & %
        0.72 & 0.63 & 0.67 & %
        0.393 & 0.57\\

        % Training musto
        % Testing musto
        % Evaluating musto
        % General Statistics:
        % precision    recall  f1-score   support
        % positive       0.45      0.49      0.47       354
        % negative       0.29      0.35      0.32       212
        % neutral       0.70      0.64      0.67       930
        % avg / total       0.59      0.57      0.58      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 39.54%
        % Micro-Averaged F1-Score (All Classes): 56.7513%
      MST & 0.45 & 0.49 & 0.47 & %
        0.29 & \textbf{0.35} & \textbf{0.32} & %
        0.7 & 0.64 & 0.67 & %
        0.395 & 0.568\\

        % Training jurek
        % Testing jurek
        % Evaluating jurek
        % General Statistics:
        % precision    recall  f1-score   support
        % positive       0.41      0.39      0.40       354
        % negative       0.36      0.26      0.30       212
        % neutral       0.69      0.75      0.72       930
        % avg / total       0.58      0.59      0.58      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 35.06%
        % Micro-Averaged F1-Score (All Classes): 59.2246%
      JRK & 0.41 & 0.39 & 0.4 & %
        \textbf{0.36} & 0.26 & 0.3 & %
        0.69 & 0.75 & 0.72 & %
        0.351 & 0.592\\

        % Training kolchyna
        % Testing kolchyna
        % Evaluating kolchyna
        % General Statistics:
        % precision    recall  f1-score   support
        % positive       0.39      0.22      0.28       354
        % negative       0.34      0.13      0.19       212
        % neutral       0.66      0.86      0.75       930
        % avg / total       0.55      0.61      0.56      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 23.47%
        % Micro-Averaged F1-Score (All Classes): 60.6283%
      KLCH & 0.39 & 0.22 & 0.28 & %
        0.34 & 0.13 & 0.19 & %
        0.66 & \textbf{0.86} & \textbf{0.75} & %
        0.235 & \textbf{0.606}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of lexicon-based MLSA methods]{
      Results of lexicon-based MLSA methods\\
      {\small HL~--~\citet{Hu:04}, TBD~--~\citet{Taboada:11}, MST~--~\citet{Musto:14},
        JRK~--~\citet{Jurek:15}, KLCH~--~\citet{Kolchyna:15}}}\label{snt-cgsa:tbl:lex-res}
  \end{center}
\end{table}

As we can see, the performance of the tested methods varies
considerably across polarity classes, but follows more or less the
same pattern on both datasets: For example, the simplest approach
of~\citet{Hu:04} achieves surprisingly good quality at predicting
positive tweets, showing the highest recall and \F{}-measure on the
PotTS corpus and yielding the best overall scores for this polarity
class on the SB10k set.  Moreover, on the latter data, it also
outperforms all other systems in terms of the precision of neutral
microblogs.  Combined with its generally good results on other
metrics, this classifier attains the highest macro-averaged
\F{}-score on both corpora and also achieves the best
micro-\F{} on the PotTS test set.
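
For illustration, the macro-averaged score in
Table~\ref{snt-cgsa:tbl:lex-res} can be approximately reproduced from
the rounded per-class figures, assuming that it is the unweighted mean
of the \F{}-scores of the positive and negative classes.  For the HL
system on PotTS, this gives:
\[
  \frac{0.76 + 0.47}{2} = 0.615.
\]
Since the reported values were presumably computed from unrounded
scores, small deviations in the third decimal place are to be expected
when this check is repeated for the other systems.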

The approach of~\citet{Taboada:11}, which can be viewed as an
extension of the previous method, surpasses the HL classifier on PotTS
w.r.t.\ the precision of positive and negative messages, but still
loses about 0.02 macro-\F{} on this corpus and almost 0.03 on SB10k
due to a lower recall of both polar classes.  A better performance in
this regard is shown by the analyzer of~\citet{Musto:14}, whose fairly
strong recall of negative tweets leads to the best \F{}-score for this
polarity.  Unfortunately, since this semantic orientation is the most
underrepresented one in both corpora, this success is not reflected in
the overall statistics: Although this method ranks second in terms of
the macro-averaged \F{}, it lags behind its competitors with regard to
the micro-averaged value on the SB10k corpus.

Finally, the system of~\citet{Kolchyna:15} shows very strong recall
and \F{}-scores for the neutral class on both sets and also achieves
the best accuracy (0.606) on the SB10k data, but its quality for the
remaining two polarities is fairly suboptimal, with the \F{}-scores
for these semantic orientations ranking last or second to last in both
cases.

\subsection{Polarity-Changing Factors}\label{subsec:cgsa:lex-methods:pol-change}

Since the analysis of context factors is commonly considered to be one
of the most important components of any lexicon-based MLSA system, and
because the method with the simplest approach to this task achieved
surprisingly good results, outperforming other more sophisticated
competitors, we decided to recheck the utility of this module for all
classifiers.  In order to do so, we successively deactivated, one by
one, parts of the classifiers that analyzed the surrounding context of
polar terms and recomputed the \F{}-scores of all systems after these
changes.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.16\columnwidth} % first columm
        *{10}{>{\centering\arraybackslash}p{0.082\columnwidth}}}
      \toprule
      \multirow{2}{0.15\columnwidth}{%
      \bfseries Polarity-Changing\newline Factors} & %
      \multicolumn{10}{c}{\bfseries System Scores}\\
      & \multicolumn{2}{c}{\bfseries HL} & \multicolumn{2}{c}{\bfseries TBD} %
      & \multicolumn{2}{c}{\bfseries MST} %
      & \multicolumn{2}{c}{\bfseries JRK} & \multicolumn{2}{c}{\bfseries KLCH}\\%
      \cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7} %
      \cmidrule(lr){8-9}\cmidrule(lr){10-11}

      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{}\\\midrule

      \multicolumn{11}{c}{\cellcolor{cellcolor}PotTS}\\
      All & 0.615 & 0.685 & 0.593 & 0.671 & 0.606 & 0.675 %
      & 0.339 & 0.467 & 0.468 & 0.651\\

      % Training hu-liu
      % Testing hu-liu
      % Evaluating hu-liu
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.76      0.77      0.76       680
      % negative       0.55      0.43      0.48       287
      % neutral       0.67      0.73      0.70       558
      % avg / total       0.69      0.69      0.69      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 62.21%
      % Micro-Averaged F1-Score (All Classes): 69.1148%

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.78      0.71      0.74       680
      % negative       0.54      0.39      0.45       287
      % neutral       0.62      0.78      0.69       558
      % avg / total       0.67      0.67      0.67      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 59.63%
      % Micro-Averaged F1-Score (All Classes): 67.2787%

      % Training musto
      % Testing musto
      % Evaluating musto
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.76      0.78      0.77       680
      % negative       0.57      0.47      0.51       287
      % neutral       0.68      0.72      0.70       558
      % avg / total       0.69      0.70      0.70      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 64.05%
      % Micro-Averaged F1-Score (All Classes): 69.9672%

      % Training jurek
      % Testing jurek
      % Evaluating jurek
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.61      0.32      0.42       680
      % negative       0.47      0.22      0.30       287
      % neutral       0.43      0.80      0.56       558
      % avg / total       0.52      0.47      0.45      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 35.67%
      % Micro-Averaged F1-Score (All Classes): 47.3443%

      % Training kolchyna
      % Testing kolchyna
      % Evaluating kolchyna
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.66      0.24      0.35       680
      % negative       0.35      0.19      0.24       287
      % neutral       0.44      0.87      0.58       558
      % avg / total       0.52      0.46      0.42      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 29.82%
      % Micro-Averaged F1-Score (All Classes): 46.2951%
      --Negation & 0.622 & 0.691 & 0.596 & 0.672 & \textbf{0.641} & %
       \textbf{0.7} & 0.357 & 0.473 & 0.298 & 0.463\\


       % Training taboada
       % Testing taboada
       % Evaluating taboada
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.77      0.71      0.74       680
       % negative       0.54      0.39      0.45       287
       % neutral       0.62      0.78      0.69       558
       % avg / total       0.67      0.67      0.67      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 59.53%
       % Micro-Averaged F1-Score (All Classes): 67.2131%

       % Training jurek
       % Testing jurek
       % Evaluating jurek
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.60      0.31      0.41       680
       % negative       0.42      0.20      0.27       287
       % neutral       0.43      0.80      0.56       558
       % avg / total       0.50      0.47      0.44      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 33.94%
       % Micro-Averaged F1-Score (All Classes): 46.6885%
      --Intensification & \NA{} & \NA{} & 0.595 & 0.672 & \NA{} &  %
      \NA{} & 0.339 & 0.467 & \NA{} & \NA{}\\

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.77      0.75      0.76       680
      % negative       0.54      0.41      0.47       287
      % neutral       0.64      0.75      0.69       558
      % avg / total       0.68      0.68      0.68      1525
      % Macro-Averaged F1-Score (Positive and Negative Classes): 61.29%
      % Micro-Averaged F1-Score (All Classes): 68.3934%
      --Other Modifiers & \NA{} & \NA{} & 0.613 & 0.684 & \NA{} &  %
      \NA{} & \NA{} & \NA{} & \NA{} & \NA{}\\

      \multicolumn{11}{c}{\cellcolor{cellcolor}SB10k}\\

      All & \textbf{0.421} & 0.577 & 0.392 & 0.569 & 0.395 & 0.568 %
      & 0.351 & 0.592 & 0.235 & 0.606\\

      % Training hu-liu
      % Testing hu-liu
      % Evaluating hu-liu
      % General Statistics:
      % precision    recall  f1-score   support
      % negative       0.26      0.31      0.28       212
      % neutral       0.73      0.62      0.67       930
      % positive       0.48      0.63      0.55       354
      % avg / total       0.61      0.58      0.59      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 41.46%
      % Micro-Averaged F1-Score (All Classes): 57.5535%

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % negative       0.24      0.28      0.26       212
      % neutral       0.72      0.63      0.67       930
      % positive       0.48      0.60      0.53       354
      % avg / total       0.60      0.57      0.58      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 39.50%
      % Micro-Averaged F1-Score (All Classes): 57.1524%

      % Training musto
      % Testing musto
      % Evaluating musto
      % General Statistics:
      % precision    recall  f1-score   support
      % negative       0.26      0.31      0.28       212
      % neutral       0.70      0.63      0.67       930
      % positive       0.45      0.51      0.48       354
      % avg / total       0.58      0.56      0.57      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 38.07%
      % Micro-Averaged F1-Score (All Classes): 55.8824%

      % Training jurek
      % Testing jurek
      % Evaluating jurek
      % General Statistics:
      % precision    recall  f1-score   support
      % negative       0.31      0.17      0.22       212
      % neutral       0.69      0.74      0.71       930
      % positive       0.40      0.42      0.41       354
      % avg / total       0.57      0.59      0.57      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 31.55%
      % Micro-Averaged F1-Score (All Classes): 58.6230%

      % Training kolchyna
      % Testing kolchyna
      % Evaluating kolchyna
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.41      0.18      0.25       354
      % negative       0.27      0.15      0.19       212
      % neutral       0.66      0.88      0.76       930
      % avg / total       0.55      0.61      0.56      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 21.78%
      % Micro-Averaged F1-Score (All Classes): 60.8957%
      --Negation & 0.415 & 0.576 & 0.395 & 0.572 & 0.381 & %
        0.559 & 0.316 & 0.586 & 0.218 & \textbf{0.609}\\

        % Training taboada
        % Testing taboada
        % Evaluating taboada
        % General Statistics:
        % precision    recall  f1-score   support
        % negative       0.25      0.28      0.26       212
        % neutral       0.73      0.63      0.67       930
        % positive       0.48      0.61      0.54       354
        % avg / total       0.60      0.58      0.58      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 39.95%
        % Micro-Averaged F1-Score (All Classes): 57.5535%

        % Training jurek
        % Testing jurek
        % Evaluating jurek
        % General Statistics:
        % precision    recall  f1-score   support
        % negative       0.36      0.26      0.30       212
        % neutral       0.69      0.74      0.71       930
        % positive       0.40      0.40      0.40       354
        % avg / total       0.58      0.59      0.58      1496
        % Macro-Averaged F1-Score (Positive and Negative Classes): 35.18%
        % Micro-Averaged F1-Score (All Classes): 59.0241%
      --Intensification & \NA{} & \NA{} & 0.4 & 0.576 & \NA{} &  %
      \NA{} & 0.352 & 0.59 & \NA{} & \NA{}\\

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.48      0.61      0.54       354
      % negative       0.25      0.31      0.27       212
      % neutral       0.72      0.61      0.66       930
      % avg / total       0.60      0.57      0.58      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 40.61%
      % Micro-Averaged F1-Score (All Classes): 56.6176%
      --Other Modifiers & \NA{} & \NA{} & 0.406 & 0.566 & \NA{} &  %
      \NA{} & \NA{} & \NA{} & \NA{} & \NA{}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption{Effect of polarity-changing factors on lexicon-based MLSA
      methods}\label{snt-cgsa:tbl:lex-res-ablation}
  \end{center}
\end{table}

As we can see from the results in
Table~\ref{snt-cgsa:tbl:lex-res-ablation}, various methods respond in
different ways to this ablation: For example, the scores of the
\citeauthor{Hu:04} system improve on the PotTS corpus, but degrade on
the SB10k dataset after switching off the negation handling.  The same
situation can also be observed with the analyzers of \citet{Musto:14}
and~\citet{Jurek:15}.  The classifier of~\citet{Taboada:11}, however,
benefits from this deactivation in both cases, and the approach
of~\citet{Kolchyna:15}, conversely, shows a performance drop on either
dataset, the only exception being the micro-averaged \F{} on the
SB10k data, which unexpectedly improves from 0.606 to 0.609.%%   A closer
%% look at the actual results revealed that the latter change is mostly
%% due to the increased bias of this classifier towards the neutral
%% class: The recall of this orientation raised from 0.86 to 0.88 and
%% consequently pushed up the overall accuracy on all messages.

As to the intensification handling, we can see that only two
approaches (TBD and JRK) have this component at all.  As in the
previous case, the Taboada system profits from its deactivation, with
the macro-averaged \F{}-score going up by 0.002 on PotTS and 0.008 on
the SB10k corpus, and the micro-averaged \F{} rising by 0.001 and
0.007, respectively.  A more varied situation is observed with the
analyzer of~\citet{Jurek:15}, whose PotTS results are virtually
unaffected by this change, whereas on \citeauthor{Cieliebak:17}'s
dataset its macro-averaged \F{} slightly increases and its
micro-averaged score slightly decreases.

Finally, ``other modifiers'' (such as irrealis and interrogative
clauses) only play a role as a polarity-changing factor in the system
of~\citet{Taboada:11} and, as we can see from the figures, do more
harm than good there: Deactivating this part boosts the
macro-averaged \F{}-scores on PotTS and SB10k by~0.02 and~0.014,
respectively.  At the same time, the micro-averaged result of this
system climbs from 0.671 to 0.684 on the former dataset, but drops
from~0.569 to~0.566 on the latter corpus.
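
Using the TBD columns of Table~\ref{snt-cgsa:tbl:lex-res-ablation}
(rows ``All'' and ``--Other Modifiers''), these differences can be
checked directly; the figures below are the rounded table entries, so
the last digit should be read with some caution:
\[
  \begin{array}{ll}
    \mbox{PotTS:} & 0.613 - 0.593 = 0.020 \mbox{ (macro)}, \quad
                    0.684 - 0.671 = 0.013 \mbox{ (micro)},\\
    \mbox{SB10k:} & 0.406 - 0.392 = 0.014 \mbox{ (macro)}, \quad
                    0.566 - 0.569 = -0.003 \mbox{ (micro)}.
  \end{array}
\]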

\subsection{Error Analysis}\label{subsec:cgsa:lex-methods:err-analysis}

In order to get a better intuition about the strengths and weaknesses
of each particular classifier, we additionally collected a set of
errors that were specific to only one of the above systems and will
discuss some of these cases here in detail.

The first such error, which was made by the system
of~\citet{Taboada:11}, is shown in
Example~\ref{snt:cgsa:exmp:taboada-error-0}.  Here, a strongly
positive tweet describing one's excitement about a technical report
was erroneously classified as neutral despite the presence of the
prototypical positive term ``gut'' (\emph{good}) in its superlative
form ``beste'' (\emph{best}).  Unfortunately, it is the degree of
comparison which becomes fatal in this case: According to the
implementation of~\citet{Taboada:11}, any superlative adjective has to
be preceded by the definite article and a verb in order to be
considered as a polar term for the final SO computation. Although the
adjective ``beste'' (\emph{best}) can fulfill the first criterion (it
immediately follows the determiner ``der'' [\emph{the}]), the lack of
the preceding verb nullifies its effect.

\begin{example}[An Error Made by the System of~\citeauthor{Taboada:11}]\label{snt:cgsa:exmp:taboada-error-0}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape Der beste Microsoft Knowledgebase-Artikel, den ich je gelesen habe.}\\
  \noindent The best Microsoft-Knowledgebase article I've ever read.\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

\noindent Another error of this method is shown in
Example~\ref{snt:cgsa:exmp:taboada-error-1}.  This time, the presence
of the colloquial term ``verarschen'' (\emph{to hoax}) suggests that
the tweet at hand is negative.  Alas, the occurrence of another verb
(``wollt'' [\emph{wanna}]) is interpreted as an irrealis clue, which
prevents further SO computation and leads to a zero score for the whole
message.\footnote{Please note that the occurrence of the question mark
  does not affect the sentiment score because ``?!'' is not included
  in the list of valid punctuation marks in the original
  implementation.}

\begin{example}[An Error Made by the System of~\citeauthor{Taboada:11}]\label{snt:cgsa:exmp:taboada-error-1}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape Die Konklave w\"ahlt den Papst und dann sagen sie Gott war es --- Wollt ihr mich verarschen ?!}\\
  \noindent The conclave elects the Pope and then they say it was God
  --- do you wanna hoax me ?!\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

At this point, we can already see that the main flaws of the TBD
approach apparently stem from its overly coarse rules, which, in
addition, are not always valid in German, whose word order is
significantly laxer than that of English.\footnote{In order to check
  this claim, we temporarily deactivated the above two heuristics
  (predicate check for superlative adjectives and irrealis blocking
  by modal verbs) and recomputed the scores of this system, obtaining
  in both cases an improvement of almost one percent on either
  corpus.}

Returning to our error analysis, let us look at another erroneous
case shown in Example~\ref{snt:cgsa:exmp:musto-error-0}. This time,
the system of~\citet{Musto:14} incorrectly assigned the neutral label
to a positive tweet even though the positive term ``gut''
(\emph{good}) again appears in this message.  As it turns out, the
occurrence of this word is still insufficient for the classifier to
predict the positive class although this term has the highest possible
positive score in the lexicon (1.0), which is additionally boosted by
a factor of 1.5, since this word is an adjective.  But the decisive
factor in this case is the length of the tweet: since this approach
relies on the average SO-score for all words in a sentence, the value
1.5 of the only positive term is divided by 7 (the length of the
sentence) and drops down to 0.214, which is below the threshold for
the positive class (0.267).
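
\noindent Spelled out with the figures quoted above, and assuming that
the message score is the plain average over all seven tokens of the
sentence, the computation reads:
\[
  \frac{1.0 \times 1.5}{7} = \frac{1.5}{7} \approx 0.214 < 0.267,
\]
so that the message falls short of the positive threshold and is
labeled as neutral.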

\begin{example}[An Error Made by the System of~\citeauthor{Musto:14}]\label{snt:cgsa:exmp:musto-error-0}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    Mensch Meier, Mensch Meier! Das sieht gut aus f\"ur
    die \%User:}\\
  \noindent Gosh Meier, Gosh Meier! It looks good for
  the \%User:\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

\noindent As it turns out, this kind of mistake is by far the most
common type of error characteristic of the MST system.  Further
examples of such incorrect decisions are provided in
Example~\ref{snt:cgsa:exmp:musto-error-1}:
\begin{example}[Errors Made by the System of~\citeauthor{Musto:14}]\label{snt:cgsa:exmp:musto-error-1}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    Der \%User tut echt geile musik machen. Nichts mit Boyband hier.}\\
  \noindent The \%User is making really great music. Nothing with Boyband here.\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}\\[2\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    Diese S5E5 Episode mit den Zug\"uberfall war wieder genial! BreakingBad}\\
  \noindent This S5E5 episode with train robbery was brilliant again! BreakingBad\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

A different kind of problem is experienced by the approach
of~\citet{Jurek:15}, which apparently has difficulties with correctly
predicting the positive class.  A deeper analysis of its
misclassifications revealed that the reason for this is relatively
simple: Because this classifier uses conditional probabilities of
polar terms instead of their original lexicon scores, and we have
estimated these probabilities on the noisily labeled German Twitter
Snapshot, which was extremely biased towards the positive class (see
Table~\ref{snt-cgsa:tbl:corp-dist}), all positive lexicon entries
received extremely high scores.  As a consequence, even a single
occurrence of a positive term in a message outweighed the effect of
any negative expressions, even if they were more frequent in that
tweet.  This is, for instance, the case in
Example~\ref{snt:cgsa:exmp:jurek-error-0} where the score of the
(questionable) positive expression ``Normal'' (\emph{normally}) is
greater than the absolute sum of two negative values for the terms
``sich beschweren'' (\emph{to complain}) and ``ekelhaft''
(\emph{disgusting}).

\begin{example}[An Error Made by the System of~\citeauthor{Jurek:15}]\label{snt:cgsa:exmp:jurek-error-0}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    Normal bin ich ja nicht der mensch dwer sich beschwert wegen dem
    essen aber diese Pizza von Joeys\ldots{} boah wie ekelhaft}\\
  \noindent Normally I'm not a person who complains about food but
  this pizza from Joeys\ldots{} Boah it's so disgusting\\[0.65em]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}
\end{example}

%% Kolchyna
The same problem also afflicts the system of~\citeauthor{Kolchyna:15},
whose error is shown in Example~\ref{snt:cgsa:exmp:kolchyna-error-0}.
In contrast to the previous approaches, which mainly rely on manually
designed heuristic rules, this method makes its decisions using a
trained $k$-NN classifier.  Nevertheless, its prediction in the
provided case is still incorrect as it evidently confuses the positive
class with the neutral polarity.

\begin{example}[An Error Made by the System of~\citeauthor{Kolchyna:15}]\label{snt:cgsa:exmp:kolchyna-error-0}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape das
    H\"ort sich echt Super an! \%PosSmiley macht sami nicht auch so
    ein Video? Noah s\"usse beste Freunde! \heart{} \%User isilie
    saminator}\\
  \noindent It sounds really fantastic! \%PosSmiley won't sami also
  make such a video? Noah's sweet best friends! \heart{} \%User isilie
  saminator\\[0.65em]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}\\
\end{example}

\noindent In order to understand the reason for this
misclassification, we first looked at the initial SO scores computed
by the Kolchyna analyzer.  As it turned out, both values that were
used as features by the internal $k$-NN predictor of this system (the
average SO score of all polar terms found in the message and the
logarithm of this average) were relatively high, amounting to 33.42
and 2.52, respectively.  But a closer look at the selected nearest
neighbors revealed that, despite these high SO values, the three
closest neighbors of this microblog were neutral, as we can see from
the list below:

\begin{enumerate}
\item \textcolor{darkred}{\bfseries Tweet:} ``Not in my backyard''
  -Mentalit\"at dt. Politik: ``N\"achster Castor geht wohl doch nach
  Gorleben\ldots{} -\%Link antiatom''\\
  \textit{``Not in my backyard'' -Mentality of German politics: ``Next Castor will probably go to Gorleben after all\ldots{} -\%Link antiatom''}\\
  \textcolor{darkred}{\bfseries Label:}~\textcolor{black}{neutral}\\
  \textcolor{darkred}{\bfseries Distance:}~$\expnumber{6.83}{-03}$;

\item \textcolor{darkred}{\bfseries Tweet:} Kanzlerin im Google-Hangout: ``Die Technik soll sich mal bem\"uhen''\\
  \textit{Chancellor in Google-Hangout: ``The technology should make an effort''}\\
  \textcolor{darkred}{\bfseries Label:}~\textcolor{black}{neutral}\\
  \textcolor{darkred}{\bfseries Distance:}~$\expnumber{1.6}{-02}$;\label{snt:cgsa:exmp:kolchyna-error-0.1}

\item \textcolor{darkred}{\bfseries Tweet:} Kanzlerin im
  Google-Hangout: ``Die Technik soll sich mal
  bem\"uhen''\\ \textit{Chancellor in Google-Hangout: ``The technology
    should make an effort''}\\ \textcolor{darkred}{\bfseries
    Label:}~\textcolor{black}{neutral}\\ \textcolor{darkred}{\bfseries
    Distance:}~$\expnumber{1.6}{-02}$;\footnote{Please note that this
    tweet is not a duplicate of the previous microblog, but a
    different message (with its distinct message id), which, however,
    has the same wording.}\label{snt:cgsa:exmp:kolchyna-error-0.2}

\item \textcolor{darkred}{\bfseries Tweet:} W\"unsche mir ein Format wie zdflogin auch f\"ur das \%User. Viele Themen, klare Aussagen. Sch\"ones Special \%User zur Landtagswahl! \%PosSmiley\\
  \textit{Wish \%User had a format like zdflogin. Many topics, clear statements. Nice special \%User on the state election! \%PosSmiley}\\
  \textcolor{darkred}{\bfseries Label:}~\textcolor{green3}{positive}\\
  \textcolor{darkred}{\bfseries Distance:}~$\expnumber{2.1}{-02}$;

\item \textcolor{darkred}{\bfseries Tweet:} Ich bin ja so gespannt ob
  die FDP im September erst den Zahn\"arzten und dann den Apothekern
  mit Geschenken dankt, oder anders rum\ldots\\ \textit{I'm so curious
    whether FDP will first give gifts to dentists and then to
    pharmacists in September, or whether it'll be vice
    versa}\\ \textcolor{darkred}{\bfseries
    Label:}~\textcolor{green3}{positive}\\ \textcolor{darkred}{\bfseries
    Distance:}~$\expnumber{4.12}{-02}$;
\end{enumerate}

\noindent Even more surprisingly, the SO scores of the neighboring
neutral instances were also relatively high: In the first
microblog, for example, the system recognized two polar terms: the
English word ``Not'', which was confused with the German term ``Not''
(\emph{distress}), and ``n\"achster'' (\emph{next}).  Another polar
expression (``sich bem\"uhen'' [\emph{to make an effort}]) was found
in messages~\ref{snt:cgsa:exmp:kolchyna-error-0.1}
and~\ref{snt:cgsa:exmp:kolchyna-error-0.2}.  Although two of these
terms (``Not'' and ``sich bem\"uhen'') had a negative label in the
sentiment lexicon, their conditional probability of being associated
with the positive class was more than ten times higher than their
probability of appearing in a negative microblog (according to the
computed statistics).  As a consequence of this positive probability
bias, many neutral tweets from the training set ended up in close
vicinity to actual positive examples.
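
To make the effect of these neighbors concrete, the following sketch
tallies the labels of the five neighbors listed above.  It assumes
(as our simplification, not as a documented detail of the original
system) that the classifier takes an unweighted majority vote over
exactly these $k = 5$ instances:
\[
  \begin{array}{rcl}
    \mathrm{votes}(\mathrm{neutral}) & = & 3,\\
    \mathrm{votes}(\mathrm{positive}) & = & 2,\\
    \mathrm{votes}(\mathrm{negative}) & = & 0,
  \end{array}
  \qquad
  \hat{y} = \mathop{\mathrm{argmax}}\limits_{c}\,\mathrm{votes}(c)
  = \mathrm{neutral}.
\]
Weighting the votes by inverse distance would not change this outcome
here, because the three neutral neighbors are also the three closest
ones.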

As we can see, lexicon-based methods experience various kinds of
problems when predicting the polarity of short, casually written
microblogs: Some of these systems apply rules that are too specific to
a particular language and domain, so that they do not generalize well
to German tweets; others rely on noisy statistics, which might be
heavily skewed towards just one polarity.  We should now check whether
other approaches to message-level sentiment analysis (which rely on
completely different principles and paradigms) are also susceptible to
these kinds of errors.

\section{Machine-Learning Methods}\label{sec:cgsa:ml-based}

Despite their immense popularity, linguistic plausibility, and ease
of implementation, lexicon-based approaches have often been
criticized for the rigidity of their classification\footnote{Since
  these systems only rely on the precomputed weights of lexicon
  entries, considering these coefficients as constant, their decision
  boundaries frequently appear to be suboptimal as many terms might
  have different polarity and intensity values depending on the
  domain~\cite[see][]{Eisenstein:17,Yang:17}.} and their inability to
incorporate additional, non-lexical attributes into their final
decisions.  Moreover, as noted by~\citet{Pang:02} and also confirmed
empirically by~\citet{Riloff:03} and \citet{Gamon:04}, many linguistic
expressions that actually correlate with the subjectivity and polarity
of a sentence (\eg{} exclamation marks or spelling variations) are
very unlikely to be included in a sentiment lexicon even by a human
expert.  As a consequence, with the emergence of manually
annotated corpora, lexicon-based systems have been gradually
superseded by supervised machine-learning techniques.

One of the first steps in this direction was taken
by~\citet{Wiebe:99}, who used a Na\"{\i}ve Bayes classifier to
differentiate between subjective and objective statements.  Using
binary features that reflected the presence of a pronoun, an
adjective, a cardinal number, or a modal verb in the analyzed
sentence, the authors achieved an accuracy of~72.17\% on the two-class
prediction task (differentiating between subjective and objective
classes), outperforming the majority class baseline by more than 20
percentage points.  An even better result (81.5\%) could be reached
when the dataset was restricted only to the examples with the most
confident annotation.

Inspired by this success,~\citet{Yu:03} presented a more elaborate
system in which they first distinguished between subjective and
objective documents, then differentiated between polar and neutral
sentences, and, finally, determined the polarity of these clauses.  As
in the previous case, the authors used a Na\"{\i}ve Bayes predictor
for the document-level task, reaching a remarkable \F-score of~0.96 on
this objective, and applied an ensemble of NB systems to predict the
subjectivity of single sentences.  To determine the semantic
orientation of subjective clauses, \citeauthor{Yu:03} averaged the
polarity scores of their tokens, obtaining these scores from an
automatically constructed sentiment lexicon~\cite{Hatzivassi:97}.
This way, they attained an accuracy of~91\% on a set of 38 sentences
that had a perfect inter-annotator agreement.

%% Yet another multi-stage Na{\"i}ve Bayes model was proposed by
%% \citet{Pang:04}, who tried to classify the overall semantic
%% orientation of movie reviews (positive vs. negative) by first dividing
%% the sentences into subjective and objective ones (attaining 92\%
%% accuracy on this subtask) and then predicting the overall polarity of
%% a review using only subjective passages.  With this architecture, the
%% authors achieved a statistically significant improvement over the
%% baseline method (in which they traditionally predicted the polarity of
%% the text using all of its sentences), boosting the accuracy of SO
%% classification from~82.8 to~86.4\%.

In order to check the effectiveness of the Na\"{\i}ve Bayes approach,
\citet{Pang:02} compared the results of NB, MaxEnt, and SVM systems on
the movie review classification task, trying to predict whether a
review was perceived as thumbs up or thumbs down.  In contrast to the
previous works, they found that the SVM classifier worked best for
this objective, yielding 82.9\% accuracy when used with unigram
features only.  This conclusion paved the way for the subsequent
triumph of the support-vector approach, which dominated the whole
sentiment research field for almost a decade afterwards.  For example,
\citet{Gamon:04} also trained an SVM predictor using a set of
linguistic and surface-level features (including part-of-speech
trigrams, context-free phrase-structure patterns, and part-of-speech
information coupled with syntactic relations) to distinguish between
positive and negative customer feedback, achieving 77.5\% accuracy and
$\approx$0.77~\F{} by using only the top 2,000 attributes that had the
highest log-likelihood ratio with the target class.
% Interestingly enough, \citeauthor{Gamon:04} also could obtain quite
% competitive figures (74.5\% accuracy) by using linguistically
% motivated features only.
Furthermore, \citet{Pang:05} addressed the problem of multi-class
rating, attempting to predict the number of stars assigned to a
review.  For this purpose, they compared three different SVM types:
\begin{inparaenum}[(i)]
\item one-versus-all SVM (OVA-SVM),
\item SVM regression,
\item and OVA-SVM with metric labeling;
\end{inparaenum}
getting their best results ($\approx$52\%~accuracy) with the last
option.
% \citet{Pang:05}
% In the last approach, in addition to maximizing the score of the
% correct labels, the authors also explicitly encoded the objective of
% minimizing the absolute difference between the predicted labels of
% similar training examples (measuring this similarity with the
% percentage of positive sentences).  This strategy brought
% statistically significant improvements over the first two baselines,
% yielding an average accuracy of $\approx$52\%.
Finally, \citet{Ng:06} proposed a multi-stage SVM system, in which
they first classified whether the given text was a review or not and
then tried to predict its polarity.  Due to a better usage of
higher-order $n$-grams (where, instead of na\"{\i}vely considering all
token sequences up to length $n$ as new features, the authors took
only the 5,000 most useful ones), \citet{Ng:06} even improved the
state of the art on \citeauthor{Pang:04}'s corpus, boosting the
classification accuracy from~87.1 to~90.5\%.

% \todo[inline]{Feature Selection}
% \done[inline]{\citet{Li:10b}}

% \citet{Li:10b} addressed the problem of polarity shifting using
% machine-learning techniques.  For this purpose, the authors first
% selected most frequent and indicative features of the two main
% polarity classes (positive and negative), and then culled training
% instances containing these attributes whose labels, however, were
% different from the ones sugested by the features.  After obtaining
% this polarity shifted subset, \citeauthor{Li:10b} trained several
% linear support-vector classifiers, one of which had to distinguish
% between polarity-shifted and polarity-preserving sentences, the other
% two were to classify the semantic orientation of these two groups
% (\ie{} one system had to predict the polarity of shifted instances,
% and the other one had to determine the semantic orientation of
% polarity-preserving ones), and the last one was trained on the
% complete original dataset---the product review corpus
% of~\citet{Blitzer:06}---again to predict the polarity of complete
% sentences disregarding their possible polarity shifts.  The authors
% achieved their best results~(80,9\% average accuracy) using a
% combination of the last three systems with a special meta-classifier
% joining their single decisions.

% \done[inline]{\citet{Wiebe:05a}}

% A semi-supervised approach to sentence-level subjectivity prediction
% was proposed by~\citet{Wiebe:05a}.  Using an existing rule-based
% sentiment system, the authors classified a large set of unlabeled
% sentences from newspaper articles into subjective and objective ones,
% achieving 34.2\% recall and 90.4\% precision for the former class and
% getting 30.7\% recall and 82.4\% precision for the latter orientation.
% Afterwards, with the help of the AutoSlog-TS
% algorithm~\cite{Riloff:96}, they extracted frequent lexico-syntactic
% patterns that strongly correlated with the subjectivity of a sentence,
% and trained a Na{\"i}ve Bayes system, considering the occurrences of
% the learned patterns as features.  In an attempt to improve the
% results of this approach even further, \citeauthor{Wiebe:05a} also
% implemented a self-improvement strategy in which they relabeled the
% original unannotated dataset with the obtained NB classifier, then
% took one half of the most confidently classified sentences as a new
% training set, and extracted new subjectivity patterns, retraining the
% Na{\"i}ve Bayes module on the updated set of features.  With this
% final classifier, the authors attained an accuracy of 73.8\% on
% predicting the subjectivity of sentences in the MPQA
% corpus~\cite{Wiebe:05}, coming close to the results achieved by a
% fully supervised ML approach (76\%).

% \done[inline]{\citet{Riloff:06}}

% \citet{Riloff:06} addressed the problem of redundant features,
% hypothesizing that correlated attributes (such as unigram
% ``\emph{happy}'' and bigram ``\emph{very happy}'') would rather harm
% the performance of a classifier than improve its accuracy.  To prove
% this hypothesis, the authors defined two kinds of redundancy relations
% which could exist between intersecting input traits:
% \emph{representational} and \emph{behavioral} subsumption.  The former
% former type was assumed to hold between features~$A$ and $B$ if all
% occurrences of the attribute~$A$ were a strict superset of the
% occurrences of~$B$.  The behavioral subsumption meant that the
% information gain (IG) \cite{Forman:03} of the feture $B$ was at most
% some negligible value $\delta$ lower than the respective IG score of
% the attribute $A$.  In their experiments, \citeauthor{Riloff:06}
% observed that the accuracy of an SVM classifier indeed increased (from
% 81.7 to 82.7\% on the IMDBP dataset and from 74.4 to 74.9\% on the
% MPQA corpus) after they excluded all attributes that were
% representationally and behaviorally subsumed by other features.  These
% improvements even outperformed the gains that could be achieved with
% traditional feature selection methods.

% \todo[inline]{Higher-order $n$-grams}

% \done[inline]{\citet{Ng:06}}

% Further on, \citet{Ng:06} simultaneously addressed two classification
% problems: distinguishing whether a given text snippet was a review or
% not and determining the polarity of the review.  The authors attained
% impressive results (99.8\% accuracy) for the former task, using only
% SVM with unigram features.  Moreover, they also outperformed the then
% state of the art on the \citet{Pang:04}'s corpus, boosting the
% accuracy from~87.1 to~90.5\%.  These changes were mostly due to a
% smarter use of higher-order $n$-grams, where, instead of bluntly
% considering all token sequences up to the order $n$ as new features,
% \citet{Ng:06} only took 5,000 most useful ones, measuring their
% utility with the weightes log-likelihood ratio \cite{Nigam:00}.

% \done[inline]{\citet{Mejova:11}}

% \citet{Mejova:11} investigated the effect of different features on
% various datasets---the movie review corpus of~\citet{Pang:04}, the
% product reviews gathered by~\citet{Jindal:07}, and the customer
% feedback dataset of~\citet{Blitzer:06}, coming to the conclusion that,
% in general, preserving the original form of tokens (\ie{} keeping the
% original token forms instead of lemmas) and using their frequency
% scores instead of binary values was beneficial to the results on all
% test sets.  The use of different $n$-gram lengths, however, had a
% mixed effect with the best scores typically yielded by the union of
% uni-, bi-, and tri-gram features.  Last but not least, they found the
% negation heuristics proposed by~\citet{Das:01} (adding the
% \texttt{\_NOT} suffix to all tokens following a negation up to the
% first punctuation) leading to only marginal improvements.  The authors
% achieved their best results (87.5\%, 94.7\%, and 89.6\% accuracy on
% the datasets of \citet{Pang:04}, \citet{Jindal:07}, and
% \citet{Blitzer:06}, respectively) with the union of unnormalized uni-,
% bi-gram and tri-gram features without negation when using term
% frequencies as feature values.

% \done[inline]{\citet{Riloff:03a}}

% \citet{Riloff:03a} addressed the problem of inusfficient manually
% annotated sentiment resources by proposing a bootstrapping method for
% training a subjectivity classifier.  In this approach, the authors
% first applied two high-precision predictors to a large collection of
% unlabeled sentences in order to get an initial set of subjective and
% objective instances.  Afterwards, they used the AutoSlog-TS
% algorithm~\cite{Riloff:96} to extract expressions which strongly
% correlated with the subjective class and employed these phrases to
% classify the remaining sentences from the corpus.
% \citeauthor{Riloff:03a} repeated the last two steps (pattern
% extraction and expansion of the training set) multiple times to
% transitively learn new subjective phrases.  With the final system, the
% authors achieved a precision of 0.902 on recognizing polar sentences,
% with their recall running up to 0.401.

% \done[inline]{\citet{Wilson:04,Wilson:06}}

% A related problem, namely that of classifying the strength of
% opinions, was addressed by~\citet{Wilson:04,Wilson:06}.  In
% particular, the authors proposed a wide variety of linguistic
% features (including automatically learned lexico-syntactic patterns
% similar to the ones used by~\citet{Riloff:03}, bags of words, and
% syntactic attributes such as lemma and PoS tag of the root of a
% dependency tree, lemmas and tags of its intermediate nodes and
% leaves, lexicalized relation tuples, \ie{} tuples consisting of the
% lemma of a parent node, its grammatical relation to the child, and
% the lemma of the child itself, etc.), checking the utility of these
% attributes with three different classifiers:
% BoosTexter~\cite{Schapire:00}, Ripper~\cite{Cohen:95}, and
% SVMLight~\cite{Joachims:99}.  \citet{Wilson:04} achieved their best
% results (55\%~accuracy and 0.991~mean squared error) with the
% BoosTexter approach when using all the introduced features.

A real game changer in the MLSA research field, however, was the
introduction of the SemEval shared task on sentiment analysis in
Twitter~\cite{Nakov:13}.  Since its inaugural run
in~\citeyear{Nakov:13}, this competition has rapidly caught the
attention of the broader NLP community and has been rerun five times,
attracting more than 40 active participants every year.

It is not surprising that the first winning systems in this task
closely followed in the footsteps of the advances in general
opinion mining at that time.  For example, the two top-scoring
submissions in the initial iteration~\cite{Mohammad:13,Guenther:13}
both relied on the SVM algorithm: The first of these approaches, an
analyzer developed by~\citet{Mohammad:13}, was the absolute winner of
SemEval~2013, scoring an impressive 0.69 macro-averaged two-class
\F{} on the provided Twitter corpus.  The key to the success of this
method was an extensive set of linguistic features devised by the
authors, which included character and token $n$-grams, Brown
clusters~\cite{Brown:92}, statistics on part-of-speech tags,
punctuation marks, elongated words, etc.  But the most useful type of
attributes according to the feature ablation test turned out to be the
features that reflected information from various sentiment
lexicons.  In particular, depending on the type of the polarity list
from which such information was extracted, \citeauthor{Mohammad:13}
introduced two types of lexicon attributes: \emph{manual} and
\emph{automatic} ones.  The former group was computed with the help of
the NRC emotion lexicon~\cite{Mohammad:13a}, MPQA polarity
list~\cite{Wilson:05}, and Bing Liu's manually compiled polarity
set~\cite{Hu:04}.  For each of these resources and for each of the
non-neutral polarity classes (positive and negative), the authors
estimated the total sum of the lexicon scores for all message tokens
and also separately calculated these statistics for each particular
part-of-speech tag, considering them as additional attributes.
Automatic features were obtained using the Sentiment140 and Hashtag
Sentiment Base polarity lists~\cite{Kiritchenko:14}.  Again, for each
of these lexicons, for each of the two polarity classes, the authors
produced four features representing the number of tokens with non-zero
scores, the sum and the maximum of all respective lexicon values for
all words, and the score of the last term in the tweet.  These two
feature groups (manual and automatic lexicon attributes) improved the
macro-averaged \F{}$^{+/-}$-score by almost five percent,
outperforming in this regard all other traits.
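
The four automatic lexicon features described above can be written
compactly as follows.  The notation is ours and serves only as an
illustrative sketch: $w_1, \ldots, w_n$ denote the tokens of a tweet,
and $s_{\ell}^{c}(w)$ is an assumed score that lexicon~$\ell$ assigns
to token~$w$ for the polarity class~$c \in \{+, -\}$ (zero if $w$ is
not listed for this class):
\[
  \begin{array}{rcl}
    f_{\mathrm{count}}^{\ell,c} & = &
      \left|\left\{\, i : s_{\ell}^{c}(w_i) \neq 0 \,\right\}\right|,\\[0.4em]
    f_{\mathrm{sum}}^{\ell,c} & = &
      \displaystyle\sum_{i=1}^{n} s_{\ell}^{c}(w_i),\\[0.9em]
    f_{\mathrm{max}}^{\ell,c} & = &
      \displaystyle\max_{1 \leq i \leq n} s_{\ell}^{c}(w_i),\\[0.6em]
    f_{\mathrm{last}}^{\ell,c} & = & s_{\ell}^{c}(w_n).
  \end{array}
\]
Under this reading, two automatic lexicons and two polarity classes
yield $2 \times 2 \times 4 = 16$ such features per tweet.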

Another notable submission, the system of~\citet{Guenther:13}, also
relied on a linear SVM predictor with a rich set of features.
Like~\citet{Mohammad:13}, the authors used original and lemmatized
unigrams, word clusters, and lexicon features.  But in contrast to the
previous approach, this application utilized only one polarity
list---that of~\citet{Esuli:05}.  Partially due to this fact,
\citeauthor{Guenther:13} found that word clusters worked best among
all features.  This method also yielded competitive results
(0.653~\F{}$^{+/-}$) on the message-level polarity task, taking
second place that year.

Later on, \citet{Guenther:14} further improved their results (from
0.653 to 0.691 two-class \F) by extending the original system with a
Twitter-aware tokenizer~\cite{Owoputi:13}, a spelling normalization
module, and a significantly larger set of lexicon-based features.
In particular, instead of simply relying on
\textsc{SentiWordNet}~\cite{Esuli:05}, \citeauthor{Guenther:14}
applied a whole ensemble of various polarity lists including Liu's
opinion lexicon, MPQA subjectivity list, and TwittrAttr polarity
resource.  As mentioned by the authors, the last change was of
particular use to the classification accuracy, improving the
macro-\F{}$^{+/-}$ by almost four percent.

An even better score on this task could be attained with the approach
of~\citet{Miura:14}, who also utilized a supervised ML classifier with
character and word $n$-grams, word clusters, disambiguated senses, and
lexicon scores of message tokens as features.  Similarly to the
systems of~\citet{Mohammad:13} and \citet{Guenther:14}, the authors
made heavy use of various kinds of polarity lists including
AFINN-111~\cite{Nielsen:11}, Liu's Opinion Lexicon~\cite{Hu:04},
General Inquirer~\cite{Stone:66}, MPQA Polarity List~\cite{Wiebe:05a},
NRC Hashtag and Sentiment140 Lexicon~\cite{Mohammad:13}, as well as
\textsc{SentiWordNet}~\cite{Esuli:06a}, additionally applying a whole
set of preprocessing steps such as spelling correction, part-of-speech
tagging with lemmatization, and a special weighting scheme for
underrepresented classes.  Due to these enhancements, combined with a
carefully tuned LogLinear classifier, \citet{Miura:14} were able to
boost the sentiment classification results on the SemEval~2014 test
set to~0.71~\F{}$^{+/-}$.

In order to see how this family of methods would perform on our
Twitter corpora, we have reimplemented the approaches of
\citet{Gamon:04}, \citet{Mohammad:13}, and \citet{Guenther:14} with
the following modifications: In the system of~\citet{Gamon:04}, we
used the available dependency analyses from the
\texttt{MateParser}~\cite{Bohnet:09} instead of constituency trees,
considering each node of the dependency tree as a syntactic
constituent and regarding the two-tuple
(\texttt{dependency-link-to-the-parent}, \texttt{node's-PoS-\-tag}) as
the name of that constituent (for example, a finite verb at the root
of the tree was mapped to the constituent (\texttt{--},
\texttt{VVFIN}), where \texttt{--} is the name of the root relation).
Furthermore, because the Brown clusters were not available for German,
we had to remove this attribute altogether from the feature sets of
\citeauthor{Mohammad:13}'s and \citeauthor{Guenther:14}'s methods.
Moreover, because the former system relied on two types of lexicon
attributes---manual and automatic ones, we used two polarity lists for
these approaches: the Zurich Sentiment Lexicon of~\citet{Clematide:10}
as a manual resource and our Linear Projection Lexicon, which was
introduced in Chapter~\ref{chap:snt:lex}, as an automatically
generated polarity list.  All remaining attributes and training
specifics were kept maximally close to their original descriptions.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first columm
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\

       % Gamon Commands:
       % ------------------
       % cgsa_sentiment train -t gamon -g -l cgsa/data/lexicons/zrch.manual.txt
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt
       % data/PotTS/preprocessed/{train,dev}/*.tsv
       %
       % Best parameters for gamon: {'clf__C': 0.01}
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/PotTS/preprocessed/test/*.tsv > \
       % data/PotTS/preprocessed/predicted/gamon/gamon.zrch.linproj.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/gamon/gamon.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.67      0.73      0.70       680
       % negative       0.35      0.15      0.21       287
       % neutral       0.60      0.72      0.66       558
       % avg / total       0.59      0.62      0.59      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 45.30%
       % Micro-Averaged F1-Score (All Classes): 61.7049%

       GMN & 0.67 & 0.73 & 0.70 & %
       0.35 & 0.15 & 0.21 & %
       0.60 & 0.72 & 0.66 & %
       0.453 & 0.617\\

       % Mohammad Commands:
       % ------------------
       % cgsa_sentiment train  -t mohammad -g -l cgsa/data/lexicons/zrch.manual.txt \
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt \
       % data/PotTS/preprocessed/{train,dev}/*.tsv
       %
       % Best C: 0.01
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/PotTS/preprocessed/test/*.tsv > \
       % data/PotTS/preprocessed/predicted/mohammad/mohammad.zrch.linproj.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/mohammad/mohammad.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.79      0.77      0.78       680
       % negative       0.58      0.56      0.57       287
       % neutral       0.73      0.76      0.74       558
       % avg / total       0.73      0.73      0.73      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 67.41%
       % Micro-Averaged F1-Score (All Classes): 72.7213%
       MHM & \textbf{0.79} & 0.77 & \textbf{0.78} & %
       \textbf{0.58} & \textbf{0.56} & \textbf{0.57} & %
       \textbf{0.73} & \textbf{0.76} & \textbf{0.74} & %
       \textbf{0.674} & \textbf{0.727}\\

       % Guenther Commands:
       % ------------------
       % cgsa_sentiment train -t guenther -g -l cgsa/data/lexicons/zrch.manual.txt
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt
       % data/PotTS/preprocessed/{train,dev}/*.tsv
       %
       % Best parameters for guenther: l1_ratio: 0.01, clf__alpha: 0.0001
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/PotTS/preprocessed/test/*.tsv > \
       % data/PotTS/preprocessed/predicted/guenther/guenther.zrch.linproj.test
       %
       % cgsa_evaluate data/PotTS/preprocessed/test/ \
       % data/PotTS/preprocessed/predicted/guenther/guenther.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.71      0.80      0.75       680
       % negative       0.55      0.45      0.50       287
       % neutral       0.68      0.63      0.65       558
       % avg / total       0.67      0.67      0.67      1525
       % Macro-Averaged F1-Score (Positive and Negative Classes): 62.41%
       % Micro-Averaged F1-Score (All Classes): 67.3443%

       GNT & 0.71 & \textbf{0.80} & 0.75 & %
       0.55 & 0.45 & 0.50 & %
       0.68 & 0.63 & 0.65 & %
       0.624 & 0.673\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\

       % Gamon Commands:
       % ------------------
       % cgsa_sentiment train -t gamon -g -l cgsa/data/lexicons/zrch.manual.txt
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt
       % data/SB10k/preprocessed/{train,dev}/*.tsv
       %
       % Best parameters for gamon: {'clf__C': 0.01}
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/SB10k/preprocessed/test/*.tsv > \
       % data/SB10k/preprocessed/predicted/gamon/gamon.zrch.linproj.test
       %
       % cgsa_evaluate data/SB10k/preprocessed/test/ \
       % data/SB10k/preprocessed/predicted/gamon/gamon.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.65      0.45      0.53       354
       % negative       0.38      0.08      0.13       212
       % neutral       0.72      0.93      0.81       930
       % avg / total       0.65      0.70      0.65      1496
       % Macro-Averaged F1-Score (Positive and Negative Classes): 32.88%
       % Micro-Averaged F1-Score (All Classes): 69.8529%

      GMN & 0.65 & 0.45 & 0.53 & %
       0.38 & 0.08 & 0.13 & %
       0.72 & \textbf{0.93} & 0.81 & %
       0.329 & 0.699\\

       % Mohammad Commands:
       % ------------------
       % cgsa_sentiment train  -t mohammad -g -l cgsa/data/lexicons/zrch.manual.txt \
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt \
       % data/SB10k/preprocessed/{train,dev}/*.tsv
       %
       % Best C: 0.01
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/SB10k/preprocessed/test/*.tsv > \
       % data/SB10k/preprocessed/predicted/mohammad/mohammad.zrch.linproj.test
       %
       % cgsa_evaluate data/SB10k/preprocessed/test/ \
       % data/SB10k/preprocessed/predicted/mohammad/mohammad.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.71      0.65      0.68       354
       % negative       0.51      0.40      0.45       212
       % neutral       0.80      0.87      0.84       930
       % avg / total       0.74      0.75      0.74      1496
       % Macro-Averaged F1-Score (Positive and Negative Classes): 56.36%
       % Micro-Averaged F1-Score (All Classes): 75.2005%

       MHM & \textbf{0.71} & \textbf{0.65} & \textbf{0.68} & %
        \textbf{0.51} & \textbf{0.4} & \textbf{0.45} & %
        \textbf{0.8} & 0.87 & \textbf{0.84} & %
        \textbf{0.564} & \textbf{0.752}\\

       % Guenther Commands:
       % ------------------
       % cgsa_sentiment train -t guenther -g -l cgsa/data/lexicons/zrch.manual.txt
       % -l cgsa/data/lexicons/linproj.word2vec.kim_hovy_seedset.auto.txt
       % data/SB10k/preprocessed/{train,dev}/*.tsv
       %
       % Best parameters for guenther: l1_ratio: 0.01, clf__alpha: 0.0001
       %
       % cgsa_sentiment test  -m cgsa/data/models/cgsa.model \
       % data/SB10k/preprocessed/test/*.tsv > \
       % data/SB10k/preprocessed/predicted/guenther/guenther.zrch.linproj.test
       %
       % cgsa_evaluate data/SB10k/preprocessed/test/ \
       % data/SB10k/preprocessed/predicted/guenther/guenther.zrch.linproj.test
       %
       % General Statistics:
       % precision    recall  f1-score   support
       % positive       0.67      0.62      0.64       354
       % negative       0.44      0.28      0.34       212
       % neutral       0.78      0.87      0.82       930
       % avg / total       0.70      0.72      0.71      1496
       % Macro-Averaged F1-Score (Positive and Negative Classes): 49.11%
       % Micro-Averaged F1-Score (All Classes): 72.3930%

       GNT & 0.67 & 0.62 & 0.64 & %
       0.44 & 0.28 & 0.34 & %
       0.78 & 0.87 & 0.82 & %
       0.491 & 0.724\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of ML-based MLSA methods]{ Results of
      machine-learning--based MLSA methods\\ {\small
        GMN~--~\citet{Gamon:04}, MHM~--~\citet{Mohammad:13},
        GNT~--~\citet{Guenther:14}}}\label{snt-cgsa:tbl:ml-res}
  \end{center}
\end{table}

The results of our reimplementations are shown in
Table~\ref{snt-cgsa:tbl:ml-res}.  As we can see from the scores, the
system of~\citet{Mohammad:13} clearly dominates its competitors on
both corpora. This holds for all presented metrics except for the
recall of positive tweets on the PotTS dataset and neutral messages on
the SB10k data, where it is outperformed by the analyzers of
\citet{Guenther:14} and \citet{Gamon:04} respectively.  In any other
respect, however, the results of the MHM classifier are notably higher
than those of the GNT method, sometimes surpassing it by up to 12 percentage points
(this is, for instance, the case for the recall of negative microblogs
on the SB10k corpus).  This margin becomes even larger if we compare
the scores of Mohammad's system with the performance of
\citeauthor{Gamon:04}'s predictor, which is by far the weakest ML
method in this survey.  This weakness, however, is hardly surprising
given that \citeauthor{Gamon:04}'s approach is purely
grammar-based and relies only on information about part-of-speech tags
and constituency parses without any lexicon traits or even plain
$n$-gram features.  Partially due to these limited input attributes,
the results of this analyzer are even worse than the average scores of
lexicon-based methods.
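
As a sanity check, the aggregate scores in
Table~\ref{snt-cgsa:tbl:ml-res} can be approximately recovered from
the per-class figures.  The following sketch does so for the MHM row
on SB10k.  It uses the rounded table values and assumes the usual
definition of the $F_1$-score as the harmonic mean of precision and
recall, so the last digit may differ from the exact output of the
evaluation script:
\begin{align*}
  F_1^{+} &= \frac{2 \cdot 0.71 \cdot 0.65}{0.71 + 0.65} \approx 0.68, &
  F_1^{-} &= \frac{2 \cdot 0.51 \cdot 0.40}{0.51 + 0.40} \approx 0.45,\\
  \text{Macro-}F_1^{+/-} &= \frac{F_1^{+} + F_1^{-}}{2}
                           \approx \frac{0.68 + 0.45}{2} = 0.565.
\end{align*}
This agrees with the reported value of 0.564 up to rounding.
Likewise, if we assume that the micro-averaged score over all three
classes coincides with plain accuracy (as is the case for single-label
classification), we can weight the per-class recall by the class sizes
of the SB10k test set (354 positive, 212 negative, and 930 neutral
messages):
\begin{align*}
  \text{Micro-}F_1 &\approx \frac{0.65 \cdot 354 + 0.40 \cdot 212 + 0.87 \cdot 930}{1496}
                     = \frac{1124.0}{1496} \approx 0.751,
\end{align*}
which is again close to the reported 0.752.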

% A different approach to the message-level sentiment-analysis task was
% proposed by~\citet{Hagen:15}, who, instead of developing their own
% method from scratch, united four already existing solutions
% (\citet{Mohammad:13}, \citet{Guenther:13}, \citet{Proisl:13}, and
% \citet{Miura:14}) into a single ensemble, taking the average of the
% predicted scores as the final decision of the complete system.  This
% way, the authors achieved 0.648~\F{} on the SemEval-2015 test set,
% attaining first place among 40 participants.

% \done[inline]{\citet{Hamdan:15}}

% Another supervised machine-learning system was proposed
% by~\citet{Hamdan:15}, who also used an extensive set of features such
% as word $n$-grams, negation existence, sentiment lexicons and
% $Z$-score, which reflected the strength of statistical correlation
% between a given term $t$ and the target class $c$ in the distribution
% \cite[cf.][]{Hamdan:14}.  Similarly to~\citet{Mohammad:13}, the
% authors used an extensive set of lexicon features, and also applied
% attribute and class weighting as it was done by~\citet{Guenther:14}.
% This way, \citeauthor{Hamdan:15} achieved 0.643~\F on the SemEval-2015
% test get, getting third place among all competitors.

% One of the first attempts to analyze message-level sentiments on
% Twitter was made by \citet{Go:09}.  For their experiments, the authors
% collected a set of 1,600,000 tweets containing smileys.  Based on
% these emoticons, they automatically derived polarity classes for these
% messages (positive or negative) and used them to train a Na\"{\i}ve
% Bayes, MaxEnt, and SVM classifier.  The best $F$-score for this
% two-class classification problem could be achieved by the last system
% and run up to 82.2\%.

% Similar work was also done by \citet{Pak:10} who used the Na\"{\i}ve
% Bayes approach to differentiate between neutral, positive, and
% negative microblogs; and \citet{Barbosa:10} who gathered a collection
% of 200,000 tweets, subsequently analyzing them with three publicly
% available sentiment web-services and training an SVM classifier on the
% results of these predictors.  In a similar way, \citet{Agarwal:11}
% compared a simple unigram-based SVM approach with two other
% full-fledged systems, one which relied on a rich set of manually
% defined features, and another used partial tree
% kernels~\cite{Moschitti:06}.  The authors evaluated these methods on a
% commercially acquired corpus of 8,753 foreign-language tweets, which
% were automatically translated into English, finding that a combination
% of these methods worked best for both two- and three-way prediction
% tasks.

% The state-of-the-art results for message level polarity prediction on
% tweets were established by~\citet{Mohammad:13}, whose system (a
% supervised SVM classifier) used a rich set of various features
% including word and character n-grams, PoS statistics, Brown
% clusters~\cite{Brown:92}, etc., and also strongly benefitted from
% automatic corpus-based polarity lists---Sentiment~140 and NRC
% Hashtag~\cite{Mohammad:12,Kiritchenko:14}.  This approach ranked first
% at the SemEval competition~2013~\cite{Nakov:13} and anchieved the
% fourth place on the rerun of this task one year
% later~\cite{Rosenthal:14}, being outperformed by the supervised
% logistic regression approach of~\citet{Miura:14}, who used a heavy
% preprocessing of the data and a special balancing scheme for
% underrepresented classes.  Later on, these results were further
% improved by the apporaches of~\citet{Hagen:15} and \citet{Deriu:16},
% which both relied on ensembles of multiple independent classifiers.

\subsection{Feature Analysis}\label{subsec:cgsa:ml-methods:feature-analysis}

Because input features appeared to play a crucial role for the success
of ML-based systems, we decided to investigate the impact of this
factor in more detail and performed an ablation test for each of the
tested classifiers, removing one of their feature groups at a time and
recomputing their scores.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.2\columnwidth} % first columm
        *{6}{>{\centering\arraybackslash}p{0.13097\columnwidth}}}
      \toprule
      \multirow{2}{0.145\columnwidth}{%
      \bfseries Features} & %
      \multicolumn{6}{c}{\bfseries System Scores}\\
      & \multicolumn{2}{c}{\bfseries GMN} & \multicolumn{2}{c}{\bfseries MHM} %
      & \multicolumn{2}{c}{\bfseries GNT}\\%
      \cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}

      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{}\\\midrule

      \multicolumn{7}{c}{\cellcolor{cellcolor}PotTS}\\
      All & \textbf{0.453} & \textbf{0.617} & \textbf{0.674} & 0.727 & \textbf{0.624} & 0.673\\
      --Constituents & 0.388 & 0.545 & \NA{} & \NA{} & \NA{} & \NA{}\\
      --PoS Tags & 0.417 & 0.607 & 0.669 & 0.721 & \NA{} & \NA{}\\
      --Character Features & \NA{} & \NA{} & 0.671 & \textbf{0.734} & \NA{} & \NA{}\\
      --Token Features & \NA{} & \NA{} & 0.659 & 0.704 & 0.0 & 0.366\\
      --Automatic Lexicons & \NA{} & \NA{} & 0.667 & 0.717 & 0.613 & 0.666\\
      --Manual Lexicons & \NA{} & \NA{} & 0.665 & 0.715 & 0.617 & \textbf{0.675}\\

      \multicolumn{7}{c}{\cellcolor{cellcolor}SB10k}\\
      All & \textbf{0.329} & 0.699 & 0.564 & 0.752 & 0.491 & 0.724\\
      --Constituents & 0.127 & 0.646 & \NA{} & \NA{} & \NA{} & \NA{}\\
      --PoS Tags & 0.301 & \textbf{0.7} & \textbf{0.57} & \textbf{0.757} & \NA{} & \NA{}\\
      --Character Features & \NA{} & \NA{} & 0.546 & 0.753 & \NA{} & \NA{}\\
      --Token Features & \NA{} & \NA{} & 0.559 & 0.741 & 0.046 & 0.62\\
      --Automatic Lexicons & \NA{} & \NA{} & 0.54 & 0.753 & \textbf{0.517} & 0.735\\
      --Manual Lexicons & \NA{} & \NA{} & 0.553 & 0.751 & 0.51 & \textbf{0.739}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Feature-ablation test of ML-based MLSA methods]{
      Results of the feature-ablation test for ML-based MLSA methods}\label{snt-cgsa:tbl:ml-res-ablation}
  \end{center}
\end{table}

As we can see from the results in
Table~\ref{snt-cgsa:tbl:ml-res-ablation}, the approach
of~\citet{Gamon:04} typically achieves its best performance when all
of the input attributes (PoS tags and syntactic constituents) are
active.  This is for example the case for the micro- and
macro-averaged \F{} on the PotTS corpus, and also holds for the
two-class macro-\F{} on the SB10k data.  The only exception to this
tendency is the micro-averaged \F{}-score on the latter dataset, which
shows a slight improvement (from 0.699 to 0.7) after the removal of
part-of-speech features.

Similarly, the analyzer of~\citet{Mohammad:13} benefits only
marginally, if at all, from the part-of-speech attributes: Although
their removal lowers the macro- and micro-averaged scores on PotTS by
0.005 and 0.006 respectively, it raises the same two metrics on SB10k
by 0.006 and 0.005~\F{}.  One possible explanation for this weak and
inconsistent contribution could be the differences in the utilized PoS
taggers and tagsets: Whereas the original classifier of
\citeauthor{Mohammad:13} relied on a special Twitter-aware
tagger~\cite{Owoputi:13}, whose tags were explicitly adjusted to the
peculiarities of social media texts (including special labels for
@-mentions and \#hashtags), we instead used the output of the standard
\textsc{TreeTagger}~\cite{Schmid:95}, which, apart from lacking any
Twitter-specific information, was also trained on a completely
different text genre (newspaper articles) and therefore a priori
produced unreliable output.  As a consequence, part-of-speech
information is of limited use for this system and is even slightly
harmful on the SB10k corpus.  A better alternative in this regard
could be the Twitter-specific tagger for German developed
by~\citet{Rehbein:13}.  We could not, however, find a publicly
available version of this tagger, and, moreover, its usage would
preclude the following \textsc{Mate} analysis due to the difference in
the tagsets.

An even more puzzling situation is observed with the classifier
of~\citet{Guenther:14}. Although this system lacks any part-of-speech
attributes, its reaction to the deletion of other features (first of
all token and lexicon traits) is quite unexpected.  For example, the
macro-averaged \F{}-scores on both corpora drop almost to zero when
the information about tokens is excluded.  On the other hand, the
deactivation of manual lexicons surprisingly improves the
micro-averaged results on both datasets and also increases the
macro-\F{}$^{+/-}$ on the SB10k data.  We also notice a similar
(though less pronounced) trend with automatic lexicons: the ablation
of these features lowers the scores on PotTS, but improves both
results on SB10k.  We can partially explain this negative effect of
polarity lists by the coarseness of lexicon features: This classifier
uses only binary attributes, which reflect whether the given tweet has
more positive or more negative lexicon items, but it does not
distinguish between the scores or intensities of these terms.
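
To make this coarseness concrete, such a binary attribute can be
sketched as follows.  The notation is ours and reflects a simplified
reading of the description above rather than the exact feature
definition of the original system: Let $n^{+}(x)$ and $n^{-}(x)$
denote the number of positive and negative lexicon entries found in
tweet $x$; then
\begin{align*}
  f^{+}(x) &= \begin{cases}
    1 & \text{if } n^{+}(x) > n^{-}(x),\\
    0 & \text{otherwise;}
  \end{cases} &
  f^{-}(x) &= \begin{cases}
    1 & \text{if } n^{-}(x) > n^{+}(x),\\
    0 & \text{otherwise.}
  \end{cases}
\end{align*}
Under this encoding, a message with a single mildly positive term and
a message with five strongly positive terms receive exactly the same
feature values ($f^{+} = 1$, $f^{-} = 0$), so that any information
about the number or strength of the polar terms is lost.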

Besides analyzing the utility of each particular feature group, we
also decided to have a look at the top-10 most relevant attributes
learned by each system.  The summarized overview in
Table~\ref{fgsa:tbl:ml:to10-features} partially confirms our previous
findings: For example, the most useful traits for the analyzer
of~\citet{Gamon:04} are attributes reflecting the information about
both constituents and part-of-speech tags, with five of its ten
entries featuring the interjection tag, which appears to be especially
important for predicting the positive class.  On the other hand, the
system of~\citet{Mohammad:13} seems to rely more on token and
character $n$-grams, as nine out of ten attributes belong to either of
these two categories. The only outlier in this respect is the
\texttt{Last\%QMarkCnt} attribute (line 2), which denotes the presence
of a question mark and is apparently a good clue of neutral
microblogs.  Finally, the classifier of~\citet{Guenther:14} almost
exclusively prefers lexical $n$-grams, as it has nine unigrams and one
bigram among its top-ten entries.

\begin{table}[hbt]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.47\tabcolsep}\scriptsize
    \begin{tabular}{>{\centering\arraybackslash}p{0.05\columnwidth} % first columm
        *{9}{>{\centering\arraybackslash}p{0.092\columnwidth}}} % next four columns
      \toprule
      \multirow{2}{0.05\columnwidth}{Rank} & \multicolumn{3}{c}{\bfseries GMN} & %
                      \multicolumn{3}{c}{\bfseries MHM} & %
                      \multicolumn{3}{c}{\bfseries GNT}\\\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
      & Feature & Label & Weight & Feature & Label & Weight %
      & Feature & Label & Weight\\\midrule
          1 & NK-ITJ| & POS & 0.457 & * & NEUT & 0.131 & hate & NEG & 1.86 \\
          2 & DM-ITJ| & POS & 0.334 & Last\-\%QMark\-Cnt & NEUT & 0.088 & sick & NEG & 1.7\\
          3 & V-DM-I & POS & 0.244 & s-c & NEG & 0.079 & kahretsinn & NEG & 1.69\\
          4 & N-NK-I & POS & 0.24 & *-\%possmiley & POS & 0.067 & dasisaberschade & NEG & 1.69\\
          5 & MO-ITJ| & POS & 0.211 & c-h-e-i-s & NEG & 0.064 & Anziehen & POS & 1.67\\
          6 & A-DM-I & POS & 0.196 & h-a-h & POS & 0.064 & \textbackslash{}x016434 & POS & 1.65\\
          7 & A-MO-I & POS & 0.191 & t-\textvisiblespace{}-. & NEG & 0.064 & p\"archenabend & POS & 1.65\\
          8 & NK-ITJ & POS & 0.165 & geil & POS & 0.062 & derien\heart\heart{} & POS & 1.65\\
          9 & NK-\$. & NEUT & 0.16 & *-? & NEUT & 0.062 & sch\"on-nicht & POS & 1.56\\
          10 & DM-ITJ & POS & 0.157 & ? & NEUT & 0.061 & applause & POS & 1.5\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Top-10 features learned by MLSA classifiers]{Top-10
      features learned by ML-based MLSA methods\\{\small (sorted by
        the absolute values of their weights)}}\label{fgsa:tbl:ml:to10-features}
  \end{center}
\end{table}

\subsection{Classifiers}\label{subsec:cgsa:ml-methods:classifiers-analysis}

Another important factor that could significantly affect the quality
of ML-based approaches was the underlying classification method, which
was used to optimize the feature weights and make the final
predictions.  Although most of the previous studies agree on the
superior performance of support vector machines for this
task~\cite[see ][]{Pang:02,Gamon:04,Mohammad:13}, we decided to
question these conclusions as well and reran our experiments,
replacing the linear SVC predictor with the Na\"{\i}ve Bayes and
Logistic Regression algorithms.

Somewhat surprisingly, these changes indeed resulted in an
improvement, especially in the case of the logistic classifier, which
yielded the best macro- and micro-averaged scores for the systems of
\citet{Mohammad:13} and \citet{Guenther:14} on the PotTS corpus (see
Table~\ref{snt-cgsa:tbl:ml-res-classifiers}) and also produced the
highest micro-\F{} results for these two approaches on the SB10k
dataset.  Nevertheless, the SVM algorithm still remains a competitive
option, in particular for the feature-sparse method
of~\citet{Gamon:04}, but also with respect to the macro-\F{} of
Mohammad's and G\"unther's analyzers.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.2\columnwidth} % first columm
        *{6}{>{\centering\arraybackslash}p{0.13097\columnwidth}}}
      \toprule
      \multirow{2}{0.15\columnwidth}{%
      \bfseries Classifier} & %
      \multicolumn{6}{c}{\bfseries System Scores}\\
      & \multicolumn{2}{c}{\bfseries GMN} & \multicolumn{2}{c}{\bfseries MHM} %
      & \multicolumn{2}{c}{\bfseries GNT}\\%
      \cmidrule(lr){2-3}\cmidrule(lr){4-5}\cmidrule(lr){6-7}

      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{} %
      & Macro\newline \F{}$^{+/-}$ & Micro\newline \F{}\\\midrule

      \multicolumn{7}{c}{\cellcolor{cellcolor}PotTS}\\
      SVM & \textbf{0.453} & \textbf{0.617} & 0.674 & 0.727 & \textbf{0.624} & 0.673\\
      Na\"{\i}ve Bayes & 0.432 & 0.577 & 0.635 & 0.675 & 0.567 & 0.59\\
      Logistic Regression & 0.431 & 0.612 & \textbf{0.677} & \textbf{0.741} & \textbf{0.624} & \textbf{0.688}\\

      \multicolumn{7}{c}{\cellcolor{cellcolor}SB10k}\\
      SVM & 0.329 & \textbf{0.699} & \textbf{0.564} & 0.752 & 0.491 & 0.724\\
      Na\"{\i}ve Bayes & \textbf{0.351} & 0.637 & 0.516 & 0.755 & 0.453 & 0.675\\
      Logistic Regression & 0.309 & 0.693 & 0.553 & \textbf{0.772} & \textbf{0.512} & \textbf{0.75}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption{
      Results of ML-based MLSA methods with different classifiers}\label{snt-cgsa:tbl:ml-res-classifiers}
  \end{center}
\end{table}

Even though our results contradict previous claims in the literature,
we would advise against premature conclusions at this point and stress
the fact that different classifiers might have fairly varying results
on different datasets.  Therefore, higher scores of the logistic
regression on our corpora do not preclude better SVM results on the
official SemEval data.

\subsection{Error Analysis}\label{subsec:cgsa:ml-methods:err-analysis}

As in our previous experiments, we also decided to have a closer look
at errors produced by each tested system.  For this purpose, we again
collected misclassifications that were unique to only one of the
classifiers, and provide some examples of these errors below.

The first wrong result shown in
Example~\ref{snt:cgsa:exmp:gamon-error} was produced by the system
of~\citet{Gamon:04}.
\begin{example}[An Error Made by the System of~\citeauthor{Gamon:04}]
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape Das ist das zynische. \"Uber Themen labern, Leute schlecht machen. Wenn nicht der Papst damit Thema w\"are, kein Wort. Ich hasse das.}\\
  \noindent That's the cynical part.  Babbling about topics, talking
  people down.  If the Pope weren't the topic, not a word.  I hate
  this.\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}\label{snt:cgsa:exmp:gamon-error}
\end{example}
\noindent{} In this case, the classifier incorrectly assigned the
positive label to a clearly negative microblog despite the presence of
multiple negatively connoted terms (``zynische'' [\emph{cynical}],
``labern'' [\emph{to babble}], ``schlecht machen'' [\emph{to talk
    down}], and ``hasse'' [\emph{to hate}]).  The reason for this
decision is quite simple: As we already noted in the foregoing
description, this method is completely unlexicalized and relies only
on grammatical information while making its predictions.  In
particular, for this microblog, the top-5 most important features
(ranked by the absolute values of their coefficients) are:
\begin{enumerate}
\item \texttt{PD-ADJA} (\textcolor{black}{neutral}): -0.62896911412,
\item \texttt{---VVINF} (\textcolor{midnightblue}{negative}): 0.517300341184,
\item \texttt{PD-ADJA} (\textcolor{green3}{positive}): 0.505413668274,
\item \texttt{---VVINF} (\textcolor{green3}{positive}): -0.346990702756,
\item \texttt{CJ-VVINF} (\textcolor{green3}{positive}):
  0.303311030403.
\end{enumerate}
As we can see, none of these attributes reflects any information about
the lexical terms appearing in the message, and the system simply
prefers the positive class based on the presence of a predicate
adjective (\texttt{PD-ADJA}) and coordinately conjoined infinitive
(\texttt{CJ-VVINF}).

Another error shown in Example~\ref{snt:cgsa:exmp:mohammad-error} was
made by the system of~\citet{Mohammad:13}.  This time, a positive
tweet was misclassified as neutral.  But the reason for this erroneous
decision is completely different.  As we can see from the list of the
highest ranked features given below:
\begin{enumerate}
\item \texttt{*} (\textcolor{black}{neutral}): 0.131225868029,
\item \texttt{*} (\textcolor{midnightblue}{negative}): -0.0840804221845,
\item \texttt{\%PoS-CARD} (\textcolor{black}{neutral}): 0.0833658576233,
\item \texttt{\%PoS-ADJD} (\textcolor{black}{neutral}): -0.069745190018,
\item \texttt{t-\textvisiblespace{}-n} (\textcolor{green3}{positive}): 0.0556721202587;
\end{enumerate}
this analyzer makes its decision based on rather general, but
heavily weighted features, such as the placeholder token
\texttt{*} or the PoS-tag features (\texttt{\%PoS-CARD} and
\texttt{\%PoS-ADJD}).  As a result, its prediction succumbs to the
neutral bias of these general attributes.

\begin{example}[An Error Made by the System of~\citeauthor{Mohammad:13}]\label{snt:cgsa:exmp:mohammad-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape das
    klingt richtig gut! Was f\"ur eine hast du denn? (uvu) \%PosSmiley3}\\
  \noindent It sounds really great.  Which one do you have? (uvu) \%PosSmiley3\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

Finally, Example~\ref{snt:cgsa:exmp:guenther-error} shows
another wrong decision, where a negative microblog was incorrectly
analyzed as positive by the method of~\citet{Guenther:14}, even though
the polar term ``borniert'' (\emph{narrow-minded}) was present in both
utilized sentiment lexicons (ZPL and Linear Projection) as a negative
item.  This again can be explained by the prevalence of general
features (\eg{} \texttt{8}, \texttt{nicht-nur\_NEG},
\texttt{nur\_NEG}, etc.) and their strong bias towards the majority
class in the PotTS dataset.

\begin{example}[An Error Made by the System of~\citeauthor{Guenther:14}]\label{snt:cgsa:exmp:guenther-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape Den CDU-W\"ahlern traue ich durchaus zu der FDP 8 bis 9\% zu bescheren! Die sind so borniert, nicht nur in Niedersachsen!}\\
  \noindent I don't put giving 8 to 9\% to the FDP past the CDU-voters!  They are so narrow-minded, not only in Lower Saxony!\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}
\end{example}

%% As the final remark, we should note that albeit the neutral bias is a
%% characteristic type of mistakes for the classifiers
%% of~\citet{Mohammad:13} and \citet{Guenther:14}, in general, these
%% systems perform fairly well, and most of their misclassifications are
%% quite arguable cases, where even a human expert would doubt about the
%% correct prediction.

\section{Deep-Learning Methods}\label{sec:cgsa:dl-based}

Even though traditional ML-based approaches still show competitive
results and play an important role in the sentiment analysis of social
media, they are gradually giving way to allegedly more powerful and
in a certain sense more intuitive deep learning (DL) methods.  As we
already mentioned in the previous chapter, in contrast to the standard
supervised techniques with human-engineered features, DL systems
induce the best feature representation completely automatically, and
in some cases might produce even better features than the ones devised
by human experts.  Another important advantage of this paradigm is its
more straightforward way to implement the ``compositionality'' of
language \cite{Frege:1892}: Whereas conventional classifiers usually
consider each instance as a \emph{bag of features} and predict its
label based on the sum of these features' values multiplied by their
respective weights, DL approaches try to \emph{combine} the
representation of each part of that instance (be it tokens or
sentences) into a single whole and then deduce the final class from
this joint embedding.
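
Schematically, the two paradigms can be contrasted as follows, where
the symbols $f_k$, $\theta_{c,k}$, $g$, and $U$ are introduced here
only for illustration and do not refer to any particular system:
\begin{align*}
  \text{bag of features:}\quad
  & \hat{y} = \operatorname*{argmax}_{c} \sum_{k} \theta_{c,k}\, f_k(x),\\
  \text{composition:}\quad
  & \vec{h} = g\left(\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{|x|}\right),\quad
    \hat{y} = \operatorname*{argmax}_{c} \left(U\vec{h}\right)_c.
\end{align*}
In the first case, every feature $f_k$ contributes to the score of
class $c$ independently of all other features.  In the second case,
the representations $\vec{w}_j$ of the parts first interact with each
other inside the composition function $g$, and the class is predicted
only from the resulting joint vector $\vec{h}$.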

% \todo{Unite into a single ``compositionality'' paragraph.}

% \done[inline]{Moilanen and Pulman 2007}

% Among the first who explicitly applied the principle of
% compositionality to the task of message-level sentiment anlysis were
% \citet{Moilanen:07}.  In their work on sentiment composition, the
% authors first determined the polarity of terminal nodes in an
% automatically costructed syntactic tree, and then used a meticulously
% developed set of manual rules in a bottom-up fashion to propagate
% semantic orientation of descendants to their grammatical parent nodes.
% Considering the polarity of the root as the sentiment class of the
% whole sentence, \citet{Moilanen:07} achieved 91.33\% accuracy in
% predicting the polarities of news headlines from SemEval-2007
% \cite{Strapparava:07}.

% \done[inline]{\citet{Choi:08}}

% \citet{Choi:08} directly incorporated compositional semantic
% heuristics into the training and inference procedure of their method.
% For this purpose, the authors first used a simple SVM classifier to
% predict a hidden semantic class
% $z_i\in\{\textrm{positive}, \textrm{negative}, \textrm{negator},
% \textrm{none}\}$ for each token $x_i$ of the input sequence
% $\mathbf{x}:=\left(x_1,\ldots,x_i\right)$, and then applied a set of
% hand-coded semantic rules (such as ``swap the polarity to the opposite
% if the preceding token was a negator'') to derive the overall semantic
% orientation of the whole instance.  With this two-stage system,
% \citet{Choi:08} were able to outperform both a purely rule-based
% approach and bag-of-features SVM method, reaching 90.7\% accuracy on
% predicting the contextual polarity of polar terms in the MPQA corpus
% \cite{Wiebe:05}.

% \done[inline]{\citet{Nakagawa:10}}

% Another compositional method was proposed by~\citet{Nakagawa:10}, who
% applied tree-structured CRFs to automatically derived dependency
% trees, trying to predict the polarity of the root and considering
% semantic orientation of intermediate nodes as hidden variables.  Using
% an extensive set of features including sentiment lexicons and polarity
% reversal lists, the authors achieved 86.1\% accuracy on the MPQA
% corpus \cite{Wiebe:05}, outperforming plain rule-based and
% bag-of-features approaches.

Among the first who explicitly incorporated the compositionality
principle into a DL-based sentiment application were
\citet{Yessenalina:11}.  In their proposed matrix-space approach, the
authors represented each word $w$ of an input phrase $\mathbf{x}^i =
w^i_1, w^i_2, \ldots, w^i_{|\mathbf{x}^i|}$ as a matrix
$W_{w}\in\mathbb{R}^{m\times m}$ and computed the sentiment score
$\xi^i$ of this phrase as the product of its token matrices,
multiplying the final result by two auxiliary model parameters
$\vec{u}$ and $\vec{v}\in\mathbb{R}^m$ to get a scalar value:
\begin{align*}
  \xi^i &= \vec{u}^\top\left(\prod_{j=1}^{|\mathbf{x}^i|}W_{w^i_j}\right)\vec{v}.
\end{align*}
After computing this term, they predicted the intensity and polarity
of the phrase on a five-level sentiment scale (ranging from very
negative to very positive) by comparing $\xi^i$ with automatically
derived thresholds.  With this system, \citeauthor{Yessenalina:11}
attained a ranking loss of 0.6375 on the MPQA corpus~\cite{Wiebe:05},
outperforming the traditional PRank algorithm \cite{Crammer:01} and
bag-of-words ordered logistic regression.
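
A toy instance with $m = 2$ illustrates how matrix multiplication can
model polarity reversal.  All values below are hypothetical and chosen
by hand for the sake of the example; they are not parameters learned
by the original system:
\begin{align*}
  \vec{u} = \vec{v} = \begin{bmatrix}1\\ 0\end{bmatrix},\quad
  W_{\textrm{good}} = \begin{bmatrix}2 & 0\\ 0 & 1\end{bmatrix},\quad
  W_{\textrm{not}} = \begin{bmatrix}-1 & 0\\ 0 & 1\end{bmatrix}.
\end{align*}
For the one-word phrase ``good,'' the score is
$\vec{u}^\top W_{\textrm{good}}\vec{v} = 2$, whereas for the phrase
``not good'' we obtain
\begin{align*}
  \vec{u}^\top W_{\textrm{not}} W_{\textrm{good}} \vec{v}
  = \begin{bmatrix}1 & 0\end{bmatrix}
    \begin{bmatrix}-2 & 0\\ 0 & 1\end{bmatrix}
    \begin{bmatrix}1\\ 0\end{bmatrix} = -2,
\end{align*}
i.e., the negation matrix flips the sign of the score while preserving
its magnitude.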
% method, \citet{Yessenalina:11} outperformed the PRank algorithm and
% bag-of-words ordered logistic regression on the MPQA
% corpus~\cite{Wiebe:05}, attaining the lowest ranking loss (0.6375)
% among these systems.

Almost simultaneously with this work, \citet{Socher:11} introduced a
deep recursive autoencoder (RAE), in which they obtained a fixed-width
vector representation for a complex phrase $\vec{w}_p$ by recursively
merging the vectors of its tokens over a binarized dependency tree,
first multiplying these vectors with a compositional matrix $W$ and
then applying a non-linear function ($softmax$) to the resulting
product:
\begin{align}
  \vec{w}_p &= softmax\left(W\begin{bmatrix}
      \vec{w}_l\\
      \vec{w}_r
  \end{bmatrix}\right),\label{cgsa:eq:socher-11}
\end{align}
where $\vec{w}_l$ and $\vec{w}_r$ represent the embeddings of the left
and right dependents respectively.  By applying a max-margin
classifier to the final phrase vector, the authors could improve the
state of the art on predicting sentence-level polarity of users' blog
posts~\cite{Potts:10} and also outperformed the system
of~\citet{Nasukawa:03} on the MPQA dataset~\cite{Wiebe:05}, achieving
86.4\% accuracy on predicting contextual polarity of opinionated
expressions.
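
For instance, for a hypothetical three-word phrase $w_1\, w_2\, w_3$
whose binarized tree first joins $w_2$ with $w_3$ and then attaches
$w_1$ to the resulting node, Equation~\ref{cgsa:eq:socher-11} is
applied twice (assuming $\vec{w}_j\in\mathbb{R}^n$ and
$W\in\mathbb{R}^{n\times 2n}$):
\begin{align*}
  \vec{p}_1 &= softmax\left(W\begin{bmatrix}
      \vec{w}_2\\
      \vec{w}_3
  \end{bmatrix}\right), &
  \vec{w}_p &= softmax\left(W\begin{bmatrix}
      \vec{w}_1\\
      \vec{p}_1
  \end{bmatrix}\right).
\end{align*}
Since $W$ maps a $2n$-dimensional input to an $n$-dimensional output,
the intermediate node $\vec{p}_1$ has the same dimensionality as a
single word vector, which is what allows the same matrix to be reused
at every level of the tree.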

Later on, \citet{Socher:12} further improved this approach by
associating an additional matrix $W_w$ with each vocabulary word $w$
and performing the inference simultaneously over both vector and
matrix representations:
\begin{align*}
  \vec{w}_p &= \tanh\left(W_v \begin{bmatrix}W_r\vec{w}_l\\
      W_l \vec{w}_r\end{bmatrix} \right),\\
  W_p &= W_m \begin{bmatrix}W_l\\
    W_r\end{bmatrix};
\end{align*}
where $\vec{w}_p\in\mathbb{R}^n$ stands for the embedding of the
parent node, $\vec{w}_l$ and $\vec{w}_r$ represent the embeddings of
its left and right dependents, and $W_p, W_l, W_r \in
\mathbb{R}^{n\times n}$ denote the respective matrices associated with
these vertices.  The compositionality matrices
$W_v\in\mathbb{R}^{n\times 2n}$ and $W_m\in\mathbb{R}^{n\times 2n}$
were shared across all instances and learned along with the vector
embeddings.  This model, called Matrix-Vector Recursive Neural Network
(MVRNN), surpassed the RAE system on the IMDB movie review
dataset~\cite{Pang:05}, attaining a Kullback--Leibler divergence of 0.91
between the assigned scores and probabilities of correct labels.

Yet another improvement, a Recursive Neural Tensor Network (RNTN), was
presented by~\citet{Socher:13}.  In this system, the authors again
opted for a vector representation of words, but enhanced the original
matrix-vector product from Equation~\ref{cgsa:eq:socher-11} with an
additional tensor multiplication:
\begin{align*}
  \vec{w}_p &= softmax\left(\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
  \end{bmatrix}^{\top}V^{[1:n]}\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
  \end{bmatrix}
            + W\begin{bmatrix}
  \vec{w}_l\\
  \vec{w}_r
\end{bmatrix}\right),\label{cgsa:eq:socher-13}
\end{align*}
where $\vec{w}_p, \vec{w}_l, \vec{w}_r\in\mathbb{R}^n$ and
$W\in\mathbb{R}^{n\times 2n}$ are defined as before, and $V$
represents a $2n\times 2n\times n$-dimensional tensor.  This tensor
increases the number of parameters in comparison with the RAE
approach, but still requires considerably fewer of them than the MVRNN
method.  With this model, the authors improved the classification
accuracy on their own Stanford Sentiment Treebank from 82.9 to
85.4\%.
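To put this trade-off into perspective, the following rough count
(our own illustration, not a figure taken from the cited papers)
considers only the composition parameters that appear in the equations
above, ignoring biases, word vectors, and classification layers.  It
assumes a hypothetical embedding size of $n=25$ and a hypothetical
vocabulary of $N=10{,}000$ words:
\begin{align*}
  \textrm{composition matrix $W$ alone:}\quad & 2n^2 = 1{,}250,\\
  \textrm{RNTN ($W$ and $V$):}\quad & 2n^2 + 4n^3 = 1{,}250 + 62{,}500 = 63{,}750,\\
  \textrm{MVRNN ($W_v$, $W_m$, and one $W_w$ per word):}\quad & 4n^2 + Nn^2 = 2{,}500 + 6{,}250{,}000 = 6{,}252{,}500.
\end{align*}
Under these assumptions, the tensor term dominates the cost of the
RNTN, whereas the size of the MVRNN grows linearly with the
vocabulary.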

A real breakthrough in the use of deep learning methods for sentiment
analysis of Twitter came with the work of~\citet{Severyn:15}, whose
proposed feed-forward DL system ranked first in SemEval-2015
Subtask~10~A (phrase-level polarity prediction) \cite{Rosenthal:15}
and achieved second place (0.6459~\F$^{+/-}$) in Subtask~10~B
(message-level classification) of this competition.  Drawing on the
ideas of~\citet{Kalchbrenner:14}, the authors devised a simple
convolutional network in which they multiplied pretrained word
embeddings with 300 distinct convolutional kernels each of width 5,
pooled the maximum value of this multiplication for each kernel, and
then passed the results of this pooling to a piecewise linear ReLU
filter with a densely connected softmax layer.  An important aspect of
this approach, which accounted for a huge part of its success, was a
special multi-stage training scheme that was used to optimize the
parameters: In the initial stage of this scheme,
\citeauthor{Severyn:15} first computed Twitter-specific word
embeddings by applying the word2vec algorithm to a large Twitter
corpus.  Afterwards, they pretrained the complete system including the
word vectors, convolutional filters, and inter-layer matrices on a big
set of noisily labeled microblogs from this collection, and, finally,
fine-tuned the parameters of the model on the official SemEval
dataset.
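In our own notation (the symbols below are introduced here purely for
illustration and do not stem from the original paper), the forward
pass of this network can be sketched as follows.  Let
$\vec{x}_i\in\mathbb{R}^{d}$ be the embedding of the $i$-th token,
$\vec{f}_{k,j}\in\mathbb{R}^{d}$ the $j$-th row of the $k$-th
convolutional kernel, and $U$ the weight matrix of the output layer;
bias terms are omitted, and the tweet is assumed to contain at least
five (possibly padded) tokens:
\begin{align*}
  s_{k,i} &= \sum_{j=1}^{5}\vec{f}_{k,j}^{\top}\vec{x}_{i+j-1}, & i &= 1,\ldots,|\mathbf{x}|-4,\\
  p_k &= \max_i s_{k,i}, & k &= 1,\ldots,300,\\
  \vec{z} &= \max(\vec{0}, \vec{p}),\\
  \hat{\vec{y}} &= \operatorname{softmax}(U\vec{z}).
\end{align*}
The order of operations follows the description above: max-pooling
over all positions comes first and yields one value per kernel, and
the ReLU non-linearity is then applied to the resulting
300-dimensional vector $\vec{p}$.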

Later on, this system was further improved by \citet{Deriu:16}, who
increased the number of convolutional layers (applying two layers
instead of one) and simultaneously trained two such models (using
word2vec vectors as input for the first one and passing GloVe
embeddings to the second), joining their output at the end and
achieving this way 0.671~\F$^{+/-}$ on the SemEval-2015 test set.  A
similar enhancement was also proposed by~\citet{Rouvier:16}, who used
three different types of embeddings (word2vec, word2vec specific to
particular parts of speech, and sentiment-tailored vectors), training
separate sets of convolutions for each of these types.

% \done[inline]{SemEval 2016}

% \done[inline]{\citet{Deriu:16}}

% \citet{Deriu:16} extended the system of~\citet{Severyn:15} by
% increasing the number of convolutional layers (using two layers
% instead of just one).  Furthermore, to improve the generalizability of
% their system, the authors trained two such networks with different
% types of input embeddings, taking pretrained word2vec
% vectors~\cite{Mikolov:13} for one classifier and using GloVe
% embeddings~\cite{Pennington:14} for another one.  Similarly to
% \citet{Severyn:15}, \citeauthor{Deriu:16} first fine-tuned these word
% representations on a big collection of tweets, maximizing the context
% prediction objective.  Afterwards, they pretrained the parameters of
% the networks including embeddings and convolutional filters on a big
% set of noisily labeled tweets, and, finally, put the finishing touches
% on the weights of these systems by training them on the officially
% released SemEval data.  The authors united the predictions of both
% networks by training a random forest classifier on top of their output
% vectors, establishing a new state of the art (0.633~\F{}) on the
% SemEval-2016 data.

% \done[inline]{\citet{Rouvier:16}}

% Another approach building on the work of~\citet{Severyn:15} was
% proposed by~\citet{Rouvier:16}.  In contrast to the former system
% which utilized only pre-trained sentiment-flavored word vectors with a
% single convolutional layer, the authors simultaneously harnessed three
% different types of embeddings (word2vec, word2vec specific to
% particular parts of speech, and sentiment-tailored emeddings) each of
% which was trained with a separate set of convolutional filters.
% Besides training deep representations, the authors also used
% hand-crafted features such as sentiment lexicons, emoticons,
% information about elongated words, punctuation marks, and capitalized
% tokens, training a separate multi-layer perceptron on these
% attributes.  In the final step, \citeauthor{Rouvier:16} joined the
% outputs of the two downstream classifiers (deep convolutions and
% perceptron) into a single vector, subsequently passing this vector to
% two fully connected neural network layers with softmax non-linearity
% at the end.  This way, the authors achieved 0.63~\F{} on the
% SemEval-2016 test set, getting second rank in this shared task.

% \done[inline]{\citet{Wang:15}}

% \citet{Wang:15} used an LSTM network to predict the polarity of
% tweets, attaining 84\% accuracy on distinguishing between positive and
% negative microblogs from the SemEval-2013 corpus \cite{Nakov:13}.

% \todo[inline]{\citet{Xu:16}}
%
% \citet{Xu:16} also chose an ensemble approach to message-level
% opinion mining of tweets.  In particular, the authors combined a
% convolutional system with 300 filters of widths three, four, and five
% (using 100 filters for each given width); an LSTM
% classifier~\cite{Hochreiter:97}, taking its output vector from the
% last time step $t$ as the final vote; and a special Bayesian wordvec
% system, which was trained to maximize the probability of a token given
% its surrounding context and the label of the tweet, maximizing this
% probability with the prior likelihood of the respective polarity.
% \citeauthor{Xu:16} united the decisions of these subsystems using soft
% weighting scheme:
% \begin{equation*}
%   y^* = \sum_i w_i y_i,\textrm{, s.t.} \sum_i w_i = 1, \forall i: w_i \geq 0,
% \end{equation*}
% where $y_i$ is the score of the label $y$ returned by the $i$-th
% classifier, and $w_i$ denotes an automatically learned weight for this
% prediction.  This submission achieved 0.617~\F{} on the two polarity
% classes (positive and negative), getting third place among all
% participating submissions.


% \done[inline]{\citet{Baziotis:17}, \citet{Cliche:17},
% \citet{Rouvier:17}}

Although convolutional approaches still show competitive scores and
are hard to outperform in practice, in recent years, they are gradually
being superseded by recurrent neural networks~\cite{Xu:16,Wang:15}.
One of the most prominent such systems was recently proposed
by~\citet{Baziotis:17}.  In their submission to
SemEval~2017~\cite{Rosenthal:17}, the authors used two successive
bidirectional LSTM units (BiLSTMs).  In each of these units, they
concatenated the results of the left-to-right recurrence
($\vec{h}^{(l)}_{i_{\rightarrow}}\in\mathbb{R}^{150}$) with the
respective outputs of the right-to-left loop
($\vec{h}^{(l)}_{i_{\leftarrow}}\in\mathbb{R}^{150}$) and then passed
the result of this concatenation ($\vec{h}_i^{(l)} =
[\vec{h}^{(l)}_{i_{\rightarrow}},
  \vec{h}^{(l)}_{i_{\leftarrow}}]\in\mathbb{R}^{300}$) to the next
layer of the network.  After getting the output of the second BiLSTM,
they united the states of this unit from all time steps $i$ into a
single vector $\vec{a}$ with the help of a special attention
mechanism, in which they first multiplied each BiLSTM state
$\vec{h}_i$ with the respective globally normalized attention score
$a_i$ and then took the sum of these weighted vectors over all $i$
positions:
\begin{align}
  \vec{a} &=
  \sum_{i=1}^{|\mathbf{x}|}a_i\vec{h}^{(2)}_i,\nonumber\\
  \mbox{where }a_i &=
  \frac{\exp(e_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(e_j)},\nonumber\\
  \textrm{s.t. }e_i &=
  \tanh\left(\vec{\alpha}\vec{h}^{(2)}_i + \beta_i\right).\label{eq:cgsa:baziotis-attention}
\end{align}%
The $\vec{\alpha}$ and $\beta$ terms in the above equations denote the
attention parameters (score and bias), which are optimized during the
training process.  To make the final prediction,
\citeauthor{Baziotis:17} multiplied the attention vector $\vec{a}$
with matrix $W$ and computed the softmax of this product, getting
probability scores for each of the three polarity classes and
choosing the label with the maximum score:
\begin{align}
  \hat{y} &= \argmax\left(\operatorname{softmax}(W^\top\vec{a})\right).
\end{align}
\noindent With this approach, the authors attained first
place in Task~4 of SemEval-2017 (0.675~\F{}$^{+/-}$), being on a par
with the system of~\citet{Cliche:17} and even outperforming the method
of~\citet{Rouvier:17} despite the fact that both of these competitors
used ensembles of LSTMs and convolutional networks.
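The dimensions involved in these equations can be traced as follows.
This is a consistency check derived from the sizes stated above; the
shape of $W$ is our inference rather than a figure quoted from the
paper:
\begin{align*}
  \vec{h}^{(2)}_i\in\mathbb{R}^{300},\quad \sum_{i=1}^{|\mathbf{x}|}a_i = 1
  \quad&\Rightarrow\quad \vec{a}\in\mathbb{R}^{300},\\
  W^\top\vec{a}\in\mathbb{R}^{3}
  \quad&\Rightarrow\quad W\in\mathbb{R}^{300\times 3}.
\end{align*}
Two further properties follow directly from the definitions.  First,
because every $e_i$ lies in the interval $(-1, 1)$, the ratio of any
two attention scores, $a_i/a_j = \exp(e_i - e_j)$, is bounded by
$e^2\approx 7.39$, so that no position can be suppressed completely.
Second, since the softmax function preserves the ordering of its
arguments, the predicted label coincides with the index of the largest
component of $W^\top\vec{a}$; the normalization only matters for
training.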

\begin{figure*}[htbp!]
{
  \centering
  \includegraphics[width=\linewidth]{img/baziotis.png}
}
\caption[Neural network of \citet{Baziotis:17}]{Architecture of the
  neural network proposed
  by~\citet{Baziotis:17}}\label{cgsa:fig:baziotis}
\end{figure*}

\subsection{Lexicon-Based Attention}

Even though the approach of~\citet{Baziotis:17} represents the current
state of the art in sentiment analysis of Twitter and yields
extraordinarily good results, in our opinion, this method still leaves
some room for improvement.  This, first of all, concerns the way in
which attention coefficients are computed.  As we can see from
Equation~\ref{eq:cgsa:baziotis-attention}, the magnitude of the
attention score $a_i$ primarily depends on the absolute value of the
BiLSTM outputs and the bias term at the $i$-th position.  This
strategy is plausible if we assume that LSTMs produce higher scores
for polar tokens and that polar terms near the end of the message
usually have a greater influence on the net polarity of the tweet than
subjective words at its beginning.  A crucial prerequisite for this
strategy to work, however, is
\begin{inparaenum}[(i)]
\item that the LSTM layer can already provide sufficiently reliable
  results and
\item that the bias terms do not overly boost the importance of
  irrelevant tokens that just accidentally appeared at favored
  positions.
\end{inparaenum}
Unfortunately, both of these prerequisites are rarely fulfilled in
practice.

In order to overcome these deficiencies, we augmented the original
architecture of~\citet{Baziotis:17} shown in
Figure~\ref{cgsa:fig:baziotis} with two additional types of attention:
a \emph{lexicon-based} and a \emph{context-based} one.  In the former type, we
estimated the importance weight $b_i$ for position $i$ as the polarity
score of the word $w_i$, obtaining this value from our Linear
Projection lexicon and normalizing it by the sum of polarity scores
for all tweet tokens:
\begin{align}
  \vec{b} &= \sum_{i=1}^{|\mathbf{x}|}b_i\vec{h}_i,\nonumber\\
  b_i &= \frac{\exp(f_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(f_j)},\nonumber\\
  \mbox{s.t. }f_i
  &= \left\{
  \begin{array}{ll}
    \tanh(\lvert V[w_i]\rvert + \epsilon) & \textrm{if } w_i\in V,\\
    \tanh(\epsilon) & \textrm{otherwise.}
  \end{array}
  \right.\label{cgsa:eq:lba}
\end{align}%
This way, we hoped to force the network to pay more attention to the
BiLSTM outputs that were produced at the positions of polar terms
rather than favoring arbitrary words in the message.
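As a small numeric illustration of the lexicon-based attention defined
above, consider a hypothetical three-token tweet in which only the
second token is found in the lexicon, with $\lvert V[w_2]\rvert = 1$,
and set $\epsilon = 0$ for simplicity (both values are assumptions
made for this example only):
\begin{align*}
  f_1 = f_3 &= \tanh(0) = 0, &
  f_2 &= \tanh(1)\approx 0.762,\\
  b_1 = b_3 &= \frac{1}{2 + \exp(f_2)}\approx\frac{1}{4.142}\approx 0.241, &
  b_2 &= \frac{\exp(f_2)}{2 + \exp(f_2)}\approx\frac{2.142}{4.142}\approx 0.517.
\end{align*}
More generally, since $0\leq f_i < 1$ for any $\epsilon\geq 0$, the
ratio $b_i/b_j = \exp(f_i - f_j)$ never exceeds $e\approx 2.72$: polar
terms are emphasized, but the remaining tokens are never ignored
entirely.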

Another important factor that can notably affect the polarity of a
microblog is the presence of so-called \emph{valence
  shifters}~\cite[][]{Polanyi:06}, that is, words and phrases such as
``kaum'' (\emph{hardly}) or ``nicht'' (\emph{not}) that can
significantly change (or even reverse) the semantic orientation of
polar terms.  To account for these phenomena, we added another type of
attention, a \emph{context-based} one, whose goal was to identify such
shifters in the message and give them greater weights in the
recursion.  To discern these elements, we introduced a linear
classifier that had to predict the modifying power of a token $w_i$,
given its original word embedding $\vec{w}_i$ and the LSTM output of
its parent in the dependency tree times the lexicon-based attention
score of that parent ($\vec{b}_p := b_p\vec{h}_p$).  To keep the
resulting attention scores within an appropriate range, we again used
the same $\tanh$ transformation and global normalization over all
positions as we did in the previous two types:
\begin{align*}
  \vec{c} &= \sum_{i=1}^{|\mathbf{x}|}c_i\vec{h}_i,\\
  c_i &= \frac{\exp(g_i)}{\sum_{j=1}^{|\mathbf{x}|}\exp(g_j)},\\
  g_i &= \tanh\left(C [\vec{w}_i, \vec{b}_p]^{\top}\right).
\end{align*}
The $C$ term in the above equation represents the context-based
attention matrix $C\in\mathbb{R}^{200 \times 100}$; the $\vec{w}_i$
variable denotes the word embedding of the $i$-th token; and the
$\vec{b}_p$ term stands for the value of vector $\vec{b}$ (the result
of lexicon-based attention from Equation~\ref{cgsa:eq:lba}) at
position $p$ (the index of the syntactic parent of $w_i$).  With this
classifier, we hoped to amplify the importance of shifting words in
the cases when the immediate syntactic ancestors of these tokens were
highly subjective expressions (\eg{} ``Er hat die Pr\"ufung kaum
bestanden'' [\emph{He hardly passed the exam}] or ``Ich mag den neuen
Bundesminister nicht'' [\emph{I do not like the new federal
    minister}]), but ignore them when they did not relate to any
subjective term.

Finally, to make the prediction, we concatenated the outputs of
the three attention layers into a single matrix $A\in\mathbb{R}^{3
  \times 100}$ and multiplied it with a vector
$\vec{w}\in\mathbb{R}^{1\times{}100}$, applying softmax normalization
at the end:
\begin{align*}
  \vec{o} &= \operatorname{softmax}\left(A\vec{w}^\top\right),\textrm{ where}\\
  A &= \begin{bmatrix}
    \vec{a}\\
    \vec{b}\\
    \vec{c}\end{bmatrix}.
\end{align*}

Since introducing additional attention types increased the number of
model parameters, we removed one of the intermediate Bi-LSTM layers in
the network to counterbalance this effect and report our results for
both settings: using one and two Bi-LSTM units (denoted as LBA$^{(1)}$
and LBA$^{(2)}$, respectively).  The final architecture of our
approach is shown in Figure~\ref{cgsa:fig:lba}.

% \done[inline]{\citet{Tang:14b}}

% A hybrid approach to message-level sentiment analysis was proposed by
% \citet{Tang:14b}, who trained a linear SVM classifier on top of
% sentiment-specific word embeddings and hand-crafted features.  To
% obtain the former representations, \citeauthor{Tang:14b} devised a
% simple feed-forward neural network similar to the one used
% by~\citet{Collobert:11}, which, for each token $t$, had to predict the
% probability that this token appeared in the surrounding context and
% the likelihood that $t$ occurred in a positive or negative microblog.
% Since this training required a substantial amount of data, the authors
% leveraged a big collection of automatically downloaded tweets,
% obtaining noisy sentiment labels for these microblogs with the weak
% supervision method of~\citet{Go:09}.  The second part of this
% system---manually designed features---were mostly inspired by the work
% of~\citet{Mohammad:13} and included $n$-grams, word clusters,
% information about emoticons, negtaion, elongated characters and
% punctuation marks.  Combining these different traits into a single
% feature vector resulted significantly improved classification
% accuracy, yielding 0.701~\F{} on the SemEval-2014 test set (second
% place among all competing systems).

\begin{figure*}[htbp!]
{ \centering \includegraphics[width=1.3\linewidth]{img/lba.png} }
\caption[Neural network with lexicon-based attention]{Architecture of
  the neural network with lexicon- and context-based
  attention}\label{cgsa:fig:lba}
\end{figure*}


To evaluate the performance of the previously presented methods and to
compare our lexicon-based attention system with these solutions, we
reimplemented the approaches of~\citet{Yessenalina:11},
\citet{Socher:11,Socher:12,Socher:13}, \citet{Severyn:15},
and~\citet{Baziotis:17}.  For the sake of uniformity and simplicity,
we used 100-dimensional task-specific word embeddings in
all systems, optimizing these vectors along with other network
parameters during the training.  Moreover, we also unified the final
activation parts and cost functions of all networks, using a densely
connected softmax layer as the last component of each classifier and
optimizing their weights w.r.t. the categorical hinge loss on the
training data, picking the values that yielded the highest accuracy on
the development set.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first columm
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\

      %% Yessenalina Commands:
      %% -----------------
      %% cgsa_sentiment train -t yessenalina \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/yessenalina/yessenalina.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/yessenalina/yessenalina.test

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      1.00      0.62       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.20      0.45      0.28      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.84%
      %% Micro-Averaged F1-Score (All Classes): 44.5902%
      Y\&C & 0.45 & \textbf{1.0} & 0.62 & %
      0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.308 & 0.446\\

      %% RAE Commands:
      %% -----------------
      %% cgsa_sentiment train -t rnn \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/rnn/rnn.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/rnn/rnn.test

      %% RAE Results:
      %% ----------------
      %% General Statistics:
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.64      0.78      0.70       680
      %% negative       0.38      0.04      0.08       287
      %% neutral       0.57      0.68      0.62       558
      %% avg / total       0.56      0.60      0.55      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 38.90%
      %% Micro-Averaged F1-Score (All Classes): 60.4590%
      RAE & 0.64 & 0.78 & 0.7 & %
      0.38 & 0.04 & 0.08 & %
      0.57 & 0.68 & 0.62 & %
      0.389 & 0.605\\

      %% MVRNN Commands:
      %% -----------------
      %% cgsa_sentiment train -t mvrnn \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/mvrnn/mvrnn.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/mvrnn/mvrnn.test

      %% MVRNN Results:
      %% ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      1.00      0.62       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.20      0.45      0.28      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.84%
      %% Micro-Averaged F1-Score (All Classes): 44.5902%
      MVRNN & 0.45 & \textbf{1.0} & 0.62 & %
      0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.308 & 0.446\\

      %% RNTN Commands:
      %% -----------------
      %% cgsa_sentiment train -t rntn \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/rntn/rntn.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/rntn/rntn.test

      %% RNTN Results: *
      %% ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      0.87      0.59       680
      %% negative       0.19      0.02      0.03       287
      %% neutral       0.32      0.10      0.15       558
      %% avg / total       0.35      0.43      0.33      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 31.15%
      %% Micro-Averaged F1-Score (All Classes): 42.8197%
      RNTN & 0.45 & 0.87 & 0.59 & %
      0.19 & 0.02 & 0.03 & %
      0.32 & 0.1 & 0.15 & %
      0.312 & 0.428\\

      %% Severyn Commands:
      %% -----------------
      %% cgsa_sentiment train -t severyn \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/severyn/severyn.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/severyn/severyn.test

      %% Severyn Results:
      %% ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.73      0.79      0.76       680
      %% negative       0.41      0.52      0.46       287
      %% neutral       0.72      0.55      0.62       558
      %% avg / total       0.67      0.65      0.65      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 60.84%
      %% Micro-Averaged F1-Score (All Classes): 65.1148%
      SEV & 0.73 & 0.79 & 0.76 & %
      \textbf{0.41} & \textbf{0.52} & \textbf{0.46} & %
      \textbf{0.72} & 0.55 & 0.62 & %
      \textbf{0.608} & 0.651\\

      %% Baziotis Commands:
      %% -----------------
      %% cgsa_sentiment train -t baziotis \
      %% data/PotTS/preprocessed/train/*.tsv data/PotTS/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/PotTS/preprocessed/test/*.tsv\
      %% > data/PotTS/preprocessed/predicted/baziotis/baziotis.test

      %% cgsa_evaluate data/PotTS/preprocessed/test/ \
      %% data/PotTS/preprocessed/predicted/baziotis/baziotis.test

      %% Baziotis Results:
      %% ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      1.00      0.62       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.20      0.45      0.28      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.84%
      %% Micro-Averaged F1-Score (All Classes): 44.5902%
      BAZ & 0.45 & \textbf{1.0} & 0.62 & %
      0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.308 & 0.446\\

      % Lexicon-based Attention (1 BiLSTM) @
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.82      0.73      0.77       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.56      0.92      0.69       558
      %% avg / total       0.57      0.66      0.60      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 38.71%
      %% Micro-Averaged F1-Score (All Classes): 66.2295%
      LBA$^{(1)}$ & \textbf{0.82} & 0.73 & \textbf{0.77} & %
      0.0 & 0.0 & 0.0 & %
      0.56 & \textbf{0.92} & \textbf{0.69} & %
      0.387 & \textbf{0.662}\\

      % Lexicon-based Attention (2 BiLSTM) @
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      1.00      0.62       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.20      0.45      0.28      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.84%
      %% Micro-Averaged F1-Score (All Classes): 44.5902%
      LBA$^{(2)}$ & 0.45 & \textbf{1.0} & 0.62 & %
      0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.308 & 0.446\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\

      %% Yessenalina Commands:
      %% -----------------
      %% cgsa_sentiment train -t yessenalina \
      %% data/SB10k/preprocessed/train/*.tsv data/SB10k/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/SB10k/preprocessed/test/*.tsv\
      %% > data/SB10k/preprocessed/predicted/yessenalina/yessenalina.test

      %% cgsa_evaluate data/SB10k/preprocessed/test/ \
      %% data/SB10k/preprocessed/predicted/yessenalina/yessenalina.test

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.1658%
      Y\&C & 0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & \textbf{1.0} & 0.77 & %
      0.0 & 0.622\\

      %% RAE Commands:
      %% -------------
      %% cgsa_sentiment train -t rnn \
      %% data/SB10k/preprocessed/train/*.tsv data/SB10k/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/SB10k/preprocessed/test/*.tsv\
      %% > data/SB10k/preprocessed/predicted/rnn/rnn.test

      %% cgsa_evaluate data/SB10k/preprocessed/test/ \
      %% data/SB10k/preprocessed/predicted/rnn/rnn.test

      %% RAE Results:
      %% ------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.63      0.57      0.60       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.75      0.94      0.83       930
      %% avg / total       0.61      0.72      0.66      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 29.94%
      %% Micro-Averaged F1-Score (All Classes): 72.0588%
      RAE & 0.63 & 0.57 & 0.6 & %
      0.0 & 0.0 & 0.0 & %
      \textbf{0.75} & 0.94 & 0.83 & %
      0.299 & 0.721\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.1658%
      MVRNN & 0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & \textbf{1.0} & 0.77 & %
      0.0 & 0.622\\

      %% General Statistics:
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.20      0.03      0.05       354
      %% negative       0.07      0.01      0.02       212
      %% neutral       0.62      0.94      0.75       930
      %% avg / total       0.44      0.59      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 3.30%
      %% Micro-Averaged F1-Score (All Classes): 59.3583%
      RNTN & 0.2 & 0.03 & 0.05 & %
      0.07 & 0.01 & 0.02 & %
      0.62 & 0.94 & 0.75 & %
      0.033 & 0.594\\

      % Severyn Commands:
      % -----------------
      % cgsa_sentiment train -t severyn \
      % data/SB10k/preprocessed/train/*.tsv data/SB10k/preprocessed/dev/*.tsv
      %
      % cgsa_sentiment test -m cgsa/data/models/cgsa.model data/SB10k/preprocessed/test/*.tsv\
      % > data/SB10k/preprocessed/predicted/severyn/severyn.test
      %
      % cgsa_evaluate data/SB10k/preprocessed/test/ \
      % data/SB10k/preprocessed/predicted/severyn/severyn.test
      %
      % Severyn Results:
      % ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.1658%
      SEV & 0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & \textbf{1.0} & 0.77 & %
      0.0 & 0.622\\

      %% Baziotis Commands:
      %% -----------------
      %% cgsa_sentiment train -t baziotis \
      %% data/SB10k/preprocessed/train/*.tsv data/SB10k/preprocessed/dev/*.tsv

      %% cgsa_sentiment test -m cgsa/data/models/cgsa.model data/SB10k/preprocessed/test/*.tsv\
      %% > data/SB10k/preprocessed/predicted/baziotis/baziotis.test

      %% cgsa_evaluate data/SB10k/preprocessed/test/ \
      %% data/SB10k/preprocessed/predicted/baziotis/baziotis.test

      %% Baziotis Results:
      %% ----------------
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.75      0.47      0.58       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.71      0.98      0.83       930
      %% avg / total       0.62      0.72      0.65      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 29.07%
      %% Micro-Averaged F1-Score (All Classes): 71.9920%
      BAZ & 0.75 & 0.47 & 0.58 & %
      0.0 & 0.0 & 0.0 & %
      0.71 & 0.98 & 0.83 & %
      0.291 & 0.72\\

      % Lexicon-based Attention (1 BiLSTM) @b263b917dfe8755b10b9759f4ed0cecffa3534a1
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.72      0.58      0.64       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.74      0.97      0.84       930
      %% avg / total       0.63      0.74      0.67      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 32.08%
      %% Micro-Averaged F1-Score (All Classes): 73.7299%
      LBA$^{(1)}$ & 0.72 & \textbf{0.58} & \textbf{0.64} & %
      0.0 & 0.0 & 0.0 & %
      0.74 & 0.97 & \textbf{0.84} & %
      \textbf{0.321} & \textbf{0.737}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.76      0.49      0.60       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.72      0.98      0.83       930
      %% avg / total       0.62      0.72      0.65      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 29.79%
      %% Micro-Averaged F1-Score (All Classes): 72.2594%
      LBA$^{(2)}$ & \textbf{0.76} & 0.49 & 0.6 & %
      0.0 & 0.0 & 0.0 & %
      0.72 & 0.98 & 0.83 & %
      0.298 & 0.723\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of DL-based MLSA methods]{Results of
      deep-learning--based MLSA methods\\ {\small Y\&C~--~\citet{Yessenalina:11},
        RAE~--~Recursive Auto-Encoder \cite{Socher:11},
        MVRNN~--~Matrix-Vector RNN \cite{Socher:12}, RNTN~--~Recursive
        Neural-Tensor Network \cite{Socher:13},
        SEV~--~\citet{Severyn:15}, BAZ~--~\citet{Baziotis:17},
        LBA$^{(1)}$~--~lexicon-based attention with one Bi-LSTM
        layer, LBA$^{(2)}$~--~lexicon-based attention with two Bi-LSTM
        layers}}\label{snt-cgsa:tbl:dl-res}
  \end{center}
\end{table}

The results of this evaluation are shown in
Table~\ref{snt-cgsa:tbl:dl-res}.  As we can see from the figures, the
LBA method performs fairly well, especially on the positive and
neutral classes, where it achieves the best \F-scores on both
datasets and also attains the highest overall micro-averaged \F-scores
on all test samples (0.662 on PotTS and 0.737 on SB10k).  Even though
our approach also yields the best macro-averaged result on the SB10k
set~(0.321~\F), it struggles with the extreme label skewness of this
corpus and fails to predict a single negative tweet in the test set.
This problem appears to be an insurmountable hurdle for almost all
other compared systems as well, especially the matrix-space, MVRNN,
and convolutional approaches, which end up predicting only the most
common neutral label for all messages in this dataset.  The only
notable exception is the recursive neural tensor approach
of~\citet{Socher:13}, which succeeds in classifying some of the
negative instances and also predicts positive and neutral labels, but
whose precision and recall are still far below an acceptable level.
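
The seemingly paradoxical combination of the best macro-averaged score
and a zero \F{} on the negative class follows directly from how the
macro-average is formed.  As a rough check, assuming (as the column
header $^{+/-}$ suggests) that this metric is the unweighted mean of
the positive and negative \F-scores, and writing $F^{+}$ and $F^{-}$
for these two scores (our notation for this illustration only), the
rounded figures of LBA$^{(1)}$ on SB10k give:
\[
  \frac{F^{+} + F^{-}}{2} = \frac{0.64 + 0.0}{2} = 0.32,
\]
which matches the reported 0.321 up to rounding.  The score is thus
earned by the positive class alone.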

A similar, though less severe, situation is also observed on the PotTS
corpus.  This time, the Y\&C, MVRNN, BAZ, and LBA$^{(2)}$ methods
lapse into always predicting the most frequent positive class.
Other systems, however, perform much better, especially the approach
of~\citet{Severyn:15}, which does an extraordinarily good job at
classifying negative messages, reaching a remarkable 0.46~\F{} on this
subset and also attaining the best macro-averaged score (0.608) on all
tweets due to its competitive performance on positive and neutral
microblogs.  Nevertheless, even the best-performing DL systems (SEV
and LBA) lag far behind the traditional supervised machine-learning
method of~\citet{Mohammad:13}, and barely outperform the lexicon-based
approach of~\citet{Hu:04} in terms of the micro-averaged \F{} on
SB10k.  Two possible explanations for these mediocre scores are a poor
initialization of the parameters, which prevented the optimizers from
finding a good solution to the optimization objective, and an
insufficient amount of training data, which led to severe overfitting
to the training set and poor generalization to unseen examples.  We
will now investigate both of these factors in detail.

% \section{Message-Level Sentiment Analysis Using Language and Domain
%   Adaptation}\label{sec:cgsa:domain-adaptation}

% One of the first works which pointed out the importance of domain
% adaptation for sentiment analysis was introduced by~\citet{Aue:05}.
% In their experiments, the authors trained separate SVM classifiers on
% four different document sets: movie reviews, book reviews, customer
% feedback from a product support service, and a feedback survey from a
% customer knowledge base; finding that each classifier performed best
% when applied to the same domain as it was trained on.  In order to
% find an optimal way of overcoming this domain specificity,
% \citet{Aue:05} tried out four different options:
% \begin{inparaenum}[(i)]
% \item\label{sent-cgsa:lst:rel-wrk1} training one classifier on all but
%   the target domain and applying it to the latter;
% \item using the same procedure as above, but limiting the features to
%   only those which also appeared in the target texts;
% \item taking an ensemble of individual classifiers each of which was
%   trained on a different data collection; and, finally,
% \item using a minimal subset of labeled in-domain data to train a
%   Na{\"i}ve Bayes system with the expectation-maximization algorithm
%   \cite[EM;][]{Dempster:77}.
% \end{inparaenum}
% The authors found that the ensemble and EM options worked best for
% their cross-domain task, achieving an accuracy of up to 82.39\% for
% the two-class prediction (positive vs negative) on new unseen text
% genres.

% Another notable milestone in the domain adaptation research was set
% by~\citet{Blitzer:07}.  Relying on their previous work on structural
% correspondence learning~\cite{Blitzer:07}, in which they used a set of
% \emph{pivot features} (features which frequently appeared in both
% target and source domains) to find an optimal correspondence of the
% remaining attributes,\footnote{In particular, the authors trained $m$
%   binary predictors for each of their $m$ pivot features in order to
%   find other attributes which frequently co-occurred with the pivots.
%   Afterwards, they composed these $m$ resulting weight vectors into a
%   single matrix $W := [\vec{w}_{1},\ldots,\vec{w}_{m}]$, took an SVD
%   decomposition of this matrix, and used the top $h$ left singular
%   vectors to translate source features to the new domain.} the authors
% refined their method by pre-selecting the pivots using their PMI
% scores and improving misaligned feature projections using a small set
% of labeled target examples.  With these modifications,
% \citeauthor{Blitzer:07} were able to reduce the average adaptation
% loss (the accuracy drop when transferring a classifier to a different
% domain) from 9.1 to 4.9~percent when testing a sentiment predictor on
% the domains of book, DVD, electrical appliances, and kitchen reviews.

% Other important works on domain adaptation for opinion mining include
% those of~\citet{Read:05}, who pointed out that sentiment
% classification might depend not only on the domain but also on the
% topic, time, and language style in which the text was written; and
% \citet{Tan:07}, who proposed using the classifier trained on the
% source domain to classify unlabeled instances from the target genre,
% and then iteratively retraining the system on the enriched data set.
% Finally, \citet{Andreevskaia:08} proposed a combination of lexicon-
% and ML-based systems, claiming that this ensemble would be more
% resistant to the domain shift than either of these classifiers on its
% own.

% Another line of research was introduced by~\citet{Glorot:11} who
% proposed stacked denoising autoencoders (SDA)---a neural network
% architecture in which an input vector $\vec{x}$ was first mapped to a
% smaller representation $\vec{x}'$ via some function
% $h: \vec{x}\mapsto\vec{x}'$, and then restored to its approximate
% original state via an inverse transformation
% $g: \vec{x}'\mapsto\vec{x}''\approx\vec{x}$.  In their experiments,
% the authors optimized the parameters of the functions $h$ and $g$ on
% both target and source data, getting approximate representations of
% instances from both data sets; and then trained a linear SVM
% classifier on the restored representations of the source instances,
% subsequently applying this classifier to the target domain.  This
% approach was further refined by~\citet{Chen:12} who analytically
% computed the reconstruction function~$g$, and used both original and
% restored features to predict the polarity labels of the target
% data.\footnote{Both approaches were trained and tested on the Amazon
%   Review Corpus of~\citet{Blitzer:07}.}


% Further notable contributions to domain adaptation in general were
% made by~\citet{Daume:07}, who proposed to replicate each extracted
% feature three times and train the first replication on both domains,
% the second replication only on the source, and the third copy only on
% the target domain, for which he assumed a small subset of labeled
% examples was available; \citet{Yang:15}, who trained neural embeddings
% of features, trying to predict which instance attributes frequently
% co-occurred with each other;

\subsection{Word Embeddings}

As in the previous chapters, we decided to replace the randomly
initialized word vectors in the very first layer of the vector-based
neural networks with pretrained word2vec embeddings, keeping these
vectors fixed during the optimization.  As we can see from the
figures in Table~\ref{snt-cgsa:tbl:dl-res-word2vec}, this change
considerably improves the results of almost all classifiers.  The two
main exceptions, both observed on PotTS, are the recursive
auto-encoder, whose micro-averaged \F-score drops slightly (from 0.605
to 0.55), and the convolutional approach of~\citet{Severyn:15}, whose
macro-averaged \F{} falls considerably (from 0.608 to 0.36).  Despite
these setbacks, the best observed macro-averaged score increases from
0.608 to 0.64 on the PotTS dataset and rises by about two thirds, from
0.321 to 0.53, on the SB10k data.  A similar trend is observed for the
micro-averaged \F{}, which rises from 0.662 to 0.69 on PotTS and from
0.737 to 0.75 on the SB10k corpus.  Unfortunately, these improvements
usually come at the expense of a lower recall of the majority classes
(positive and neutral, respectively), but the gains in the overall
metrics are generally much higher and, above all, more important
than the losses in these single aspects.

\begin{table}[h]
  \begin{center}
    \bgroup \setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first column
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\
      % Commands:
      % cgsa_sentiment train -t rnn --w2v data/PotTS/preprocessed/train/*.tsv \
      % data/PotTS/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/PotTS/preprocessed/test/*.tsv > \
      % data/PotTS/preprocessed/predicted/rnn/rnn.word2vec.test
      % cgsa_evaluate data/PotTS/preprocessed/test/ \
      % data/PotTS/preprocessed/predicted/rnn/rnn.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.58      0.74      0.65       680
      %% negative       0.34      0.26      0.29       287
      %% neutral       0.59      0.46      0.52       558
      %% avg / total       0.54      0.55      0.54      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 47.36%
      %% Micro-Averaged F1-Score (All Classes): 54.8852%
      RAE & 0.58\negdelta{0.06} & 0.74\negdelta{0.04} & 0.65\negdelta{0.05} & %
      0.34\negdelta{0.04} & 0.26\posdelta{0.22} & 0.29\posdelta{0.21} & %
      0.59\posdelta{0.02} & 0.46\negdelta{0.22} & 0.52\negdelta{0.1} & %
      0.47\posdelta{0.08} & 0.55\negdelta{0.06}\\

      % Commands:
      % cgsa_sentiment train -t rntn --w2v data/PotTS/preprocessed/train/*.tsv \
      % data/PotTS/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/PotTS/preprocessed/test/*.tsv > \
      % data/PotTS/preprocessed/predicted/rntn/rntn.word2vec.test
      % cgsa_evaluate data/PotTS/preprocessed/test/ \
      % data/PotTS/preprocessed/predicted/rntn/rntn.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.48      0.77      0.59       680
      %% negative       0.33      0.03      0.06       287
      %% neutral       0.46      0.33      0.38       558
      %% avg / total       0.44      0.47      0.41      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 32.54%
      %% Micro-Averaged F1-Score (All Classes): 46.8852%
      RNTN & 0.48\posdelta{0.03} & 0.77\negdelta{0.1} & 0.59 & %
      0.33\posdelta{0.14} & 0.03\posdelta{0.01} & 0.06\posdelta{0.03} & %
      0.46\posdelta{0.14} & 0.33\posdelta{0.23} & 0.38\posdelta{0.01} & %
      0.33\posdelta{0.02} & 0.47\posdelta{0.04}\\

      % Commands:
      % cgsa_sentiment train -t severyn --w2v data/PotTS/preprocessed/train/*.tsv \
      % data/PotTS/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/PotTS/preprocessed/test/*.tsv > \
      % data/PotTS/preprocessed/predicted/severyn/severyn.word2vec.test
      % cgsa_evaluate data/PotTS/preprocessed/test/ data/PotTS/preprocessed/predicted/severyn/severyn.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.69      0.74      0.72       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.58      0.84      0.69       558
      %% avg / total       0.52      0.64      0.57      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 35.78%
      %% Micro-Averaged F1-Score (All Classes): 63.6066%
      SEV & 0.69\negdelta{0.04} & 0.74\negdelta{0.05} & 0.72\negdelta{0.04} & %
      0.0\negdelta{0.41} & 0.0\negdelta{0.52} & 0.0\negdelta{0.46} & %
      0.58\negdelta{0.14} & 0.84\posdelta{0.29} & 0.69\posdelta{0.07} & %
      0.36\negdelta{0.25} & 0.64\negdelta{0.01}\\

      % Commands:
      % cgsa_sentiment train -t baziotis --w2v data/PotTS/preprocessed/train/*.tsv \
      % data/PotTS/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/PotTS/preprocessed/test/*.tsv > \
      % data/PotTS/preprocessed/predicted/baziotis/baziotis.word2vec.test
      % cgsa_evaluate data/PotTS/preprocessed/test/ data/PotTS/preprocessed/predicted/baziotis/baziotis.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.85      0.61      0.71       680
      %% negative       0.57      0.32      0.41       287
      %% neutral       0.55      0.87      0.68       558
      %% avg / total       0.69      0.65      0.64      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 55.88%
      %% Micro-Averaged F1-Score (All Classes): 65.0492%
      BAZ & 0.85\posdelta{0.4} & 0.61\negdelta{0.39} & 0.71\posdelta{0.09} & %
      0.57\posdelta{0.57} & 0.32\posdelta{0.32} & 0.41\posdelta{0.41} & %
      0.55\posdelta{0.55} & 0.87\posdelta{0.87} & 0.68\posdelta{0.68} & %
      0.56\posdelta{0.25} & 0.65\posdelta{0.2}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.86      0.60      0.71       680
      %% negative       0.61      0.46      0.53       287
      %% neutral       0.60      0.89      0.72       558
      %% avg / total       0.72      0.68      0.68      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 61.71%
      %% Micro-Averaged F1-Score (All Classes): 68.1311%
      LBA$^{(1)}$ & 0.86\posdelta{0.04} & 0.6\negdelta{0.13} & 0.71\negdelta{0.06} & %
      0.61\posdelta{0.61} & 0.46\posdelta{0.46} & 0.53\posdelta{0.53} & %
      0.6\posdelta{0.04} & 0.89\negdelta{0.03} & 0.72\posdelta{0.03} & %
      0.62\posdelta{0.23} & 0.68\posdelta{0.02}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.84      0.65      0.73       680
      %% negative       0.57      0.54      0.55       287
      %% neutral       0.63      0.82      0.72       558
      %% avg / total       0.71      0.69      0.69      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 64.22%
      %% Micro-Averaged F1-Score (All Classes): 69.2459%
      LBA$^{(2)}$ & 0.84\posdelta{0.39} & 0.65\negdelta{0.35} & 0.73\posdelta{0.11} & %
      0.57\posdelta{0.57} & 0.54\posdelta{0.54} & 0.55\posdelta{0.55} & %
      0.63\posdelta{0.63} & 0.82\posdelta{0.82} & 0.72\posdelta{0.72} & %
      0.64\posdelta{0.33} & 0.69\posdelta{0.24}\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\
      % Commands:
      % cgsa_sentiment train -t rnn --w2v data/SB10k/preprocessed/train/*.tsv \
      % data/SB10k/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/SB10k/preprocessed/test/*.tsv > \
      % data/SB10k/preprocessed/predicted/rnn/rnn.word2vec.test
      % cgsa_evaluate data/SB10k/preprocessed/test/ \
      % data/SB10k/preprocessed/predicted/rnn/rnn.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.61      0.56      0.58       354
      %% negative       0.29      0.01      0.02       212
      %% neutral       0.74      0.92      0.82       930
      %% avg / total       0.64      0.71      0.65      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.13%
      %% Micro-Averaged F1-Score (All Classes): 70.6551%
      RAE & 0.61\negdelta{0.02} & 0.56\negdelta{0.01} & 0.58\negdelta{0.02} & %
      0.29\posdelta{0.29} & 0.01\posdelta{0.01} & 0.02\posdelta{0.02} & %
      0.74\negdelta{0.01} & 0.92\negdelta{0.02} & 0.82\negdelta{0.01} & %
      0.3 & 0.71\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.54      0.02      0.04       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.63      1.00      0.77       930
      %% avg / total       0.52      0.62      0.49      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 1.91%
      %% Micro-Averaged F1-Score (All Classes): 62.4332%
      RNTN & 0.54\posdelta{0.34} & 0.02\negdelta{0.01} & 0.04\negdelta{0.01} & %
      0.0\negdelta{0.07} & 0.0\negdelta{0.01} & 0.0\negdelta{0.02} & %
      0.63\posdelta{0.01} & 1.0\posdelta{0.06} & 0.77\posdelta{0.02} & %
      0.02\negdelta{0.01} & 0.62\posdelta{0.03}\\

      % Commands:
      % cgsa_sentiment train -t severyn --w2v data/SB10k/preprocessed/train/*.tsv \
      % data/SB10k/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/SB10k/preprocessed/test/*.tsv > \
      % data/SB10k/preprocessed/predicted/severyn/severyn.word2vec.test
      % cgsa_evaluate data/SB10k/preprocessed/test/ data/SB10k/preprocessed/predicted/severyn/severyn.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.72      0.50      0.59       354
      %% negative       0.49      0.27      0.35       212
      %% neutral       0.75      0.92      0.82       930
      %% avg / total       0.71      0.72      0.70      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 46.86%
      %% Micro-Averaged F1-Score (All Classes): 72.4599%
      SEV & 0.72\posdelta{0.72} & 0.5\posdelta{0.5} & 0.59\posdelta{0.59} & %
      0.49\posdelta{0.49} & 0.27\posdelta{0.27} & 0.35\posdelta{0.35} & %
      0.75\posdelta{0.13} & 0.92\negdelta{0.08} & 0.82\posdelta{0.05} & %
      0.47\posdelta{0.47} & 0.72\posdelta{0.1}\\

      % Commands:
      % cgsa_sentiment train -t baziotis --w2v data/SB10k/preprocessed/train/*.tsv \
      % data/SB10k/preprocessed/dev/*.tsv
      % cgsa_sentiment -v test data/SB10k/preprocessed/test/*.tsv > \
      % data/SB10k/preprocessed/predicted/baziotis/baziotis.word2vec.test
      % cgsa_evaluate data/SB10k/preprocessed/test/ data/SB10k/preprocessed/predicted/baziotis/baziotis.word2vec.test
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.78      0.51      0.61       354
      %% negative       0.49      0.42      0.45       212
      %% neutral       0.78      0.91      0.84       930
      %% avg / total       0.74      0.75      0.73      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 53.15%
      %% Micro-Averaged F1-Score (All Classes): 74.5321%
      BAZ & 0.78\posdelta{0.03} & 0.51\posdelta{0.04} & 0.61\posdelta{0.03} & %
      0.49\posdelta{0.49} & 0.42\posdelta{0.42} & 0.45\posdelta{0.45} & %
      0.78\posdelta{0.07} & 0.91\negdelta{0.07} & 0.84\posdelta{0.01} & %
      0.53\posdelta{0.24} & 0.75\posdelta{0.03}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.84      0.42      0.56       354
      %% negative       0.50      0.28      0.36       212
      %% neutral       0.74      0.96      0.84       930
      %% avg / total       0.73      0.73      0.70      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 45.86%
      %% Micro-Averaged F1-Score (All Classes): 73.3289%
      LBA$^{(1)}$ & 0.84\posdelta{0.12} & 0.42\negdelta{0.16} & 0.56\negdelta{0.08} & %
      0.5\posdelta{0.5} & 0.28\posdelta{0.28} & 0.36\posdelta{0.36} & %
      0.74 & 0.96\negdelta{0.01} & 0.84 & %
      0.46\posdelta{0.14} & 0.73\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.79      0.45      0.57       354
      %% negative       0.57      0.23      0.33       212
      %% neutral       0.74      0.96      0.84       930
      %% avg / total       0.73      0.74      0.70      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 45.04%
      %% Micro-Averaged F1-Score (All Classes): 73.6631%
      LBA$^{(2)}$ & 0.79\posdelta{0.03} & 0.45\negdelta{0.04} & 0.57\negdelta{0.03} & %
      0.57\posdelta{0.57} & 0.23\posdelta{0.23} & 0.33\posdelta{0.33} & %
      0.74\posdelta{0.02} & 0.96\negdelta{0.02} & 0.84\posdelta{0.01} & %
      0.45\posdelta{0.15} & 0.74\posdelta{0.02}\\\bottomrule

    \end{tabular}
    \egroup
    \caption[Results of DL-based MLSA methods with pretrained word2vec
      vectors]{Results of deep-learning--based MLSA methods with
      pretrained word2vec vectors}
    \label{snt-cgsa:tbl:dl-res-word2vec}
  \end{center}
\end{table}

In order to see whether these changes would be different if we
optimized the word representations as well, we reran our experiments,
initializing the word vectors with word2vec embeddings as before, but
allowing them to be updated during training.  Moreover, to
approximate task-specific representations of words that were missing
from the training set, we also computed the optimal transformation
matrix for converting the original word2vec vectors into optimized
sentiment embeddings using ordinary least squares, as we did in the
previous chapters, and used this matrix to derive task-specific
vectors at test time.
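
In matrix form, this least-squares step can be sketched as follows;
the symbols $X$, $Y$, and $W$ are introduced here purely for
illustration, and the invertibility of $X^{\top}X$ is an assumption
of this sketch.  Let the rows of the $n\times d$ matrix $X$ be the
(transposed) original word2vec vectors $\vec{x}_{i}$ of the $n$ words
seen during training, and let the rows of the $n\times d$ matrix $Y$
be the vectors $\vec{y}_{i}$ of the same words after fine-tuning.
The transformation matrix is then
\[
  W^{*} = \mathop{\mathrm{argmin}}_{W}\,\sum_{i=1}^{n}
  \left\| W^{\top}\vec{x}_{i} - \vec{y}_{i} \right\|_{2}^{2}
  = \left(X^{\top}X\right)^{-1}X^{\top}Y,
\]
and a word that is missing from the training set, with the word2vec
vector $\vec{x}$, receives the approximate task-specific
vector~$W^{*\top}\vec{x}$.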

As suggested by the results in Table~\ref{snt-cgsa:tbl:dl-res-lstsq},
these modifications improve the results even further, setting a new
record for the macro-averaged \F-score on the PotTS corpus (0.69~\F)
and pushing our LBA$^{(1)}$ system above even its strongest
competitors.  A similar effect is also observed with other systems,
most notably BAZ and LBA$^{(2)}$, which yield similarly good results
for all polarities.  Nevertheless, as in the previous case, these
improvements usually cause a drop in the recall of the most frequent
class of the respective dataset, which is especially severe for the
system of \citeauthor{Baziotis:17} on the PotTS data ($-0.28$ on
positive messages) and the LBA$^{(1)}$ approach on the SB10k test set
($-0.17$ on neutral tweets).  Furthermore, the convolutional system of
\citeauthor{Severyn:15} and the recursive neural tensor approach
of~\citet{Socher:13} fail to predict any negative tweet on PotTS and
SB10k, respectively, which also lowers their overall macro-averaged
\F-scores.  These drops, however, are rather exceptional, as the same
system of \citeauthor{Severyn:15} shows an extraordinarily big boost
on the SB10k corpus ($+0.45$ macro-\F{} and $+0.1$ micro-\F), and the
macro-averaged \F-scores of all recurrent methods become roughly 1.7
to 2.1 times as high as with randomly initialized word vectors.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first column
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.61      0.61      0.61       680
      %% negative       0.22      0.01      0.03       287
      %% neutral       0.48      0.72      0.57       558
      %% avg / total       0.49      0.54      0.49      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 31.92%
      %% Micro-Averaged F1-Score (All Classes): 53.6393%
      RAE & 0.61\negdelta{0.03} & 0.61\negdelta{0.17} & 0.61\negdelta{0.09} & %
      0.22\negdelta{0.16} & 0.01\negdelta{0.03} & 0.03\negdelta{0.05} & %
      0.48\negdelta{0.09} & 0.72\posdelta{0.04} & 0.57\negdelta{0.05} & %
      0.32\negdelta{0.07} & 0.54\negdelta{0.07}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      0.82      0.59       680
      %% negative       0.24      0.06      0.10       287
      %% neutral       0.43      0.17      0.24       558
      %% avg / total       0.41      0.44      0.37      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 34.27%
      %% Micro-Averaged F1-Score (All Classes): 44.0656%
      RNTN & 0.45 & 0.82\negdelta{0.05} & 0.59 & %
      0.24\posdelta{0.05} & 0.06\posdelta{0.04} & 0.1\posdelta{0.07} & %
      0.43\posdelta{0.09} & 0.17\posdelta{0.07} & 0.24\posdelta{0.09} & %
      0.34\posdelta{0.03} & 0.44\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.73      0.74      0.74       680
      %% negative       0.00      0.00      0.00       287
      %% neutral       0.56      0.84      0.68       558
      %% avg / total       0.53      0.64      0.58      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 36.86%
      %% Micro-Averaged F1-Score (All Classes): 64.0000%
      SEV & 0.73 & 0.74\negdelta{0.05} & 0.74\negdelta{0.02} & %
      0.0\negdelta{0.41} & 0.0\negdelta{0.52} & 0.0\negdelta{0.46} & %
      0.56\negdelta{0.16} & 0.84\posdelta{0.29} & 0.68\posdelta{0.06} & %
      0.37\negdelta{0.24} & 0.64\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.82      0.72      0.77       680
      %% negative       0.62      0.49      0.55       287
      %% neutral       0.68      0.85      0.76       558
      %% avg / total       0.73      0.73      0.72      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 65.79%
      %% Micro-Averaged F1-Score (All Classes): 72.5902%
      BAZ & 0.82\posdelta{0.37} & 0.72\negdelta{0.28} & 0.77\posdelta{0.15} & %
      0.62\posdelta{0.62} & 0.49\posdelta{0.49} & 0.55\posdelta{0.55} & %
      0.68\posdelta{0.68} & 0.85\posdelta{0.85} & 0.76\posdelta{0.76} & %
      0.66\posdelta{0.35} & 0.73\posdelta{0.28}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.76      0.84      0.79       680
      %% negative       0.60      0.56      0.58       287
      %% neutral       0.75      0.68      0.72       558
      %% avg / total       0.73      0.73      0.73      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 68.80%
      %% Micro-Averaged F1-Score (All Classes): 72.7869%
      LBA$^{(1)}$ & 0.76\negdelta{0.06} & 0.84\posdelta{0.11} & 0.79\posdelta{0.02} & %
      0.6\posdelta{0.6} & 0.56\posdelta{0.56} & 0.58\posdelta{0.58} & %
      0.75\posdelta{0.19} & 0.68\negdelta{0.24} & 0.72\posdelta{0.03} & %
      0.69\posdelta{0.3} & 0.73\posdelta{0.07}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.84      0.73      0.78       680
      %% negative       0.57      0.48      0.53       287
      %% neutral       0.66      0.82      0.73       558
      %% avg / total       0.72      0.72      0.71      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 65.19%
      %% Micro-Averaged F1-Score (All Classes): 71.5410%
      LBA$^{(2)}$ & 0.84\posdelta{0.39} & 0.73\negdelta{0.27} & 0.78\posdelta{0.16} & %
      0.57\posdelta{0.57} & 0.48\posdelta{0.48} & 0.53\posdelta{0.53} & %
      0.66\posdelta{0.66} & 0.82\posdelta{0.82} & 0.73\posdelta{0.73} & %
      0.65\posdelta{0.34} & 0.72\posdelta{0.27}\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.50      0.73      0.59       354
      %% negative       0.35      0.06      0.10       212
      %% neutral       0.80      0.80      0.80       930
      %% avg / total       0.66      0.68      0.65      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 34.89%
      %% Micro-Averaged F1-Score (All Classes): 68.1818%
      RAE & 0.5\negdelta{0.13} & 0.73\posdelta{0.16} & 0.59\negdelta{0.01} & %
      0.35\posdelta{0.35} & 0.06\posdelta{0.06} & 0.1\posdelta{0.1} & %
      0.8\posdelta{0.05} & 0.8\negdelta{0.14} & 0.8\negdelta{0.03} & %
      0.35\posdelta{0.05} & 0.68\negdelta{0.04}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.1658%
      RNTN & 0.0\negdelta{0.02} & 0.0\negdelta{0.03} & 0.0\negdelta{0.05} & %
      0.0\negdelta{0.07} & 0.0\negdelta{0.01} & 0.0\negdelta{0.02} & %
      0.62 & 1.0\posdelta{0.06} & 0.77\posdelta{0.02} & %
      0.0\negdelta{0.03} & 0.62\posdelta{0.03}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.64      0.58      0.61       354
      %% negative       0.51      0.21      0.30       212
      %% neutral       0.76      0.89      0.82       930
      %% avg / total       0.70      0.72      0.70      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 45.42%
      %% Micro-Averaged F1-Score (All Classes): 72.2594%
      SEV & 0.64\posdelta{0.64} & 0.58\posdelta{0.58} & 0.61\posdelta{0.61} & %
      0.51\posdelta{0.51} & 0.21\posdelta{0.21} & 0.3\posdelta{0.3} & %
      0.76\posdelta{0.14} & 0.89\negdelta{0.11} & 0.82\posdelta{0.05} & %
      0.45\posdelta{0.45} & 0.72\posdelta{0.1}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.72      0.59      0.65       354
      %% negative       0.53      0.33      0.41       212
      %% neutral       0.79      0.91      0.84       930
      %% avg / total       0.74      0.75      0.74      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 52.92%
      %% Micro-Averaged F1-Score (All Classes): 75.2005%
      BAZ & 0.72\posdelta{0.03} & 0.59\posdelta{0.12} & 0.65\posdelta{0.07} & %
      0.53\posdelta{0.53} & 0.33\posdelta{0.33} & 0.41\posdelta{0.41} & %
      0.79\posdelta{0.08} & 0.91\negdelta{0.07} & 0.84\posdelta{0.01} & %
      0.53\posdelta{0.24} & 0.75\posdelta{0.03}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.60      0.72      0.66       354
      %% negative       0.47      0.42      0.44       212
      %% neutral       0.84      0.80      0.82       930
      %% avg / total       0.73      0.73      0.73      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 55.00%
      %% Micro-Averaged F1-Score (All Classes): 72.6604%
      LBA$^{(1)}$ & 0.6\negdelta{0.12} & 0.72\posdelta{0.14} & 0.66\posdelta{0.02} & %
      0.47\posdelta{0.47} & 0.42\posdelta{0.42} & 0.44\posdelta{0.44} & %
      0.84\posdelta{0.1} & 0.8\negdelta{0.17} & 0.82\posdelta{0.02} & %
      0.55\posdelta{0.23} & 0.73\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.72      0.57      0.64       354
      %% negative       0.55      0.39      0.46       212
      %% neutral       0.79      0.90      0.84       930
      %% avg / total       0.74      0.75      0.74      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 54.77%
      %% Micro-Averaged F1-Score (All Classes): 75.1337%
      LBA$^{(2)}$ & 0.72\negdelta{0.04} & 0.57\posdelta{0.08} & 0.64\posdelta{0.04} & %
      0.55\posdelta{0.55} & 0.39\posdelta{0.39} & 0.46\posdelta{0.46} & %
      0.79\posdelta{0.07} & 0.9\negdelta{0.08} & 0.84\posdelta{0.01} & %
      0.55\posdelta{0.25} & 0.75\posdelta{0.03}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of DL-based MLSA methods with least-squares
      embeddings]{Results of deep-learning--based MLSA methods with
      least-squares embeddings}\label{snt-cgsa:tbl:dl-res-lstsq}
  \end{center}
\end{table}

\subsection{Error Analysis}

Before we proceed with the evaluation of the second factor (larger
training set), let us first analyze some errors that were specific to
each of the classifiers trained with the least-squares embeddings.

Since interpreting the results of deep learning systems is difficult
due to the large number of model parameters and the non-obvious
correlations between them, we decided to use the \textsc{Lime}
package~\cite{Ribeiro:16}, a recently proposed model-agnostic
interpretation tool, to get a better intuition about the reasons for
the classifiers' decisions.  To derive an explanation for a particular
prediction, \textsc{Lime} randomly removes or perturbs parts of the
input (in our case, tokens), estimates which of these modifications
lead to the biggest changes in the output, and assigns corresponding
class-specific association scores to each of the changed parts.  The
higher this score, the more predictive the given feature is for that
particular label.  For better readability, we have highlighted all
tokens that, according to \textsc{Lime}, were associated with the
neutral class in white, marked negative attributes with a
\colorbox{blue!30}{blue} background, and highlighted positively
connoted words in \colorbox{green!30}{green}, with a more intense
color indicating a stronger association.
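
As a rough formal sketch of this procedure (given in the notation
of~\cite{Ribeiro:16}, not in the one used elsewhere in this chapter),
\textsc{Lime} explains the prediction of a classifier $f$ for an
instance $x$ by fitting a simple surrogate model $g$ from a family of
interpretable models $G$ on perturbed samples $z$ of $x$:
\[
  \xi(x) = \mathop{\mathrm{argmin}}_{g \in G}\;
  \mathcal{L}(f, g, \pi_x) + \Omega(g),
  \qquad
  \mathcal{L}(f, g, \pi_x) = \sum_{z, z'} \pi_x(z)\,\bigl(f(z) - g(z')\bigr)^2,
\]
where $z'$ is the binary representation of $z$ that encodes which
tokens have been kept, $\pi_x(z)$ weights each sample by its proximity
to $x$, and $\Omega(g)$ penalizes overly complex explanations.
Assuming that $g$ is a sparse linear model, which is the standard
choice in~\cite{Ribeiro:16}, the class-specific association scores of
the tokens correspond to the coefficients of~$g$.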

The first incorrect prediction shown in
Example~\ref{snt:cgsa:exmp:rae-error} was made by the RAE system
of~\citet{Socher:11}.  As we can see from the visualization, the model
correctly recognized the positive term ``gef\"allt'' (\emph{to like}),
but, unfortunately, this word is the only one that contributes to the
right decision, and its learned weight is evidently not enough to
outweigh the effect of multiple neutral and negative items, such as
``Gr\"un'' (\emph{green}), ``Schwarz'' (\emph{black}), and, most
surprisingly, the emoticon placeholder ``\%PosSmiley,'' which is
unexpectedly more strongly associated with the negative semantic
orientation than with the positive class.  As a consequence, the
classifier erroneously predicts the \textsc{neutral} label for the
whole message, succumbing to the prevalence of allegedly objective
terms.

\begin{example}[An Error Made by the RAE System]\label{snt:cgsa:exmp:rae-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    \colorbox{white!5}{Gr\"un}\colorbox{blue!11}{-}\colorbox{white!25.7}{Schwarz}
    \colorbox{white!5}{in} \colorbox{white!2.5!blue!1}{meinem}
    Bundesland. \colorbox{green!10}{Gef\"allt}
    \colorbox{white!5!blue!4.5}{mir} \colorbox{white!5!blue!5}{doch}
    \colorbox{white!4.6!blue!5}{sehr}
    \colorbox{white!5!blue!5}{\%PosSmiley}}\\
  \noindent \colorbox{white!5}{Green}\colorbox{blue!11}{-}\colorbox{white!25.7}{Black} \colorbox{white!5}{in} \colorbox{white!2.5!blue!1}{my} state.  \colorbox{white!5!blue!5}{Yet}, \colorbox{white!5!blue!4.5}{I} \colorbox{green!10}{like} it \colorbox{white!4.6!blue!5}{so much} \colorbox{white!5!blue!5}{\%PosSmiley}\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

A similar situation is also observed with the recursive neural tensor
network (RNTN), whose sample error is shown in
Example~\ref{snt:cgsa:exmp:socher13-error}.  As we can see from the
analysis, the bias towards the neutral class is even more pronounced
this time, as virtually all of the terms in the tweet are highlighted
in white.  The only word that shows a minimal negative connotation is
``tumblr,'' which indeed appeared twice in a negative tweet, two times
in neutral messages, and once in a positive microblog in the training
corpus.  Nonetheless, even for this term, the association with the
neutral orientation is still almost ten times stronger than its
association with the negative polarity ($\expnumber{1.4}{-4}$ versus
$\expnumber{1.5}{-5}$), which can be explained by the general
prevalence of neutral messages in SB10k.

\begin{example}[An Error Made by the RNTN System]\label{snt:cgsa:exmp:socher13-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet: }} {\upshape
    \colorbox{blue!1.4!white!5}{tumblr} \colorbox{white!19}{people} \colorbox{white!32}{sind} \colorbox{white!5}{meine} \colorbox{white!30}{lieblings} \colorbox{white!19}{people} \colorbox{white!8}{\%PosSmiley}}\\
  \noindent \colorbox{blue!1.4!white!5}{tumblr} \colorbox{white!19}{people} \colorbox{white!32}{are} \colorbox{white!5}{my} \colorbox{white!30}{favorite} \colorbox{white!19}{people} \colorbox{white!8}{\%PosSmiley}\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{green3}{positive}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

A slightly different behavior is shown by the method
of~\citet{Severyn:15} on the PotTS corpus (see
Example~\ref{snt:cgsa:exmp:severyn-error}).  This time, we can see at
least two clearly positive words (``ist'' [\emph{is}] and ``Freund''
[\emph{friend}]).  However, the former of these terms is a copular
verb, which can hardly express any polarity, since it merely links the
subject to its predicate and lacks any distinct lexical meaning.  The
latter word (``Freund'' [\emph{friend}]), in contrast, indeed conveys
a positive feeling of its prepositional argument (``Iran'') towards
the subject of the sentence (``Syrien'' [\emph{Syria}]), but this
positive effect is nullified by the author's statement that this
friendship poses a problem.  Unfortunately, the word ``Problem''
(\emph{problem}) is recognized only as a neutral marker, just like
many other terms in this microblog.

\begin{example}[An Error Made by the SEV System]\label{snt:cgsa:exmp:severyn-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    \colorbox{white!7.4}{Syrien} \colorbox{green!8}{ist}
    \colorbox{green!11}{Freund} \colorbox{white!6.6}{von}
    \colorbox{white!40.5}{Iran}, das \colorbox{green!8}{ist}
    \colorbox{white!12.7}{das}
    \colorbox{green!1.5}{Problem}\colorbox{green!3}{!}
    \colorbox{blue!0.000005!white!8}{annewill}}\\
  \noindent \colorbox{white!7.4}{Syria} \colorbox{green!8}{is} a \colorbox{green!11}{friend} \colorbox{white!6.6}{of} \colorbox{white!40.5}{Iran}. That\colorbox{green!8}{'s} \colorbox{white!12.7}{the} \colorbox{green!1.5}{problem}\colorbox{green!3}{!} \colorbox{blue!0.000005!white!8}{annewill}\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{midnightblue}{negative}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{black}{neutral*}}
\end{example}

In Example~\ref{snt:cgsa:exmp:baziotis-error}, we can see an error
made by the system of~\citet{Baziotis:17}.  This time, again, we
observe the prevalence of positive and neutral items, with the only
notable exception being the possessive pronoun ``meinen''
(\emph{my}), which, according to the classifier, indicates negative
polarity.  Apart from this term, we can also notice several
inaccuracies in recognizing positive and neutral features: For
example, the pronominal adverb ``darin'' (\emph{in it}) is the
strongest positive trait, whose predictiveness is even higher than the
scores of the words ``singen'' (\emph{to sing}) and ``Liebeslied''
(\emph{love song}).  This contradicts the fact that pronominal adverbs
by themselves do not express any semantic orientation, all the more so
as, in this case, the antecedent of the adverb (the noun
``Kleiderschrank'' [\emph{wardrobe}]) is recognized as a neutral item.
On the other hand, the modal verb ``wollte'' (\emph{wanted}) is
considered an objective term, although it has a slight positive
connotation, as it expresses a wish of the author.

\begin{example}[An Error Made by the BAZ System]\label{snt:cgsa:exmp:baziotis-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape \colorbox{white!1.4}{Wollte} \colorbox{blue!7.7}{meinen} \colorbox{white!3.6}{Kleiderschrank} \colorbox{blue!1.7}{aufr\"aumen} \ldots \colorbox{white!1.2}{sitze} \colorbox{green!4.6}{nun} \colorbox{green!31.5}{darin} \colorbox{green!2.8}{und} \colorbox{green!29.7}{singe} \colorbox{green!15.2}{Liebeslieder}\ldots}\\
  \noindent \colorbox{white!1.4}{Wanted} to \colorbox{blue!1.7}{clean up} \colorbox{blue!7.7}{my} \colorbox{white!3.6}{wardrobe}\ldots \colorbox{green!4.6}{Now} \colorbox{white!1.2}{sitting} \colorbox{green!31.5}{in it} \colorbox{green!2.8}{and} \colorbox{green!29.7}{singing} \colorbox{green!15.2}{love songs}\ldots\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{black}{neutral}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}
\end{example}

Finally, Example~\ref{snt:cgsa:exmp:lba-error} shows an incorrect
prediction of our lexicon-based attention system.  In contrast to the
previous two methods, the positive information is much more condensed
in this case and is represented by the single term ``super.''
Surprisingly, this term outweighs a whole range of neutral features,
such as ``gerade'' (\emph{right now}), ``Lust haben'' (\emph{to be up
  for}), ``was'' (\emph{something}), etc.  Admittedly, the first part
of this message indeed expresses a positive attitude of the author,
but this effect is invalidated by the second clause, which shows that
this wish cannot be fulfilled.

\begin{example}[An Error Made by the LBA System]\label{snt:cgsa:exmp:lba-error}
  \noindent\textup{\bfseries\textcolor{darkred}{Tweet:}} {\upshape
    \colorbox{green!0.5!blue!0.4}{Gerade} \colorbox{green!89}{super} \colorbox{blue!0.3}{Lust}, mit \colorbox{white!2}{Carls} Haaren \colorbox{white!0.6}{was} zu \colorbox{green!1}{machen} \colorbox{green!0.3}{aber} \colorbox{white!2}{ca} 300 \colorbox{white!1}{km}
    \colorbox{white!1}{Distanz} halten \colorbox{blue!0.3}{mich} davon \colorbox{white!1}{ab}.}\\
  \noindent\colorbox{green!89}{Super} \colorbox{blue!0.3}{up for}
  \colorbox{green!1}{doing} \colorbox{white!0.6}{something} with
  \colorbox{white!2}{Carl}'s hair \colorbox{green!0.5!blue!0.4}{right
    now}, \colorbox{green!0.3}{but} \colorbox{white!2}{ca.} 300
  \colorbox{white!1}{km} of \colorbox{white!1}{distance} keep
  \colorbox{blue!0.3}{me} \colorbox{white!1}{away} from
  this.\\[\exampleSep]
  \noindent\textup{\bfseries\textcolor{darkred}{Gold Label:}}\hspace*{4.3em}\textbf{%
    \upshape\textcolor{black}{neutral}}\\
 \noindent\textup{\bfseries\textcolor{darkred}{Predicted Label:}}\hspace*{2em}\textbf{%
    \upshape\textcolor{green3}{positive*}}
\end{example}


\section{Evaluation}

Now that we have familiarized ourselves with the peculiarities and
results of the most prominent sentiment analysis approaches from all
method groups (lexicon-, machine-learning--, and deep-learning--based
ones), let us have a closer look at how changing common parameters of
these methods affects their performance.  In particular, we would like
to see whether increasing the amount of training data, switching to a
different type of sentiment lexicon, or using unnormalized text as
input improves or, conversely, lowers the classification scores.

\subsection{Weak Supervision}

The first avenue that we are going to explore in this evaluation is
the effect of weakly supervised data---an additional collection of
training tweets that have been automatically labeled with sentiment
tags based on the occurrence of some sufficiently reliable formal
criteria, such as emoticons or hashtags.

Among the first to propose training a sentiment classifier on a larger
corpus of automatically annotated messages was \citet{Read:05}, who
gathered a set of 766,000 Usenet posts containing frownies or smileys,
assigned a polarity label to each of these posts based on the type of
its emoticons, and subsequently used a subset of these documents
(22,000 posts) to optimize a Na\"{\i}ve Bayes and an SVM system.  Even
though these classifiers achieved a considerable accuracy (up to 70\%)
at predicting the noisy labels of the remaining posts, they did not
generalize to texts from other genres (movie reviews and newswire
articles), where they hardly outperformed the random-chance baseline.
With the onset of the Twitter era, the idea of weak supervision
experienced a renaissance with the works of \citet{Go:09},
\citet{Pak:10}, and \citet{Barbosa:10}.

In order to check the effect of such noisily annotated data on our
tested methods, we automatically labeled all messages from the German
Twitter Snapshot~\cite{Scheffler:14} based on the occurrences of
smileys: In particular, we considered a microblog as positive if its
normalized version contained the token \texttt{\%PositiveSmiley} and
no other emoticons.  Likewise, we regarded a message as negative if
the only emoticon in this tweet was \texttt{\%NegativeSmiley}.  We
skipped all posts that contained both types of smileys, and assigned
the rest of the messages to the neutral class. (A detailed breakdown
of the final distribution is given in
Table~\ref{snt-cgsa:tbl:corp-dist} at the beginning of this chapter.)
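
Under the simplifying assumption that every emoticon has been
normalized to one of the two placeholder tokens, this labeling scheme
can be summarized as follows, where $n^{+}(m)$ and $n^{-}(m)$ are an
ad-hoc notation (introduced only for this illustration) for the number
of \texttt{\%PositiveSmiley} and \texttt{\%NegativeSmiley} tokens in
the normalized message~$m$:
\[
  \mathrm{label}(m) = \left\{
  \begin{array}{ll}
    \mathrm{positive} & \mbox{if } n^{+}(m) > 0 \mbox{ and } n^{-}(m) = 0,\\
    \mathrm{negative} & \mbox{if } n^{-}(m) > 0 \mbox{ and } n^{+}(m) = 0,\\
    \mathrm{neutral} & \mbox{if } n^{+}(m) = n^{-}(m) = 0,\\
    \mbox{(skipped)} & \mbox{otherwise.}
  \end{array}\right.
\]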

Since it was impossible to utilize the whole snapshot for training due
to limited computational resources (merely reading the dataset into
memory would require 9.3~GB of RAM, not to mention the space required
for storing the embeddings and features), we confined ourselves to one
sixth of these data, which still resulted in 4~million messages.
Furthermore, to mitigate the extreme skewness of this corpus, we
downsampled positive and neutral tweets to get an equal number of
instances for all classes (59,000 microblogs for each polarity) and
used these examples in addition to the manually annotated PotTS and
SB10k tweets.
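
As a back-of-the-envelope check of these figures (our own arithmetic,
not an additional measurement), the balanced sample comprises
\[
  3 \times 59{,}000 = 177{,}000
\]
noisily labeled tweets, \ie{} roughly
$177{,}000 / 4{,}000{,}000 \approx 4.4\%$ of the messages in the
selected part of the snapshot.  Since only positive and neutral tweets
were downsampled, the size of each class is presumably bounded by the
number of negative messages found in this part of the data.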

Since lexicon-based approaches were mostly independent of the training
set, we decided to rerun our experiments only with the ML-based and
the fastest DL-based methods (RAE, SEV, BAZ, and LBA),\footnote{In all
  subsequent evaluation experiments with DL-based systems, we will use
  pre-trained word2vec vectors (if applicable) with the least-squares
  fallback, and compare the results of these approaches to the
  respective scores in Table~\ref{snt-cgsa:tbl:dl-res-lstsq}.} which
still incurred running times of up to five days for some systems.  The
results of this evaluation are shown in
Table~\ref{snt-cgsa:tbl:weak-supervision}.

\begin{table}[h]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first column
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\
      %% (With balancing)
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.80      0.34      0.48       680
      %% negative       0.20      0.29      0.24       287
      %% neutral       0.53      0.79      0.63       558
      %% avg / total       0.59      0.49      0.49      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 35.67%
      %% Micro-Averaged F1-Score (All Classes): 49.4426%

      GMN & 0.8\posdelta{0.13} & 0.34\negdelta{0.39} & 0.48\negdelta{0.22} & %
       0.2\negdelta{0.15} & 0.29\posdelta{0.14} & 0.24\negdelta{0.03} & %
       0.53\negdelta{0.07} & 0.79\posdelta{0.07} & 0.63\negdelta{0.03} & %
       0.36\negdelta{0.01} & 0.49\negdelta{0.12}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.86      0.59      0.70       680
      %% negative       0.31      0.39      0.35       287
      %% neutral       0.55      0.68      0.61       558
      %% avg / total       0.64      0.59      0.60      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 52.30%
      %% Micro-Averaged F1-Score (All Classes): 58.6230%

      MHM & 0.86\posdelta{0.07} & 0.59\negdelta{0.18} & 0.7\negdelta{0.08} & %
      0.31\negdelta{0.27} & 0.39\negdelta{0.17} & 0.35\negdelta{0.22} & %
      0.55\negdelta{0.18} & 0.68\negdelta{0.08} & 0.61\negdelta{0.13} & %
      0.52\negdelta{0.15} & 0.59\negdelta{0.14}\\

      %% (With balancing)
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.86      0.60      0.71       680
      %% negative       0.26      0.31      0.28       287
      %% neutral       0.53      0.68      0.59       558
      %% avg / total       0.63      0.57      0.59      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 49.62%
      %% Micro-Averaged F1-Score (All Classes): 57.3770%

      GNT & 0.86\posdelta{0.15} & 0.6\negdelta{0.2} & 0.71\negdelta{0.04} & %
      0.26\negdelta{0.29} & 0.31\negdelta{0.14} & 0.28\negdelta{0.22} & %
      0.53\negdelta{0.15} & 0.68\negdelta{0.05} & 0.59\negdelta{0.06} & %
      0.5\negdelta{0.12} & 0.57\negdelta{0.1}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.68      0.31      0.43       680
      %% negative       0.25      0.46      0.32       287
      %% neutral       0.49      0.61      0.54       558
      %% avg / total       0.53      0.45      0.45      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 37.66%
      %% Micro-Averaged F1-Score (All Classes): 44.8525%
      RAE & 0.68\posdelta{0.07} & 0.31\negdelta{0.3} & 0.43\negdelta{0.18} & %
      0.25\posdelta{0.03} & 0.46\posdelta{0.45} & 0.32\posdelta{0.29} & %
      0.49\posdelta{0.01} & 0.61\negdelta{0.11} & 0.54\negdelta{0.03} & %
      0.38\posdelta{0.06} & 0.45\negdelta{0.09}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.87      0.51      0.64       680
      %% negative       0.27      0.49      0.35       287
      %% neutral       0.55      0.58      0.56       558
      %% avg / total       0.64      0.53      0.56      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 49.38%
      %% Micro-Averaged F1-Score (All Classes): 53.2459%
      SEV & 0.87\posdelta{0.14} & 0.51\negdelta{0.23} & 0.64\negdelta{0.1} & %
      0.27\posdelta{0.27} & 0.49\posdelta{0.49} & 0.35\posdelta{0.35} & %
      0.55\negdelta{0.01} & 0.58\negdelta{0.26} & 0.56\negdelta{0.12} & %
      0.49\posdelta{0.12} & 0.53\negdelta{0.11}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       680
      %% negative       0.19      1.00      0.32       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.04      0.19      0.06      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 15.84%
      %% Micro-Averaged F1-Score (All Classes): 18.8197%
      BAZ & 0.0\negdelta{0.82} & 0.0\negdelta{0.72} & 0.0\negdelta{0.77} & %
      0.19\negdelta{0.43} & 1.0\posdelta{0.51} & 0.32\negdelta{0.23} & %
      0.0\negdelta{0.68} & 0.0\negdelta{0.85} & 0.0\negdelta{0.76} & %
      0.16\negdelta{0.5} & 0.19\negdelta{0.43}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.48      0.88      0.62       680
      %% negative       0.25      0.23      0.24       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.26      0.44      0.32      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 43.00%
      %% Micro-Averaged F1-Score (All Classes): 43.6721%
      LBA$^{(1)}$ & 0.48\negdelta{0.28} & 0.88\posdelta{0.04} & 0.62\negdelta{0.17} & %
      0.25\negdelta{0.35} & 0.23\negdelta{0.33} & 0.24\negdelta{0.34} & %
      0.0\negdelta{0.75} & 0.0\negdelta{0.68} & 0.0\negdelta{0.72} & %
      0.43\negdelta{0.26} & 0.44\negdelta{0.29}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.91      0.08      0.14       680
      %% negative       0.19      0.99      0.32       287
      %% neutral       0.00      0.00      0.00       558
      %% avg / total       0.44      0.22      0.12      1525
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 23.29%
      %% Micro-Averaged F1-Score (All Classes): 22.0984%
      LBA$^{(2)}$ & 0.91\posdelta{0.07} & 0.08\negdelta{0.65} & 0.14\negdelta{0.64} & %
      0.19\negdelta{0.38} & 0.99\posdelta{0.51} & 0.32\negdelta{0.21} & %
      0.0\negdelta{0.66} & 0.0\negdelta{0.82} & 0.0\negdelta{0.73} & %
      0.23\negdelta{0.42} & 0.22\negdelta{0.5}\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\

      %% General Statistics: (with balancing)
      %% precision    recall  f1-score   support
      %% positive       0.71      0.27      0.40       354
      %% negative       0.24      0.11      0.15       212
      %% neutral       0.71      0.96      0.82       930
      %% avg / total       0.64      0.68      0.62      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 27.42%
      %% Micro-Averaged F1-Score (All Classes): 67.9144%

      GMN & 0.71\posdelta{0.06} & 0.27\negdelta{0.18} & 0.4\negdelta{0.13} & %
      0.24\negdelta{0.14} & 0.11\posdelta{0.03} & 0.15\posdelta{0.02} & %
      0.71\negdelta{0.01} & 0.96\posdelta{0.03} & 0.82\negdelta{0.01} & %
      0.27\negdelta{0.06} & 0.68\negdelta{0.02}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.77      0.40      0.53       354
      %% negative       0.61      0.10      0.18       212
      %% neutral       0.71      0.97      0.82       930
      %% avg / total       0.71      0.71      0.66      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 35.13%
      %% Micro-Averaged F1-Score (All Classes): 71.1898%
      MHM & 0.77\posdelta{0.06} & 0.4\negdelta{0.25} & 0.53\negdelta{0.15} & %
      0.61\negdelta{0.1} & 0.1\negdelta{0.3} & 0.18\negdelta{0.27} & %
      0.71\negdelta{0.09} & 0.97\posdelta{0.1} & 0.82\negdelta{0.02} & %
      0.35\negdelta{0.21} & 0.71\negdelta{0.04}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.77      0.39      0.52       354
      %% negative       0.25      0.13      0.17       212
      %% neutral       0.71      0.92      0.80       930
      %% avg / total       0.66      0.68      0.64      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 34.42%
      %% Micro-Averaged F1-Score (All Classes): 68.1150%

      GNT & 0.77\posdelta{0.1} & 0.39\negdelta{0.23} & 0.52\negdelta{0.12} & %
      0.25\negdelta{0.19} & 0.13\negdelta{0.15} & 0.17\negdelta{0.17} & %
      0.71\negdelta{0.07} & 0.92\posdelta{0.05} & 0.8\negdelta{0.02} & %
      0.34\negdelta{0.15} & 0.68\negdelta{0.04}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.44      0.27      0.34       354
      %% negative       0.24      0.59      0.34       212
      %% neutral       0.78      0.62      0.69       930
      %% avg / total       0.62      0.54      0.56      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 33.70%
      %% Micro-Averaged F1-Score (All Classes): 53.6096%
      RAE & 0.44\negdelta{0.06} & 0.27\negdelta{0.46} & 0.34\negdelta{0.25} & %
      0.24\negdelta{0.11} & 0.59\posdelta{0.53} & 0.34\posdelta{0.24} & %
      0.78\negdelta{0.02} & 0.62\negdelta{0.18} & 0.69\negdelta{0.11} & %
      0.34\negdelta{0.01} & 0.54\negdelta{0.14}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.64      0.39      0.49       354
      %% negative       0.34      0.12      0.18       212
      %% neutral       0.70      0.90      0.78       930
      %% avg / total       0.63      0.67      0.63      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 33.30%
      %% Micro-Averaged F1-Score (All Classes): 66.8449%
      SEV & 0.64 & 0.39\negdelta{0.19} & 0.49\negdelta{0.12} & %
      0.34\negdelta{0.17} & 0.12\negdelta{0.09} & 0.18\negdelta{0.12} & %
      0.7\negdelta{0.06} & 0.9\posdelta{0.01} & 0.78\negdelta{0.04} & %
      0.33\negdelta{0.12} & 0.67\negdelta{0.05}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.24      1.00      0.38       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.00      0.00      0.00       930
      %% avg / total       0.06      0.24      0.09      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 19.14%
      %% Micro-Averaged F1-Score (All Classes): 23.6631%
      BAZ & 0.24\negdelta{0.48} & 1.0\posdelta{0.41} & 0.38\negdelta{0.27} & %
      0.0\negdelta{0.53} & 0.0\negdelta{0.33} & 0.0\negdelta{0.41} & %
      0.0\negdelta{0.79} & 0.0\negdelta{0.91} & 0.0\negdelta{0.84} & %
      0.19\negdelta{0.34} & 0.24\negdelta{0.51}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.64      0.43      0.52       354
      %% negative       0.59      0.09      0.16       212
      %% neutral       0.71      0.93      0.80       930
      %% avg / total       0.68      0.69      0.64      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 33.59%
      %% Micro-Averaged F1-Score (All Classes): 69.4519%
     LBA$^{(1)}$ & 0.64\posdelta{0.04} & 0.43\negdelta{0.29} & 0.52\negdelta{0.14} & %
      0.59\posdelta{0.12} & 0.09\negdelta{0.33} & 0.16\negdelta{0.28} & %
      0.71\negdelta{0.13} & 0.93\posdelta{0.13} & 0.8\negdelta{0.02} & %
      0.34\negdelta{0.21} & 0.69\negdelta{0.04}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.14      1.00      0.25       212
      %% neutral       0.00      0.00      0.00       930
      %% avg / total       0.02      0.14      0.04      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 12.41%
      %% Micro-Averaged F1-Score (All Classes): 14.1711%
      LBA$^{(2)}$ & 0.0\negdelta{0.72} & 0.0\negdelta{0.57} & 0.0\negdelta{0.64} & %
      0.14\negdelta{0.41} & 1.0\posdelta{0.61} & 0.25\negdelta{0.21} & %
      0.0\negdelta{0.79} & 0.0\negdelta{0.9} & 0.0\negdelta{0.84} & %
      0.12\negdelta{0.43} & 0.14\negdelta{0.61}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of MLSA methods with weak supervision]{
      Results of MLSA methods with weakly supervised data}\label{snt-cgsa:tbl:weak-supervision}
  \end{center}
\end{table}

As we can see from the scores, apart from an improved precision on
positive tweets and a higher recall of negative microblogs for some
systems, adding noisily labeled messages to the training set lowers
the micro-averaged \F{}-scores of all methods and the macro-averaged
results of most of them (the only exceptions being the macro-\F{} of
the RAE and SEV systems on PotTS).  The biggest drops are demonstrated
by the approach of~\citeauthor{Baziotis:17} ($-0.5$ macro-\F{} and
$-0.43$ micro-\F{} on the PotTS corpus; $-0.34$ macro-\F{} and $-0.51$
micro-\F{} on the SB10k dataset) and our own LBA$^{(2)}$ solution
($-0.42$ macro-\F{} and $-0.5$ micro-\F{} on the PotTS test set;
$-0.43$ macro-\F{} and $-0.61$ micro-\F{} on the SB10k data), which
both fail to predict any neutral message on PotTS and always assign
the same polarity to all SB10k tweets.  A less severe, but still
substantial degradation is also observed with the machine-learning
systems of \citeauthor{Mohammad:09} and \citeauthor{Guenther:13} as
well as our DL-based LBA$^{(1)}$ method, whose macro-averaged
\F{}-scores go down by 0.15, 0.12, and 0.26 points on the former
corpus and sink by 0.21, 0.15, and 0.21 points, respectively, on the
latter dataset.  The micro-averaged \F{}-results of these methods,
however, decrease to a much smaller degree, since the main drops
happen on the negative class, which is by far the least represented
polarity in both corpora.  The micro-averages of the remaining systems
seem to be affected even less, but are still worse than the results
obtained without the snapshot data.  We hypothesize that the main
reason for this decrease is a substantial difference between the class
distributions in the noisily annotated training tweets and the
manually labeled test sets, which overly biases the classifiers'
predictions.
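
To illustrate how a collapse onto a single class translates into the
reported scores, consider the system of~\citeauthor{Baziotis:17} on
PotTS, which assigns the negative label to every test message.  Using
the class counts of the PotTS test set (287 negative messages among
1,525 tweets) and writing $P^{-}$, $R^{-}$, and $F^{-}$ for the
precision, recall, and \F{}-score of the negative class (an ad-hoc
notation used only in this illustration), we get:
\[
  P^{-} = \frac{287}{1525} \approx 0.19, \qquad
  R^{-} = 1, \qquad
  F^{-} = \frac{2 P^{-} R^{-}}{P^{-} + R^{-}} \approx 0.32.
\]
Because the \F{}-score of the positive class is zero in this case, the
macro-average over the two polar classes amounts to
$(0 + 0.32) / 2 = 0.16$.  Moreover, since every message receives
exactly one label, the micro-averaged \F{} over all classes coincides
with accuracy and equals $287 / 1525 \approx 0.19$.  Both values agree
with the corresponding row of
Table~\ref{snt-cgsa:tbl:weak-supervision}.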


\subsection{Lexicons}\label{cgsa:subsec:eval:lexicons}

Another factor that can significantly affect the results of some
systems is the sentiment lexicon that these systems use either
directly, for computing the polarity of a message (\eg{} lexicon-based
approaches), or indirectly, as features or attention scores (\eg{} ML-
and DL-based techniques).  To estimate the effect of this resource, we
successively replaced the lexicons that we used in our previous
experiments with other polarity lists presented in
Chapter~\ref{chap:snt:lex}, and recomputed the scores of the tested
systems.

As we can see in Figure~\ref{cgsa:fig:potts-lexicon-effect}, the
system of~\citet{Mohammad:13} and our own lexicon-based attention
approach clearly outperform all other competitors on the PotTS corpus
regardless of the lexicon they use.  The only method that comes at
least close to their results is the ML-based classifier
of~\citet{Guenther:14}, which is still almost 5\% below the average
macro-\F{} of these two classifiers.  The same also applies to the
micro-\F-scores, where the solution of~\citet{Guenther:14} loses
almost 3\% on average to the two top performers.  Regarding the
differences between MHM and LBA themselves, we can observe a
rather mixed picture: the approach of~\citet{Mohammad:13} yields
better macro-averaged \F{}-results with the lexicons
of~\citet{Esuli:05}, \citet{Vo:16}, and \citet{Clematide:10}, but
loses to LBA when used with the polarity lists
of~\citet{Blair-Goldensohn:08}, \citet{Waltinger:10}, \citet{Hu:04},
\citet{Kiritchenko:14}, \citet{Rao:09}, \citet{Takamura:05},
\citet{Tang:14}, and \citet{Velikovich:10} as well as the NWE-based
\textsc{LinProj} and \textsc{PCA} lexicons.  Moreover, when trained
with the polarity list of \citeauthor{Tang:14} and our
\textsc{LinProj} lexicon, the LBA system achieves the best overall
macro-\F{} on this corpus.

%% This figure was generated using the iPython notebook `notebooks/cgsa.ipynb`.
\begin{figure*}
{
\centering
\begin{subfigure}{.5\textwidth}
  \centering
  \includegraphics[width=\linewidth]{img/cgsa_potts_macro_lexicons.png}
  \caption{\texttt{Macro-\F{}}}\label{cgsa:fig:potts-lexicon-macro}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
  \centering
  \includegraphics[width=\linewidth]{img/cgsa_potts_micro_lexicons.png}
  \caption{\texttt{Micro-\F{}}}\label{cgsa:fig:potts-lexicon-micro}
\end{subfigure}
}
\caption[MLSA results on PotTS with different lexicons]{Results of
  MLSA methods with different lexicons on the PotTS
  corpus}\label{cgsa:fig:potts-lexicon-effect}
\end{figure*}

These results, however, look slightly different when we consider the
micro-averaged scores.  This time, the system
of~\citeauthor{Mohammad:13} outperforms our solution in eight out of
twenty cases, but performs worse than LBA with four other polarity
lists (GPC, KIR, RR$_{\textrm{mincut}}$, and VEL).  Nevertheless, our
approach still reaches the best overall observed score (0.73) with
three tested resources (GPC, \textsc{LinProj}, and VEL).

Regarding the performance of single lexicons, we can see that the best
results are achieved with the manually curated
\textsc{SentiWS}~\cite{Remus:10} and Zurich Polarity
List~\cite{Clematide:10}, followed by the dictionary-based approaches
of~\citet{Blair-Goldensohn:08} and \citet{Rao:09}.  The
nearest-centroids method, in contrast, appears to be of the lowest
utility for almost all systems, even though it demonstrated quite
acceptable scores in our initial intrinsic evaluation.

A similar situation also holds for the SB10k corpus, where the
ML-based approaches of~\citet{Mohammad:13} and \citet{Guenther:14} and
our proposed LBA system outperform all other methods in terms of both
macro- and micro-averaged \F{}-scores.  This time, however, the
average difference between the macro-results of LBA and GNT is much
smaller and amounts to only 0.02\% in favor of LBA, which again
achieves the best overall macro-\F{} (0.58) in combination with the
min-cut lexicon of~\citet{Rao:09}.  Unfortunately, our system clearly
falls behind the latter classifier with respect to the micro-averaged
scores, performing worse than it in 16 out of 20 experiments.

The effect of single lexicons is also less pronounced than in the
PotTS case, as all of the tested polarity lists show a more or less
similar behavior, especially regarding the macro-averaged \F{}-score.
In terms of the micro-\F{}, however, we can observe that
dictionary-based lists, especially those of~\citet{Awadallah:10},
\citet{Blair-Goldensohn:08}, \citet{Hu:04}, and \citet{Kim:04}, lead
to generally better scores than corpus- and NWE-based resources.

%% This figure was generated using the iPython notebook `notebooks/cgsa.ipynb`.
\begin{figure*}
{
\centering
\begin{subfigure}{.5\textwidth}
  \centering
  \includegraphics[width=\linewidth]{img/cgsa_sb10k_macro_lexicons.png}
  \caption{\texttt{Macro-\F}}\label{cgsa:fig:sb10k-lexicon-macro}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
  \centering
  \includegraphics[width=\linewidth]{img/cgsa_sb10k_micro_lexicons.png}
  \caption{\texttt{Micro-\F}}\label{cgsa:fig:sb10k-lexicon-micro}
\end{subfigure}
}
\caption[MLSA results on SB10k with different lexicons]{Results of MLSA
  methods with different lexicons on the SB10k
  corpus}\label{cgsa:fig:sb10k-lexicon-effect}
\end{figure*}


\subsection{Text Normalization}

Finally, the last aspect that we are going to analyze in this
evaluation is the effect of text normalization, which we applied
to the input messages before passing them to the classifiers.  To
verify the utility of this step, we reran all experiments from the
initial sections, using the original Twitter messages instead of their
preprocessed forms, and recalculated the results of the tested
systems.

As we can see from the figures in
Table~\ref{snt-cgsa:tbl:res-no-normalization}, switching off the
normalization has a strong negative effect on the scores of almost all
approaches except for the methods of~\citet{Yessenalina:10} and
\citet{Socher:12,Socher:13}, which keep predicting the majority class
in most cases, just as they did before.  Apart from this, we can
notice that the lexicon-based systems
(HL, JRK, KLCH, MST, and TBD) suffer the greatest loss in terms of
both macro- and micro-averaged \F{}-scores on the PotTS corpus (up to
$-0.25$ macro- and $-0.22$ micro-\F{}).  A closer look at their
errors revealed that this deterioration is mostly due to the increased
variety of different emoticons in the dataset (which were typically
unified during the preprocessing) and the absence of these forms in
the utilized polarity lists.  The second biggest quality drop is
demonstrated by the DL-based approaches BAZ, LBA$^{(1)}$, and
LBA$^{(2)}$, which apparently also got confused by the higher lexical
variety of the input and failed to optimize all their internal
parameters to properly fit this diversity.  The remaining DL- and
ML-based classifiers (especially those of MHM, GNT, and RNTN) seem to
be more resistant to the introduced changes, but still show a decrease
by up to 0.04 macro- and 0.08 micro-\F{}.  The only exception in this
case is the MVRNN system of~\citet{Socher:12}, which slightly improves
on the negative and neutral classes, thus escaping the majority-class
pitfall.  Unfortunately, this increase appears to be too small to
positively influence the overall statistics of this method.
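
To make the aggregate columns of
Table~\ref{snt-cgsa:tbl:res-no-normalization} more tangible, consider
the HL system on the unnormalized PotTS data as a worked example.
Assuming, as the column header \F{}$^{+/-}$ and the metrics from
Section~\ref{sec:cgsa:eval-metrics} suggest, that the macro-average is
taken over the positive and negative classes only, the per-class
scores of this system yield
\[
  F_1^{+/-} = \frac{F_1^{+} + F_1^{-}}{2} = \frac{0.40 + 0.36}{2} = 0.38,
\]
which agrees with the value reported in the table.  The micro-average
over all classes, in turn, coincides with accuracy in a single-label
setting.  Using the rounded per-class recalls and the class supports
of the PotTS test set recorded in our evaluation logs (691 positive,
296 negative, and 542 neutral messages), it can be approximated as
\[
  F_1^{\mathrm{micro}} \approx
  \frac{0.30 \cdot 691 + 0.29 \cdot 296 + 0.77 \cdot 542}{1529}
  \approx 0.46,
\]
which is again in line with the tabulated score; the small remaining
discrepancy stems from the rounding of the recall values.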

Regarding the breakdown of single polarity classes, we can see that
most of the rare improvements affect the recall of positive and
neutral messages, with the biggest gains demonstrated by the RAE and
RNTN approaches ($+0.37$ and $+0.11$, respectively).  Other positive
changes are fairly sporadic and produced by only a few classifiers
(most notably, MVRNN).  Nevertheless, even in these exceptional cases,
the improvements are typically so small that they hardly outweigh the
decreased scores on other aspects and have virtually no effect on the
net results for all classes.

\begin{table}[htb!]
  \begin{center}
    \bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
    \begin{tabular}{p{0.162\columnwidth} % first column
        *{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
        *{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
      \toprule
      \multirow{2}*{\bfseries Method} & %
      \multicolumn{3}{c}{\bfseries Positive} & %
      \multicolumn{3}{c}{\bfseries Negative} & %
      \multicolumn{3}{c}{\bfseries Neutral} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}$^{+/-}$} & %
      \multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}

      & Precision & Recall & \F{} & %
      Precision & Recall & \F{} & %
      Precision & Recall & \F{} & & \\\midrule

      \multicolumn{12}{c}{\cellcolor{cellcolor}PotTS}\\

      % Training hu-liu
      % Testing hu-liu
      % Evaluating hu-liu
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.63      0.30      0.40       691
      % negative       0.46      0.29      0.36       296
      % neutral       0.41      0.77      0.54       542
      % avg / total       0.52      0.46      0.44      1529
      % Macro-Averaged F1-Score (Positive and Negative Classes): 38.00%
      % Micro-Averaged F1-Score (All Classes): 46.4356%
      HL & 0.63\negdelta{0.12} & 0.3\negdelta{0.46} & 0.4\negdelta{0.36} & %
      0.46\negdelta{0.07} & 0.29\negdelta{0.14} & 0.36\negdelta{0.11} & %
      0.41\negdelta{0.26} & 0.77\posdelta{0.04} & 0.54\negdelta{0.15} & %
      0.38\negdelta{0.24} & 0.464\negdelta{0.22}\\

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.65      0.24      0.36       691
      % negative       0.46      0.27      0.34       296
      % neutral       0.41      0.83      0.55       542
      % avg / total       0.53      0.46      0.42      1529
      % Macro-Averaged F1-Score (Positive and Negative Classes): 34.81%
      % Micro-Averaged F1-Score (All Classes): 45.6508%

      TBD & 0.65\negdelta{0.12} & 0.24\negdelta{0.47} & 0.36\negdelta{0.38} & %
      0.46\negdelta{0.08} & 0.27\negdelta{0.12} & 0.34\negdelta{0.11} & %
      0.41\negdelta{0.22} & 0.83\posdelta{0.06} & 0.55\negdelta{0.14} & %
      0.348\negdelta{0.25} & 0.457\negdelta{0.22}\\

      % Training musto
      % Testing musto
      % Evaluating musto
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.63      0.29      0.40       691
      % negative       0.47      0.34      0.39       296
      % neutral       0.42      0.77      0.54       542
      % avg / total       0.52      0.47      0.45      1529
      % Macro-Averaged F1-Score (Positive and Negative Classes): 39.47%
      % Micro-Averaged F1-Score (All Classes): 46.9588%
      MST & 0.63\negdelta{0.12} & 0.29\negdelta{0.43} & 0.4\negdelta{0.34} & %
      0.47\negdelta{0.01} & 0.34\negdelta{0.13} & 0.39\negdelta{0.09} & %
      0.42\negdelta{0.26} & 0.77\posdelta{0.05} & 0.54\negdelta{0.16} & %
      0.4\negdelta{0.21} & 0.47\negdelta{0.21}\\

      % Training jurek
      % Testing jurek
      % Evaluating jurek
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.59      0.30      0.40       691
      % negative       0.42      0.18      0.25       296
      % neutral       0.41      0.80      0.54       542
      % avg / total       0.50      0.45      0.42      1529
      % Macro-Averaged F1-Score (Positive and Negative Classes): 32.69%
      % Micro-Averaged F1-Score (All Classes): 45.3891%
      JRK & 0.44\negdelta{0.16} & 0.22\negdelta{0.09} & 0.29\negdelta{0.12} & %
      0.14\negdelta{0.28} & 0.06\negdelta{0.14} & 0.08\negdelta{0.19} & %
      0.36\negdelta{0.07} & 0.7\negdelta{0.1} & 0.47\negdelta{0.09} & %
      0.19\negdelta{0.15} & 0.36\negdelta{0.11}\\

      % Training kolchyna
      % Testing kolchyna
      % Evaluating kolchyna
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.61      0.23      0.33       691
      % negative       0.33      0.21      0.26       296
      % neutral       0.41      0.82      0.55       542
      % avg / total       0.48      0.43      0.39      1529
      % Macro-Averaged F1-Score (Positive and Negative Classes): 29.52%
      % Micro-Averaged F1-Score (All Classes): 43.4925%
      KLCH & 0.61\negdelta{0.1} & 0.23\negdelta{0.49} & 0.33\negdelta{0.38} & %
      0.33\negdelta{0.01} & 0.21\posdelta{0.04} & 0.26\posdelta{0.04} & %
      0.41\negdelta{0.25} & 0.82 & 0.55\negdelta{0.18} & %
      0.3\negdelta{0.17} & 0.44\negdelta{0.21}\\

      %% Gamon
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.59      0.77      0.66       691
      %% negative       0.37      0.14      0.20       296
      %% neutral       0.57      0.55      0.56       542
      %% avg / total       0.54      0.57      0.54      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 43.26%
      %% Micro-Averaged F1-Score (All Classes): 56.6383%

      GMN & 0.59\negdelta{0.08} & 0.77\posdelta{0.04} & 0.66\negdelta{0.04} & %
      0.37\negdelta{0.02} & 0.14\negdelta{0.01} & 0.2\negdelta{0.01} & %
      0.57\negdelta{0.03} & 0.55\negdelta{0.17} & 0.56\negdelta{0.1} & %
      0.43\negdelta{0.02} & 0.57\negdelta{0.05}\\

      %% Mohammad
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.78      0.76      0.77       691
      %% negative       0.59      0.54      0.56       296
      %% neutral       0.70      0.74      0.72       542
      %% avg / total       0.71      0.71      0.71      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 66.80%
      %% Micro-Averaged F1-Score (All Classes): 71.3538%
      MHM & 0.78\negdelta{0.01} & 0.76\negdelta{0.01} & 0.77\negdelta{0.01} & %
      0.59\posdelta{0.01} & 0.54\negdelta{0.02} & 0.56\negdelta{0.01} & %
      0.7\negdelta{0.03} & 0.74\negdelta{0.02} & 0.72\negdelta{0.02} & %
      0.67\negdelta{0.006} & 0.71\negdelta{0.007}\\

      %% Guenther (2014)
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.68      0.80      0.73       691
      %% negative       0.55      0.43      0.48       296
      %% neutral       0.67      0.59      0.62       542
      %% avg / total       0.65      0.65      0.65      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 60.72%
      %% Micro-Averaged F1-Score (All Classes): 65.3368%
      GNT & 0.68\negdelta{0.03} & 0.8 & 0.73\negdelta{0.02} & %
      0.55 & 0.43\negdelta{0.02} & 0.48\negdelta{0.02} & %
      0.67\negdelta{0.01} & 0.59\negdelta{0.04} & 0.62\negdelta{0.03} & %
      0.61\negdelta{0.017} & 0.65\negdelta{0.02}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      1.00      0.62       691
      %% negative       0.00      0.00      0.00       296
      %% neutral       0.00      0.00      0.00       542
      %% avg / total       0.20      0.45      0.28      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 31.13%
      %% Micro-Averaged F1-Score (All Classes): 45.1929%
      Y\&C & 0.45 & 1.0 & 0.62 & %
      0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.31 & 0.45\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.46      0.98      0.62       691
      %% negative       0.00      0.00      0.00       296
      %% neutral       0.63      0.05      0.09       542
      %% avg / total       0.43      0.46      0.31      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 31.18%
      %% Micro-Averaged F1-Score (All Classes): 46.1086%
      RAE & 0.46\negdelta{0.15} & 0.98\posdelta{0.37} & 0.62\posdelta{0.01} & %
      0.0\negdelta{0.22} & 0.0\negdelta{0.01} & 0.0\negdelta{0.03} & %
      0.63\posdelta{0.15} & 0.05\negdelta{0.67} & 0.09\negdelta{0.48} & %
      0.31\negdelta{0.01} & 0.46\negdelta{0.08}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      0.92      0.60       691
      %% negative       0.08      0.01      0.01       296
      %% neutral       0.26      0.03      0.06       542
      %% avg / total       0.31      0.43      0.29      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 30.66%
      %% Micro-Averaged F1-Score (All Classes): 43.0347%
      MVRNN & 0.45 & 0.92\negdelta{0.08} & 0.6\negdelta{0.02} & %
      0.08\posdelta{0.08} & 0.01\posdelta{0.01} & 0.01\posdelta{0.01} & %
      0.26\posdelta{0.26} & 0.03\posdelta{0.03} & 0.06\posdelta{0.06} & %
      0.31 & 0.43\negdelta{0.02}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.45      0.93      0.61       691
      %% negative       0.29      0.01      0.01       296
      %% neutral       0.40      0.07      0.12       542
      %% avg / total       0.40      0.45      0.32      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 31.17%
      %% Micro-Averaged F1-Score (All Classes): 44.9313%
      RNTN & 0.45 & 0.93\posdelta{0.11} & 0.61\posdelta{0.02} & %
      0.29\posdelta{0.05} & 0.01\negdelta{0.05} & 0.01\negdelta{0.09} & %
      0.4\negdelta{0.03} & 0.07\negdelta{0.1} & 0.12\negdelta{0.12} & %
      0.31\negdelta{0.03} & 0.45\negdelta{0.01}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.56      0.79      0.66       691
      %% negative       0.00      0.00      0.00       296
      %% neutral       0.57      0.57      0.57       542
      %% avg / total       0.45      0.56      0.50      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 32.76%
      %% Micro-Averaged F1-Score (All Classes): 56.1805%
      SEV & 0.56\negdelta{0.17} & 0.79\posdelta{0.05} & 0.66\negdelta{0.08} & %
      0.0 & 0.0 & 0.0 & %
      0.57\posdelta{0.01} & 0.57\negdelta{0.27} & 0.57\negdelta{0.11} & %
      0.33\negdelta{0.04} & 0.56\negdelta{0.08}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.65      0.59      0.62       691
      %% negative       0.62      0.22      0.32       296
      %% neutral       0.50      0.74      0.60       542
      %% avg / total       0.59      0.57      0.55      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 46.83%
      %% Micro-Averaged F1-Score (All Classes): 56.8345%
      BAZ & 0.65\negdelta{0.17} & 0.59\negdelta{0.13} & 0.62\negdelta{0.15} & %
      0.62 & 0.22\negdelta{0.27} & 0.32\negdelta{0.23} & %
      0.5\negdelta{0.18} & 0.74\negdelta{0.11} & 0.6\negdelta{0.16} & %
      0.47\negdelta{0.19} & 0.57\negdelta{0.16}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.58      0.77      0.66       691
      %% negative       0.54      0.53      0.54       296
      %% neutral       0.63      0.37      0.46       542
      %% avg / total       0.59      0.58      0.56      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 59.72%
      %% Micro-Averaged F1-Score (All Classes): 57.9464%
      LBA$^{(1)}$ & 0.58\negdelta{0.18} & 0.77\negdelta{0.07} & 0.66\negdelta{0.13} & %
      0.54\negdelta{0.06} & 0.53\negdelta{0.03} & 0.54\negdelta{0.04} & %
      0.63\negdelta{0.12} & 0.37\negdelta{0.31} & 0.46\negdelta{0.26} & %
      0.6\negdelta{0.09} & 0.58\negdelta{0.15}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.67      0.52      0.59       691
      %% negative       0.51      0.44      0.47       296
      %% neutral       0.52      0.70      0.60       542
      %% avg / total       0.58      0.57      0.57      1529
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 53.01%
      %% Micro-Averaged F1-Score (All Classes): 57.0307%
      LBA$^{(2)}$ & 0.67\negdelta{0.17} & 0.52\negdelta{0.21} & 0.59\negdelta{0.19} & %
      0.51\negdelta{0.06} & 0.44\negdelta{0.04} & 0.47\negdelta{0.06} & %
      0.52\negdelta{0.14} & 0.7\negdelta{0.12} & 0.6\negdelta{0.13} & %
      0.53\negdelta{0.12} & 0.57\negdelta{0.15}\\

      \multicolumn{12}{c}{\cellcolor{cellcolor}SB10k}\\

      % Training hu-liu
      % Testing hu-liu
      % Evaluating hu-liu
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.41      0.42      0.42       354
      % negative       0.24      0.28      0.26       212
      % neutral       0.66      0.63      0.65       930
      % avg / total       0.54      0.53      0.54      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 33.75%
      % Micro-Averaged F1-Score (All Classes): 53.2086%
      HL & 0.41\negdelta{0.08} & 0.42\negdelta{0.2} & 0.42\negdelta{0.13} & %
      0.24\negdelta{0.03} & 0.28\negdelta{0.06} & 0.26\negdelta{0.04} & %
      0.66\negdelta{0.07} & 0.63\negdelta{0.01} & 0.65\negdelta{0.02} & %
      0.34\negdelta{0.08} & 0.53\negdelta{0.05}\\

      % Training taboada
      % Testing taboada
      % Evaluating taboada
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.41      0.37      0.39       354
      % negative       0.21      0.24      0.22       212
      % neutral       0.65      0.66      0.66       930
      % avg / total       0.53      0.53      0.53      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 30.77%
      % Micro-Averaged F1-Score (All Classes): 53.3422%
      TBD & 0.41\negdelta{0.07} & 0.37\negdelta{0.23} & 0.39\negdelta{0.14} & %
      0.21\negdelta{0.03} & 0.24\negdelta{0.03} & 0.22\negdelta{0.03} & %
      0.65\negdelta{0.07} & 0.66\posdelta{0.03} & 0.66\negdelta{0.01} & %
      0.31\negdelta{0.08} & 0.53\negdelta{0.04}\\

      % Training musto
      % Testing musto
      % Evaluating musto
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.40      0.32      0.35       354
      % negative       0.26      0.30      0.28       212
      % neutral       0.65      0.68      0.67       930
      % avg / total       0.54      0.54      0.54      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 31.55%
      % Micro-Averaged F1-Score (All Classes): 54.0775%
      MST & 0.4\negdelta{0.05} & 0.32\negdelta{0.17} & 0.35\negdelta{0.12} & %
      0.26\negdelta{0.03} & 0.3\negdelta{0.05} & 0.28\negdelta{0.04} & %
      0.65\negdelta{0.05} & 0.68\negdelta{0.04} & 0.67 & %
      0.32\negdelta{0.08} & 0.54\negdelta{0.03}\\

      % Training jurek
      % Testing jurek
      % Evaluating jurek
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.40      0.42      0.41       354
      % negative       0.36      0.26      0.30       212
      % neutral       0.69      0.72      0.71       930
      % avg / total       0.58      0.59      0.58      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 35.67%
      % Micro-Averaged F1-Score (All Classes): 58.5561%
      JRK & 0.4\negdelta{0.01} & 0.42\negdelta{0.03} & 0.41\negdelta{0.01} & %
      0.36 & 0.26 & 0.3 & %
      0.69 & 0.72\negdelta{0.03} & 0.71\negdelta{0.01} & %
      0.36\posdelta{0.01} & 0.59\negdelta{0.006}\\

      % Training kolchyna
      % Testing kolchyna
      % Evaluating kolchyna
      % General Statistics:
      % precision    recall  f1-score   support
      % positive       0.42      0.21      0.28       354
      % negative       0.25      0.13      0.17       212
      % neutral       0.66      0.86      0.75       930
      % avg / total       0.55      0.60      0.56      1496
      % Macro-Averaged F1-Score (Positive and Negative Classes): 22.51%
      % Micro-Averaged F1-Score (All Classes): 60.3610%
      KLCH & 0.42\posdelta{0.03} & 0.21\negdelta{0.01} & 0.28 & %
      0.25\negdelta{0.09} & 0.13 & 0.17\negdelta{0.02} & %
      0.66 & 0.86 & 0.75 & %
      0.23\negdelta{0.005} & 0.6\negdelta{0.002}\\

      %% Gamon
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.48      0.31      0.37       354
      %% negative       0.27      0.07      0.11       212
      %% neutral       0.69      0.90      0.78       930
      %% avg / total       0.58      0.64      0.59      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 24.28%
      %% Micro-Averaged F1-Score (All Classes): 64.1043

      GMN & 0.48\negdelta{0.17} & 0.31\negdelta{0.14} & 0.37\negdelta{0.16} & %
      0.27\negdelta{0.11} & 0.07\negdelta{0.01} & 0.11\negdelta{0.02} & %
      0.69\negdelta{0.03} & 0.9\negdelta{0.03} & 0.78\negdelta{0.03} & %
      0.24\negdelta{0.09} & 0.64\negdelta{0.06}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.67      0.62      0.65       354
      %% negative       0.59      0.42      0.49       212
      %% neutral       0.80      0.88      0.84       930
      %% avg / total       0.74      0.75      0.74      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 56.61%
      %% Micro-Averaged F1-Score (All Classes): 75.1337%
      MHM & 0.67\negdelta{0.04} & 0.62\negdelta{0.03} & 0.65\negdelta{0.03} & %
      0.59\negdelta{0.08} & 0.42\negdelta{0.02} & 0.49\negdelta{0.04} & %
      0.8 & 0.88\negdelta{0.01} & 0.84 & %
      0.56\negdelta{0.002} & 0.75\negdelta{0.001}\\

      %% General Statistics:
      %%              precision    recall  f1-score   support
      %%    positive       0.70      0.58      0.63       354
      %%    negative       0.55      0.31      0.40       212
      %%     neutral       0.77      0.90      0.83       930
      %% avg / total       0.72      0.74      0.72      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 51.50%
      %% Micro-Averaged F1-Score (All Classes): 73.7968%
      GNT & 0.42\negdelta{0.25} & 0.21\negdelta{0.41} & 0.28\negdelta{0.36} & %
      0.25\negdelta{0.19} & 0.13\negdelta{0.15} & 0.17\negdelta{0.17} & %
      0.66\negdelta{0.12} & 0.86\negdelta{0.01} & 0.75\negdelta{0.07} & %
      0.22\negdelta{0.2} & 0.604\negdelta{0.12}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.1658%
      Y\&C & 0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & 1.0 & 0.77 & %
      0.0 & 0.62\\

      %% General Statistics:
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.46      0.62      0.53       354
      %% negative       0.18      0.02      0.03       212
      %% neutral       0.77      0.82      0.79       930
      %% avg / total       0.61      0.66      0.62      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 28.12%
      %% Micro-Averaged F1-Score (All Classes): 66.0428%
      RAE & 0.46\negdelta{0.04} & 0.62\negdelta{0.11} & 0.53\negdelta{0.06} & %
      0.18\negdelta{0.17} & 0.02\negdelta{0.04} & 0.03\negdelta{0.07} & %
      0.77\negdelta{0.03} & 0.82\posdelta{0.02} & 0.79\negdelta{0.01} & %
      0.28\negdelta{0.07} & 0.66\negdelta{0.02}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.19      0.01      0.03       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      0.97      0.76       930
      %% avg / total       0.43      0.61      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 1.31%
      %% Micro-Averaged F1-Score (All Classes): 60.8957%
      MVRNN & 0.19 & 0.01 & 0.03 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & 0.97 & 0.76 & %
      0.01 & 0.61\\

      %% General Statistics:
      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.00      0.00      0.00       354
      %% negative       0.00      0.00      0.00       212
      %% neutral       0.62      1.00      0.77       930
      %% avg / total       0.39      0.62      0.48      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 0.00%
      %% Micro-Averaged F1-Score (All Classes): 62.0321%
      RNTN & 0.0 & 0.0 & 0.0 & %
      0.0 & 0.0 & 0.0 & %
      0.62 & 1.0 & 0.77 & %
      0.0 & 0.62\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.58      0.39      0.47       354
      %% negative       0.23      0.05      0.08       212
      %% neutral       0.70      0.92      0.80       930
      %% avg / total       0.61      0.67      0.62      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 27.31%
      %% Micro-Averaged F1-Score (All Classes): 66.9786%
      SEV & 0.58\negdelta{0.06} & 0.39\negdelta{0.19} & 0.47\negdelta{0.14} & %
      0.23\negdelta{0.28} & 0.05\negdelta{0.16} & 0.08\negdelta{0.22} & %
      0.7\negdelta{0.06} & 0.92\posdelta{0.03} & 0.8\negdelta{0.02} & %
      0.27\negdelta{0.18} & 0.67\negdelta{0.05}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.69      0.54      0.60       354
      %% negative       0.36      0.49      0.41       212
      %% neutral       0.79      0.79      0.79       930
      %% avg / total       0.71      0.69      0.69      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 50.68%
      %% Micro-Averaged F1-Score (All Classes): 68.9840%
      BAZ & 0.69\negdelta{0.03} & 0.54\negdelta{0.16} & 0.6\negdelta{0.05} & %
      0.36\negdelta{0.17} & 0.49\posdelta{0.16} & 0.41 & %
      0.79 & 0.79\negdelta{0.12} & 0.79\negdelta{0.05} & %
      0.51\negdelta{0.02} & 0.69\negdelta{0.06}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.24      0.86      0.38       354
      %% negative       0.45      0.45      0.45       212
      %% neutral       0.69      0.01      0.02       930
      %% avg / total       0.55      0.27      0.16      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 41.39%
      %% Micro-Averaged F1-Score (All Classes): 27.4733%
      LBA$^{(1)}$ & 0.24\negdelta{0.36} & 0.86\posdelta{0.14} & 0.38\negdelta{0.28} & %
      0.45\negdelta{0.02} & 0.45\posdelta{0.03} & 0.45\posdelta{0.01} & %
      0.69\negdelta{0.15} & 0.01\negdelta{0.79} & 0.02\negdelta{0.8} & %
      0.41\negdelta{0.14} & 0.27\negdelta{0.46}\\

      %% General Statistics:
      %% precision    recall  f1-score   support
      %% positive       0.74      0.42      0.54       354
      %% negative       0.62      0.25      0.35       212
      %% neutral       0.73      0.95      0.82       930
      %% avg / total       0.72      0.72      0.69      1496
      %% Macro-Averaged F1-Score (Positive and Negative Classes): 44.55%
      %% Micro-Averaged F1-Score (All Classes): 72.4599%
      LBA$^{(2)}$ & 0.74\negdelta{0.02} & 0.42\negdelta{0.15} & 0.54\negdelta{0.1} & %
      0.62\posdelta{0.07} & 0.25\negdelta{0.14} & 0.35\negdelta{0.11} & %
      0.73\negdelta{0.06} & 0.95\posdelta{0.05} & 0.82\negdelta{0.02} & %
      0.45\negdelta{0.1} & 0.72\negdelta{0.03}\\\bottomrule
    \end{tabular}
    \egroup{}
    \caption[Results of MLSA methods without text normalization]{
      Results of MLSA methods without text normalization}\label{snt-cgsa:tbl:res-no-normalization}
  \end{center}
\end{table}

A similar situation also happens on the SB10k corpus, where we can see
even fewer improvements (in 10 out of 176 cases).  The biggest
increase this time ($+0.16$ recall) is demonstrated by the approach
of~\citet{Baziotis:17} on the negative class.  The remaining growths,
however, are much smaller and typically range between one and seven
percent.  On the other hand, three of the tested methods (Y\&C, MVRNN,
and RNTN) have exactly the same results as they did previously with
normalized messages, although, most of the time, these classifiers
only predict the majority label anyway.  As to the rest of the
systems, we can see that their scores are notably lower than in our
initial experiments, but the decrease is much smaller in comparison
with the PotTS corpus.  A notable exception in this case is a major drop
in the recall of neutral messages ($-0.79$) exhibited by our
LBA$^{(1)}$ system, which, in turn, results in a significant decrease
of its macro- and micro-averaged \F{}-scores ($-0.14$ and $-0.46$,
respectively).  Other approaches (including the sibling method
LBA$^{(2)}$) behave much more stably in this regard: on average, their
scores change by $-0.06$ macro- and $-0.03$ micro-\F{}.
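
The effect of the neutral recall on the averaged scores of LBA$^{(1)}$
can be retraced with a rough back-of-the-envelope check.  This is only
a sketch: it uses the rounded per-class values from
Table~\ref{snt-cgsa:tbl:res-no-normalization} and the class supports
reported in the evaluation output for the SB10k test set (354
positive, 212 negative, and 930 neutral messages), and the symbols
$R_c$ (recall of class~$c$) and $n_c$ (number of test messages of
class~$c$) are ad-hoc notation introduced here for illustration.
Since every message receives exactly one label, the micro-averaged
score over all classes coincides with accuracy, i.e., with the
support-weighted mean of the per-class recalls:
\[
  F_{\mathrm{micro}}
  = \frac{\sum_{c} R_c\, n_c}{\sum_{c} n_c}
  \approx \frac{0.86 \cdot 354 + 0.45 \cdot 212 + 0.01 \cdot 930}{1496}
  \approx 0.27.
\]
Because neutral messages make up roughly 62\% of the test set, a
near-zero recall on this class alone caps the micro-averaged result at
a low value.  The macro-averaged score, which only covers the two
polar classes, is affected indirectly: almost all neutral messages are
now assigned a polar label, which is consistent with the precision of
the positive class falling to 0.24 and its \F{}-score to 0.38, so that
\[
  F_{\mathrm{macro}}
  = \frac{F_{\mathrm{pos}} + F_{\mathrm{neg}}}{2}
  \approx \frac{0.38 + 0.45}{2}
  \approx 0.41.
\]
Both estimates agree with the reported values up to rounding.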

Similarly to the results on the PotTS data, most of the gains are
concentrated in the recall of the neutral class (four out of ten
improvements), with the other positive changes being rather sporadic
and affecting only a few classifiers.  Nevertheless, unlike in the
previous case, this time we can even observe a slight improvement of
the macro-averaged \F-measure for one of the systems (the
lexicon-based approach of~\citeauthor{Jurek:15}), but its
micro-averaged metric remains largely unaffected by this increase.  In
general, however, the vast majority of macro- and micro-\F-scores show
an obvious decline on both datasets, which once again confirms the
advantage of preprocessing.

\section{Summary and Conclusions}\label{slsa:subsec:conclusions}

Now that we have reached the end of the chapter, we would like to
remind the reader that in this part of the thesis we have made the
following findings and contributions:
\begin{itemize}
  \item we have compared three major families of message-level
    sentiment analysis methods: lexicon-, machine-learning-- and
    deep-learning--based ones, finding that the last two groups
    significantly outperform lexicon-driven systems;
  \item surprisingly, among all compared lexicon methods, the
    simplest one (the classifier of~\citeauthor{Hu:04}
    [\citeyear{Hu:04}]) produced the best macro- and micro-averaged
    \F{}-results on the PotTS corpus (0.615 and 0.685, respectively)
    and also yielded the highest macro \F{}-measure on the SB10k
    dataset (0.421).  Other systems, however, could have improved
    their scores if they had handled the negation of polar terms better
    (after switching off the negation component in the method
    of~\citeauthor{Musto:14}, its macro-\F{} on the PotTS corpus
    increased to 0.641, surpassing the benchmark
    of~\citeauthor{Hu:04});
  \item as expected, the ML-based system of~\citet{Mohammad:13}---the
    winner of the inaugural run of the SemEval task on sentiment analysis
    of Twitter~\cite{Nakov:13}---also surpassed other ML competitors,
    achieving highly competitive results: 0.674 macro- and 0.727
    micro-\F{} on the PotTS data, and 0.564 macro- and 0.752
    micro-averaged \F{}-measure on the SB10k test set;
  \item as in the previous case, however, these results could have
    been improved if the classifier had dispensed with character-level
    and part-of-speech features and used logistic regression instead
    of an SVM;
  \item a much more varied picture was observed with
    deep-learning--based systems, which frequently degenerated into
    predicting the majority class for all tweets, but sometimes
    yielded extraordinarily good results, as was the case with our
    proposed lexicon-based attention system, which attained 0.69
    macro-\F{} on the PotTS corpus and 0.55 macro-\F{} on the
    SB10k dataset (0.73 and 0.75 micro-\F{}, respectively), setting a
    new state of the art for the former data;
  \item speaking of word embeddings, we should note that almost all
    DL-based approaches showed fairly low scores when they used
    randomly initialized task-specific embeddings, but notably
    improved their results after switching to pre-trained word2vec
    vectors, and benefited even more from the least-squares fallback;
  \item contrary to our expectations, we could not overcome the majority
    class pitfall of DL-based systems after adding more weakly
    supervised training data, which, in general, only lowered the
    scores of both ML- and DL-based methods.  Since this result
    contradicts the findings of other authors, we hypothesize that
    this degradation is primarily due to the differences in the class
    distributions between automatically and manually labeled tweets;
  \item on the other hand, we could see that using higher-quality
    sentiment lexicons (especially manually curated and
    dictionary-based ones) resulted in further improvements for the
    systems that relied on this lexical resource;
  \item last but not least, we confirmed the utility of the text
    normalization step, which brought about significant improvements
    for most of the tested methods, as shown by our last ablation test.
\end{itemize}