% FILE: sentiment.tex Version 0.01
% AUTHOR: Uladzimir Sidarenka
% This is a modified version of the file main.tex developed by the
% University Duisburg-Essen, Duisburg, AG Prof. Dr. Günter Törner
% Verena Gondek, Andy Braune, Henning Kerstan Fachbereich Mathematik
% Lotharstr. 65., 47057 Duisburg entstanden im Rahmen des
% DFG-Projektes DissOnlineTutor in Zusammenarbeit mit der
% Humboldt-Universitaet zu Berlin AG Elektronisches Publizieren Joanna
% Rycko und der DNB - Deutsche Nationalbibliothek
\chapter{Sentiment Corpus}\label{chap:corpus}
A crucial prerequisite for testing any hypothesis in computational
linguistics is the existence of sufficiently large, manually annotated
datasets on which such conjectures can be evaluated. Since we were not
aware of any human-labeled sentiment data for German Twitter at the
time of writing this chapter, we decided to create our own corpus,
which we introduce in this part of the thesis.
We begin our introduction by describing the selection criteria and
tracking procedure that we used to collect the initial corpus data.
After presenting the annotation scheme, we perform an extensive
analysis of the inter-annotator agreement. For this purpose, we
introduce two new versions of the popular $\kappa$
metric~\cite{Cohen:60}---binary and proportional kappa---which have
been specifically adjusted to the peculiarities of our annotation
task. Using these measures, we check the inter-coder reliability of
annotated sentiments, their targets and sources, polar terms and their
modifying elements (intensifiers, diminishers, and negations). In the
final step, we estimate the correlation between the initial selection
rules and the number of labeled elements as well as the difficulty of
their annotation.
\section{Data Collection}
A question that typically arises first when one starts creating
a new dataset is which selection criteria should be used to
collect the initial data. Whereas for low-level NLP applications,
such as part-of-speech tagging or syntactic parsing, it typically
suffices to define the language domain to sample from (since the
phenomena of interest are usually frequent and uniformly spread), for
semantically demanding tasks with many diverse ways of expression one
also needs to consider various in-domain factors, which might
significantly affect the final distribution, making the resulting
corpus either utterly sparse or excessively biased.
In order to minimize both of these risks (sparseness and bias), we
decided to use a compromise approach by gathering one part of the new
dataset from microblogs that were a priori more likely to have
sentiments (thus increasing the recall) and sampling the rest of the
corpus uniformly at random (thus reducing the bias).
As criteria that could help us obtain more opinions, we considered the
topic and form of the tweets, assuming that some subjects, especially
social or political issues, would be more likely to attract subjective
statements.
Because we started creating the corpus in spring 2013, obvious choices
of opinion-rich topics to us were \emph{the papal conclave}, which
took place in March of that year, and \emph{the German federal
elections}, which were held in autumn. Since both of these events
implied some form of voting, we decided to counterbalance the election
specifics by including \emph{general political discussions} as the
third subject in our dataset. Finally, to obey the second principle,
\ie{} to keep the corpus bias low, we sampled the rest of the data
from \emph{casual everyday conversations} without any prefiltering.
We collected messages for the first and third topics by tracking
German microblogs between March and September 2013 via the public
Twitter API\footnote{\url{https://pypi.python.org/pypi/tweetstream}}
with the help of extensive keyword lists describing these
topics.\footnote{A full list of tracking keywords is available at
\url{https://github.com/WladimirSidorenko/PotTS/blob/master/docs/tracking_keywords.pdf}.}
Tweets for the second topic (German federal elections) were provided
to us by a research group of communication scientists from the
University of M\"unster, who were our cooperation partner in a joint
BMBF project ``Discourse Analysis in Social Media.'' Finally, for the
fourth category (casual everyday conversations), we used the complete
German Twitter snapshot of~\citet{Scheffler:14}, which includes
$\approx97\%$ of all German microblogs posted in April 2013. This
way, we obtained a total of 27.4~M messages, with the snapshot corpus
being by far the most prolific source of the data.
%% For our work, the in-domain factors to consider were the topics and
%% the form of tweets. Since we wanted our corpus to be as
%% representative as possible, we had to make sure that the topics we
%% choose for sampling lend themselves as fruitful opinion sources. At
%% the same time, we did not want automatically generated ad and news
%% tweets to spoil our data and also introduced additional formal
%% criteria (described below) that the tweets had to satisfy in order to
%% be chosen. But then again, applying these restriction might make the
%% dataset excessively biased, so we did allow for a certain proportion
%% of tweets being selected even if they did not conform to our
%% constraints.
%% Politics: 59,531 messages, keywords: Altmaier, Wowereit, Minister,
%% Bundeskanzleramt, Schwarz-Gelb
%% Papst: 51,579 messages, keywords: papst, pabst, konklave, Vatikan
%% General Tweets: 51,579 messages, keywords: papst, pabst, konklave, Vatikan
%% Federal Elections: 3,131,315 messages, keywords:
%% General: 24,179,871 messages, keywords:
In the next step, we divided all tweets of the same topic into three
groups based on the following formal criteria:
\begin{itemize}
\item We put all messages that contained at least one polar term from
the sentiment lexicon of \citet{Remus:10} into the first group;
\item Microblogs that did not satisfy the first condition, but had at
least one exclamation mark or emoticon were allocated to the second
group;
\item All remaining microblogs were assigned to the third category.
\end{itemize}
A detailed breakdown of the resulting distribution across topics and
formal groups is given in Table~\ref{snt:tbl:corp:topic-bins}.
\begin{table}[hbt!]\small
\begin{center}
\bgroup\setlength\tabcolsep{0.13\tabcolsep}\scriptsize
\begin{tabular}{l*{5}{>{\centering\arraybackslash}p{0.155\textwidth}}}
\toprule
& \multicolumn{4}{c}{\bfseries Formal Criterion} & \\\cmidrule{2-5}
\multirow{-2}{0.2\columnwidth}{\centering\bfseries
Topic} & Polar Terms & Emoticons & Remaining Tweets & Total &\multirow{-2}{0.12\textwidth}{\centering\bfseries Sample\\ Keywords}\\\midrule
Federal Elections & 537,083 (22.38\%) & 50,567 (2.1\%) & 1,811,742 (75.5\%) & 2,399,392 & \tiny\emph{Abgeordnete} (\emph{representative}), \emph{Regierung} (\emph{government})\\
Papal Conclave & 7,859 (15.11\%) & 1,260 (2.42\%) & 42,879 (82.46\%) & 51,998 & \tiny\emph{Papst} (\emph{pope}), \emph{Pabst} (\emph{pobe})\\
Political Discussions & 10,552 (25.8\%) & 777 (1.9\%) & 29,555 (72.29\%) & 40,884 & \tiny\emph{Politik} (\emph{politics}), \emph{Minister} (\emph{minister})\\
General Conversations & 3,201,847 (18.7\%) & 813,478 (4.7\%) & 13,088,008 (76.5\%) & 17,103,333 & \tiny\emph{den} (\emph{the}), \emph{sie} (\emph{she})\\
\bottomrule
\end{tabular}
\egroup{}
\caption[Distribution of downloaded messages across topics and
formal groups]{Distribution of downloaded messages across topics
and formal groups\newline (percentages are given with respect to
the total number of tweets pertaining to the given
topic)\label{snt:tbl:corp:topic-bins}}
\end{center}
\end{table}
%% Furthermore, as one also can observe, the relative proportions of the
%% formal tweets for the federal elections and political discussions are
%% approximately the same, while general conversations are apparently
%% more biased towards containing polar terms, whereas tweets about the
%% papal conclave, on the contrary, are rather supposed to be objectve.
%% We will check later in Subsection \ref{sec:snt:iaa} whether the
%% distribution of the annotated sentiments correlates with this
%% proportional split of groups and topics.
To create the final corpus, we randomly sampled 666 tweets from each
of the three formal classes for each of the four topics, getting a
total of 7,992 messages ($666\text{ microblogs} \times 3\text{ formal
criteria} \times 4\text{ topics}$).
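For illustration, the following Python sketch outlines this two-step
selection procedure (assignment of each tweet to a formal group,
followed by uniform sampling of 666 tweets per topic and group). The
helper names \texttt{formal\_group} and \texttt{sample\_corpus} as well
as the simplistic emoticon test are hypothetical simplifications and do
not reproduce our actual collection scripts.
\begin{lstlisting}[language=Python]
import random
from collections import defaultdict

def formal_group(tweet, polar_lexicon):
    """Assign a tweet to one of the three formal groups."""
    tokens = tweet.lower().split()
    if any(tok in polar_lexicon for tok in tokens):
        return "polar-term"    # contains a term from the sentiment lexicon
    if "!" in tweet or ":)" in tweet or ":(" in tweet:
        return "emoticon"      # no polar term, but an emoticon or '!'
    return "remaining"         # everything else

def sample_corpus(tweets_by_topic, polar_lexicon, n=666, seed=42):
    """Draw n tweets per topic and formal group (666 x 3 x 4 = 7,992)."""
    random.seed(seed)
    bins = defaultdict(list)
    for topic, tweets in tweets_by_topic.items():
        for tweet in tweets:
            bins[(topic, formal_group(tweet, polar_lexicon))].append(tweet)
    return {cell: random.sample(candidates, n)
            for cell, candidates in bins.items()}
\end{lstlisting}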
\section{Annotation Scheme}\label{subsec:snt:ascheme}
In the next step, we devised an annotation scheme for our data. To
maximally cover all relevant sentiment aspects, we came up with an
extensive list of elements that had to be annotated by our experts.
This list included:
\begin{itemize}
\item
\textbf{\markable{sentiment}}s, which were defined as \emph{polar
subjective evaluative opinions about people, entities, or events}.
According to our definition, a \markable{sentiment} always had to
evaluate an entity that was explicitly mentioned in text---the
target; and the annotators had to label both the target and its
respective evaluative expression with the \markable{sentiment}
tag. Apart from tagging the text span, they also had to specify the
following attributes of opinions:
\begin{itemize}
\item\attribute{polarity}, which reflected the attitude of opinion's
holder to the evaluated entity. Following
\citet{Jindal:06a,Jindal:06b}, we distinguished between
\emph{positive}, \emph{negative}, and \emph{comparative}
sentiments;
\item\attribute{intensity}, which showed the emotional strength of
an opinion. Possible values for this attribute were: \emph{weak},
\emph{medium}, and \emph{strong};
\item finally, drawing on the works of~\citet{Bosco:13} and
\citet{Rosenthal:14}, we introduced a special boolean attribute
\attribute{sarcasm} in order to distinguish sarcastically meant
statements;
\end{itemize}
%% According to this definition, three important constraints that a
%% potential subjective statement had to satisfy in order to be
%% labeled as a sentiment in our dataset
%% were: \begin{inparaenum}[(i)] \item \textit{polarity}, \ie{} the
%% statement in question had to reflect an either positive or
%% negative attitude; \item \textit{subjectivity}, \ie{} the
%% expressed opinion had to be a personal belief not verifiable by
%% any objective means; and, finally, \item an \textit{evaluative}
%% nature, \ie{} the opinionated proposition had to refer to some
%% clearly discernable target from its surrounding
%% context. \end{inparaenum}
\item
we specified \textbf{\markable{target}}s as \emph{(real,
hypothetical, or collective) entities, properties, or propositions
(states or events) evaluated by opinions}. For this item, we
introduced the following three attributes:
\begin{itemize}
\item
a boolean property \attribute{preferred}, which distinguished
entities that were favored in comparisons;
\item
a link attribute \attribute{anaphref}, which had to point to the
antecedent of a pronominal target;
\item and, finally, another edge feature,
\attribute{sentiment-ref}, which had to link \markable{target}s
to their respective \markable{sentiment}s in the cases when the
\markable{target} span was located at the intersection of two
opinions;
\end{itemize}
\item
another important component of \markable{sentiment}s were
\textbf{\markable{source}}s, which denoted \emph{the immediate
author(s) or holder(s) of opinions}. The only property associated
with this element was \attribute{sentiment-ref}, which was defined
the same way as for \markable{target}s.
\end{itemize}
To help our annotators identify exact boundaries of these elements, we
explicitly asked them to annotate the \emph{smallest complete syntactic or
discourse-level units}, \ie{} noun phrases or sentences with all
their grammatical dependents.
A sample tweet analyzed according to this rule is shown in
Example~\ref{snt:exmp:sent-anno1}.
\begin{example}\label{snt:exmp:sent-anno1}
\upshape\sentiment{\target{Diese Milliardeneinnahmen} sind selbst
\source{Sch\"auble} peinlich}\\[0.8em]
\noindent\sentiment{\target{\itshape{}These billions of
revenue\upshape{}}\itshape{} are embarrassing even for\\
\upshape{}\source{\itshape{}Sch\"auble\upshape{}}}
\end{example}
In this message, we assigned the \markable{sentiment} tag to the
complete sentence because this grammatical unit is the smallest
syntactic constituent that simultaneously includes both the target of
the opinion (``Milliardeneinnahmen'' [\emph{billions of revenue}]) and
its evaluation (``peinlich'' [\emph{embarrassing}]). Furthermore, we
also labeled the whole noun phrase ``diese Milliardeneinnahmen''
(\emph{these billions of revenue}), including the demonstrative
pronoun ``diese'' (\emph{these}), as \markable{target}, since this
pronoun syntactically depends on the main target word
``Milliardeneinnahmen'' (\emph{billions of revenue}).
Apart from \markable{sentiment}s, \markable{target}s, and
\markable{source}s, we also asked the annotators to label elements
that could significantly affect the intensity and polarity of an
opinion. These elements were:
\begin{itemize}
\item
\textbf{\markable{polar term}}s, which we defined as \emph{words or
idioms that had a distinguishable evaluative lexical meaning}.
Typical examples of such terms were lexemes or set phrases such as
``ekelhaft'' (\emph{disgusting}), ``lieben'' (\emph{to love}),
``Held'' (\emph{hero}), ``wie die Pest meiden'' (\emph{to avoid like
the plague}). In contrast to \markable{target}s and
\markable{source}s, which could only occur in the presence of a
\markable{sentiment}, \markable{polar term}s were independent of
other tags and always had to be labeled in the corpus.
The main attributes of this element (\attribute{polarity},
\attribute{intensity}, and \attribute{sarcasm}) largely coincided
with the corresponding properties of \markable{sentiment}s, with the
only difference that, in the case of \markable{polar term}s, these
features had to reflect the lexical meaning of a word without taking
into account its context (\ie{} \emph{prior} polarity and
intensity), whereas for \markable{sentiment}s, they had to show the
compositional meaning of the whole opinion (\ie{} its
\emph{contextual} polarity and intensity).
Besides these common properties, \markable{polar term}s also had
their specific attributes: two boolean features
(\attribute{subjective-fact} and \attribute{uncertain}) and a link
attribute (\attribute{sen\-ti\-ment\--ref}). The first feature
showed whether a polar term denoted a factual entity with a clear
emotional connotation, \eg{} ``Atombombe'' (\emph{A-bomb}) or
``Naturschutz'' (\emph{nature protection}); the second property
signified cases in which the annotators were unsure about their
decisions; finally, the last attribute was defined in the same way
as it was previously specified for \markable{target}s and
\markable{source}s;
\item
\emph{elements that increased the expressivity and subjective sense
of polar terms} had to be labeled as
\textbf{\markable{intensifier}}s. Typical examples of such
expressions were adverbial modifiers such as ``sehr'' (\emph{very}),
``super'' (\emph{super}), ``stark'' (\emph{strongly});
\item
\textbf{\markable{diminisher}}s, on the contrary, were \emph{words
or phrases that reduced the strength of a polar term}. Like
\markable{intensifier}s, these elements were usually expressed by
adverbs, \eg{} ``weniger'' (\emph{less}), ``kaum'' (\emph{hardly}),
``fast'' (\emph{almost}).
Both of these tags (\markable{intensifier}s and
\markable{diminisher}s) only had two attributes: a binary feature
\attribute{degree} with two possible values: \emph{medium} and
\emph{strong}; and a link attribute \attribute{polar-term-ref},
which connected the modifier to its \markable{polar-term};
\item the final element, \textbf{\markable{negation}}s, was defined as
\emph{grammatical or lexical means that reversed the semantic
orientation of a polar term}. These were typically represented by
the negative particle ``nicht'' (\emph{not}) or indefinite pronoun
``keine'' (\emph{no}). The only attribute associated with this tag
was a mandatory link \attribute{polar-term-ref}.
\end{itemize}
In contrast to sentiment-level tags, which had to be assigned to
syntactic or discourse-level units, \markable{polar term}s and their
modifiers were defined as lexemes and, correspondingly, had to mark
only single words or set phrases without their grammatical dependents.
A complete tweet annotated with sentiment- and term-level elements is
shown in Example~\ref{snt:exmp:sent-anno2}. In this case, we again
labeled the whole sentence as \markable{sentiment} because only the
main verb-phrase simultaneously covers both the evaluated target
(``Die Nazi-Vergangenheit'' [\emph{The Nazi history}]) and its
respective polar expression (``nicht sehr r\"uhmlich'' [\emph{not very
laudable}]). The boundaries of \markable{sentiment} and
\markable{target} are determined on the syntactic level, spanning the
whole clause in the former case and including the complete noun phrase
in the latter. The polarity of the opinion is set to \emph{negative}.
The polar term ``r\"uhmlich'' (\emph{laudable}), its intensifier
``sehr'' (\emph{very}), and negation ``nicht'' (\emph{not}), on the
other hand, only mark single words. The polarity of the term, \ie{}
its primary semantic orientation without the context, is
\emph{positive}.
\begin{example}\label{snt:exmp:sent-anno2}
%% \small
\tikzstyle{every picture}+=[remember picture]
\tikzstyle{na} = [shape=rectangle,inner sep=0pt]
\upshape\sentiment{\target{Die Nazi-Vergangenheit} ist
\negation{\tikz\node[na](word0){nicht};}
\intensifier{\tikz\node[na](word1){sehr};}
\emoexpression{\tikz\node[na](word2){r\"uhmlich};}}\\[2.2em]
\noindent\sentiment{\target{\itshape{}The Nazi
history\upshape{}}\itshape{} is
\negation{\tikz\node[na](word3){not};}
\upshape{}\intensifier{\tikz\node[na](word4){very};}
\upshape{}
\emoexpression{\itshape{}\tikz\node[na](word5){laudable};\upshape{}}}
\begin{tikzpicture}[overlay]
\path[->,deeppink4,thick](word0) edge [in=145, out=35] node
[above] {\tiny polar-term-ref} (word2);
\path[->,cyan,thick](word1) edge [in=145, out=30] node
[above] {\tiny polar-term-ref} (word2);
\path[->,deeppink4,thick](word3) edge [in=145, out=35] node
[above] {\tiny polar-term-ref} (word5);
\path[->,cyan,thick](word4) edge [in=145, out=30] node
[above] {\tiny polar-term-ref} (word5);
\end{tikzpicture}
\end{example}
A more detailed description of all annotation elements and their
possible attributes is given in the original annotation guidelines in
Appendix~\ref{chap:apdx:corp-guidelines} of this thesis.
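To make this element inventory more concrete, the schematic Python
sketch below models one \markable{sentiment} markable together with a
linked \markable{target}; the class and field names merely mirror the
tags and attributes described above and are purely illustrative rather
than part of any actual implementation.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Target:
    span: List[str]                      # token ids, e.g. ["word_1", "word_2"]
    preferred: bool = False              # entity favored in a comparison
    anaphref: Optional[str] = None       # antecedent of a pronominal target
    sentiment_ref: Optional[str] = None  # id of the governing sentiment

@dataclass
class Sentiment:
    span: List[str]                      # smallest complete syntactic unit
    polarity: str = "positive"           # positive | negative | comparative
    intensity: str = "medium"            # weak | medium | strong
    sarcasm: bool = False
    targets: List[Target] = field(default_factory=list)
\end{lstlisting}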
\section{Annotation Tool and Format}\label{subsec:snt:tformat}
For annotating the collected data, we used \texttt{MMAX2}, a freely
available text-markup
tool.\footnote{\url{http://mmax2.sourceforge.net/}} Because this
program uses a token-oriented stand-off format, where all annotated
spans are stored in a separate file and only refer to the ids of words
in the original text, we first had to split all corpus messages into
tokens. To this end, we applied a minimally modified version of
Christopher Potts' social media
tokenizer,\footnote{\url{http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py}}
which had been slightly adjusted to the peculiarities of the German
spelling (we allowed for the capitalized form of common nouns, \eg{}
``Freude'' [\emph{joy}], and the period at the end of ordinal numbers,
\eg{} ``7.'' [\emph{7th}]).
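The toy regular expression below illustrates the kind of adjustment we
mean (emoticons, @-mentions, hashtags, and ordinal numbers with a
trailing period are kept as single tokens); it is a minimal sketch and
not the actual modified Potts tokenizer.
\begin{lstlisting}[language=Python]
import re

# Toy tokenizer: emoticons, @-mentions, hashtags, ordinals such as "7.",
# and ordinary (possibly capitalized) word forms remain single tokens.
TOKEN_RE = re.compile(r"""
    [<>]?[:;=8][-o*']?[()\[\]dDpP/\\]   # simple western emoticons
  | [@\#]\w+                            # @-mentions and hashtags
  | \d+\.(?!\d)                         # ordinal numbers, e.g. "7."
  | \w+(?:['-]\w+)*                     # words, incl. capitalized nouns
  | \S                                  # any remaining non-blank character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

# tokenize("Am 7. Maerz :)") -> ['Am', '7.', 'Maerz', ':)']
\end{lstlisting}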
%% \begin{wrapfigure}{R}{0.5\textwidth}
%% \begin{minipage}[t][21.5em]{0.5\textwidth}%
%% \hspace{3em}
%% \scalebox{0.65}{%
%% \begin{minipage}[t][18em]{0.3\textwidth}%
%% \small\vspace{-3em}%
%% \dirtree{%
%% .0 .
%% .1 corpus/.
%% .2 annotator-1/.
%% .2 annotator-2/.
%% .4 1.general.mmax.
%% .4 ....
%% .4 markables/.
%% .5 1.general\_diminisher\_level.xml.
%% .5 1.general\_polar-term\_level.xml.
%% .5 1.general\_intensifier\_level.xml.
%% .5 1.general\_negation\_level.xml.
%% .5 1.general\_sentiment\_level.xml.
%% .5 1.general\_source\_level.xml.
%% .5 1.general\_target\_level.xml.
%% .5 ....
%% .2 basedata/.
%% .3 1.general.words.xml.
%% .3 ....
%% .2 custom/.
%% .3 polar-term\_customization.xml.
%% .3 ....
%% .2 scheme/.
%% .3 polar-term\_scheme.xml.
%% .3 ....
%% .2 source/.
%% .3 1.general.xml.
%% .3 ....
%% .2 style/.
%% .1 docs/.
%% .2 annotation\_guidelines.pdf.
%% .2 ....
%% .1 scripts/.
%% .2 ....
%% }%
%% \end{minipage}
%% }%
%% \vspace{13em}
%% \caption{Directory structure of the sentiment
%% corpus\label{fig:snt:corpus}}%
%% \end{minipage}
%% \end{wrapfigure}
To ease the annotation process and minimize possible data loss, we
split the corpus into 80 smaller project files with \numrange{99}{109}
tweets each. In each such file, we put microblogs pertaining to the
same topic, ensuring an equal proportion of formal groups.
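A minimal sketch of this partitioning step is given below; the chunk
size of 100 tweets and the round-robin interleaving of the three formal
groups are illustrative assumptions rather than the exact procedure of
our conversion scripts.
\begin{lstlisting}[language=Python]
from itertools import zip_longest

def make_project_files(groups, chunk_size=100):
    """Split the three formal groups of one topic into chunks of tweets.

    Round-robin interleaving keeps the proportions of the groups
    roughly equal within every chunk (project file)."""
    interleaved = [tweet
                   for triple in zip_longest(*groups)
                   for tweet in triple if tweet is not None]
    return [interleaved[i:i + chunk_size]
            for i in range(0, len(interleaved), chunk_size)]
\end{lstlisting}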
%% In the last preparation step, we finally created the corresponding
%% scheme and customization settings for the project, which specified
%% what kinds of elements with which attributes were to be annotated by
%% the human coders, and how these elements had to look like.
%% The resulting folder hierarchy of our dataset is shown in Figure
%% \ref{fig:snt:corpus}.
%% As can be seen from this listing, the top-level level structure of our
%% project consists of three main directories:
%% \begin{itemize}
%% \item\texttt{corpus/}, which includes the actual annotation data;
%% \item\texttt{docs/}, in which we placed the annotation guidelines and
%% various supplementary documents, such as annotation tests for new
%% coders;
%% \item and \texttt{scripts/}, which comprises auxiliary scripts for
%% estimating the inter-annotator agreement and aligning corpus
%% annotations with automatically parsed sentences.
%% \end{itemize}
%% The \texttt{corpus/} folder is further subdivided into the
%% subdirectories:
%% \begin{itemize}
%% \item\texttt{annotator-X/}, where X stands for the annotator's id.
%% This directory includes the main project files, which specify the
%% paths to the annotation directory, tokenization data, appearance
%% settings etc.; and the subfolder \texttt{markables/}, which
%% comprises the actual annotations;
%% \item\texttt{basedata/}, which contains files with tokenized messages;
%% \item\texttt{custom/}, which provides customization settings for the
%% annotation elements (\eg{} their back- and foreground colors, font
%% types and size etc.);
%% \item\texttt{scheme/}, which includes the definitions of the
%% annotation markables,\footnote{In the \texttt{MMAX} terminology, an
%% annotation markable is a synonym for an annotation element.}
%% their attributes, and possible attributes' values;
%% \item\texttt{source/}, where we put the original untokenized
%% microblogs;
%% \item and, finally, \texttt{style/}, which is the standard
%% \texttt{MMAX} directory for storing default settings.
%% \end{itemize}
%% Examples of an actual annotation file and the underlying tokenization
%% data are given in Figures \ref{fig:snt:annofile} and
%% \ref{fig:snt:basefile}.
%% \begin{minipage}[t]{\textwidth}
%% \begin{minipage}[t]{0.45\textwidth}
%% \lstset{language=XML}
%% \begin{lstlisting}
%% <?xml version="1.0" encoding="UTF-8"?>
%% <!DOCTYPE markables SYSTEM "markables.dtd">
%% <markables xmlns="www.eml.org/NameSpaces/sentiment">
%% <markable id="markable_70"
%% span="word_592..word_596" sarcasm="false"
%% mmax_level="sentiment" polarity="positive"
%% intensity="strong" />
%% <markable id="markable_132"
%% span="word_1126..word_1139" sarcasm="false"
%% mmax_level="sentiment" polarity="positive"
%% intensity="medium" />
%% <markable id="markable_256"
%% span="word_1056..word_1071" sarcasm="false"
%% mmax_level="sentiment" polarity="negative"
%% intensity="medium" />
%% <markable id="markable_259"
%% span="word_1074..word_1087" sarcasm="false"
%% mmax_level="sentiment" polarity="positive"
%% intensity="medium" />
%% ...
%% </markables>
%% \end{lstlisting}%
%% \captionof{figure}{Example of an annotation file\label{fig:snt:annofile}}%
%% \end{minipage}\hfill%
%% %
%% \begin{minipage}[t]{0.45\textwidth}%
%% \lstset{language=XML}
%% \begin{lstlisting}[basicstyle=\tiny]
%% <?xml version="1.0" encoding="US-ASCII"?>
%% <!DOCTYPE words SYSTEM "words.dtd">
%% <words>
%% <word id="word_1">Gleich</word>
%% <word id="word_2">in</word>
%% <word id="word_3">Braunschweig</word>
%% <word id="word_4">mit</word>
%% <word id="word_5">Kamaraden</word>
%% <word id="word_6">Treffen</word>
%% <word id="word_7">:)</word>
%% <word id="word_8">EOL</word>
%% <word id="word_9">@graulich12</word>
%% <word id="word_10">Das</word>
%% <word id="word_11">geht</word>
%% <word id="word_12">ja</word>
%% <word id="word_13">gar</word>
%% <word id="word_14">nicht</word>
%% <word id="word_15">!</word>
%% ...
%% </words>
%% \end{lstlisting}%
%% \captionof{figure}{Example of tokenized data\label{fig:snt:basefile}}%
%% \end{minipage}
%% \end{minipage}
\section{Inter-Annotator Agreement Metrics}\label{sec:snt:iaa}
For estimating the inter-annotator agreement (IAA), we adopted the
popular $\kappa$ metric \cite{Cohen:60}. Following the standard
practice, we computed this term as:
\begin{equation*}
\kappa = \frac{p_o - p_c}{1 - p_c},
\end{equation*}
where $p_o$ denotes the observed agreement, and $p_c$ stands for the
agreement by chance. We estimated the observed agreement in the
usual way as the ratio of tokens with matching annotations to the
total number of tokens:
\begin{equation*}
p_o = \frac{T - (A_1 - M_1) - (A_2 - M_2)}{T},
\end{equation*}
where $T$ represents the total token count, $A_1$ and $A_2$ are the
numbers of tokens annotated with the given class by the first and
second annotators, respectively, and the $M$ terms denote the numbers of
tokens with matching annotations. As usual, we computed the chance
agreement $p_c$ as:
\begin{equation*}\textstyle
p_c = c_1 \times c_2 + (1 - c_1) \times (1 - c_2),
\end{equation*}
where $c_1$ and $c_2$ are the proportions of tokens annotated with the
given class in the first and second annotations, respectively, \ie{}
$c_1 = \frac{A_1}{T}$ and $c_2 = \frac{A_2}{T}$.
Two questions that arose during this computation, however, were
\begin{inparaenum}[(i)]
\item whether tokens belonging to multiple overlapping annotation
spans of the same class had to be counted several times in one
annotation when computing the $A$ scores (for instance, whether we
had to count the words ``dieses'' [\textit{this}], ``sch\"one''
[\textit{nice}], and ``Buch'' [\textit{book}] in Example
\ref{example:snt:iaa} twice as sentiments when computing $A_1$ and
$A_2$), and
\item whether we had to assume that two annotated spans from
different experts agreed on all of their tokens if these spans had
at least one word in common (\eg{} whether we had to consider the
annotation of the token ``Mein'' [\textit{My}] in the example as
matching, given that the rest of the corresponding
\markable{sentiment}s agreed).
\end{inparaenum}
\begin{example}\label{example:snt:iaa}
\textcolor{red3}{\textbf{Annotation 1:}}\\
\upshape\sentiment{Mein Vater hasst \sentiment{dieses sch\"one Buch}.}\\
\sentiment{\itshape My father hates \upshape\sentiment{\itshape this
nice book\upshape}.}
\noindent\textcolor{darkslateblue}{\textbf{\itshape Annotation 2:}}\\
Mein \sentiment{Vater hasst \sentiment{dieses sch\"one Buch}.}\\
\itshape My \upshape\sentiment{\itshape{}father hates \upshape\sentiment{\itshape this
nice book\upshape}.}
\end{example}
To address these issues, we introduced two different agreement
metrics---\emph{binary} and \emph{proportional} kappa. With the
former variant, we counted tokens belonging to overlapping annotation
spans of the same class multiple times (\ie{} $A_1$ and $A_2$ would
amount to $10$ and $9$, respectively, in the above tweet) and
considered all tokens belonging to the given annotated element as
matching if this span agreed with the annotation from the other expert
on at least one token (\ie{} $M_1$ and $M_2$ would have the same
values as $A_1$ and $A_2$ in this case). With the latter metric,
every labeled token was counted only once (\ie{} the numbers of
labeled words in the first and second annotations would be $7$ and
$6$, respectively), and we only calculated the actual number of tokens
with matching labels when computing the $M$ scores (\ie{} both $M_1$
and $M_2$ would be equal to $6$). The final value of the binary kappa
in Example~\ref{example:snt:iaa} would consequently run up to~1.0
because this metric would consider both annotations as perfectly
matching, since every labeled \markable{sentiment} agreed with the
other annotation on at least one token. The proportional kappa,
however, would be equal to~0.0, since this metric would emphasize the
fact that the observed reliability $p_o$ is the same as the agreement
by chance $p_c$, and would therefore deem both labelings as
fortuitous.
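The short helper below recomputes both variants for
Example~\ref{example:snt:iaa} from the quantities defined above ($T$,
$A_i$, and $M_i$); it is a self-contained illustration (the function
name \texttt{kappa} is ours) and not our actual agreement scripts.
\begin{lstlisting}[language=Python]
def kappa(T, A1, M1, A2, M2):
    """Kappa computed from token counts, following the formulas above."""
    p_o = (T - (A1 - M1) - (A2 - M2)) / T    # observed agreement
    c1, c2 = A1 / T, A2 / T
    p_c = c1 * c2 + (1.0 - c1) * (1.0 - c2)  # chance agreement
    return (p_o - p_c) / (1.0 - p_c)

# Example with T = 7 tokens ("Mein Vater hasst dieses schoene Buch ."):
print(kappa(7, 10, 10, 9, 9))  # binary counting       -> 1.0
print(kappa(7, 7, 6, 6, 6))    # proportional counting -> 0.0
\end{lstlisting}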
\section{Annotation Procedure}\label{sec:astages}
After defining the agreement metrics, we finally let our experts
annotate the data. The annotation procedure was performed in three
steps:
\begin{itemize}
\item At the beginning, both annotators labeled one half of the
corpus after only minimal training. Unfortunately, their mutual
agreement at this stage was relatively low, reaching only 31.21\%
proportional-$\kappa$ for \markable{sentiment}s;
\item In the second step, in order to improve the inter-rater
reliability, we automatically determined all differences between
the two annotations and highlighted non-matching tokens with a
separate class of tags. Then, we let the experts resolve these
discrepancies by either correcting their own decisions or
rejecting the variants of the other coder. As in the previous
stage, we allowed the annotators to consult their supervisor (the
author of this thesis), also updating the FAQ section of the
guidelines based on their questions, but did not let them
communicate with each other directly. This adjudication step
significantly improved all annotations: The agreement on
\markable{sentiment}s increased by 30.73\%, reaching 61.94\%.
Similar effects were observed for \markable{target}s,
\markable{source}s, \markable{polar term}s, and their modifiers;
\item After resolving all differences, our assistants proceeded with
the annotation of the remaining files. Working completely
independently, one of the experts annotated 78.8\% of the
corpus, whereas the second annotator labeled the complete
dataset.
\end{itemize}
\section{Evaluation}\label{sec:eval}
\subsection{Initial Annotation Stage}\label{subsec:eval-initial-stage}
The agreement results of the initial annotation stage are shown in
Table~\ref{tbl:snt:agrmnt-init}.
\begin{table*}[thb!]
\begin{center}
\bgroup \setlength\tabcolsep{0.7\tabcolsep} \scriptsize
\begin{tabular}{p{0.154\textwidth} % first columm
*{10}{>{\centering\arraybackslash}p{0.065\textwidth}}} % next ten columns
\toprule
\multirow{2}{0.2\textwidth}{\bfseries Element} &
\multicolumn{5}{c}{\bfseries Binary $\kappa$} & %
\multicolumn{5}{c}{\bfseries Proportional $\kappa$}\\
\cmidrule(r){2-6}\cmidrule(l){7-11}
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$ %
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$\\\midrule
Sentiment & 4,215 & 7,070 & 3,484 & 9,827 & \textbf{38.05} &
3,269 & 6,812 & 3,269 & 9,796 & \textbf{31.21}\\
Target & 1,103 & 1,943 & 1,217 & 4,162 & \textbf{35.48} &
898 & 1,905 & 898 & 4,148 & \textbf{26.85}\\
Source & 159 & 445 & 156 & 456 & \textbf{34.53} &
153 & 439 & 153 & 456 & \textbf{33.75}\\
Polar Term & 1,951 & 2,854 & 2,029 & 3,188 & \textbf{64.29} &
1,902 & 2,851 & 1,902 & 3,180 & \textbf{61.36}\\
Intensifier & 57 & 101 & 59 & 123 & \textbf{51.71} &
57 & 101 & 57 & 123 & \textbf{50.81}\\
Diminisher & 3 & 10 & 3 & 8 & \textbf{33.32} &
3 & 10 & 3 & 8 & \textbf{33.32}\\
Negation & 21 & 63 & 21 & 83 & \textbf{28.69} &
21 & 63 & 21 & 83 & \textbf{28.69}\\\bottomrule
\end{tabular}
\egroup
\end{center}
\captionof{table}[Inter-annotator agreement after the initial
annotation stage]{Inter-annotator agreement after the initial
annotation stage\\ {\small ($M1$ -- number of tokens with matching
labels in the first annotation, $A1$ -- total number of tokens
labeled with that class in the first annotation, $M2$ -- number
of tokens with matching labels in the second annotation, $A2$ --
total number of tokens labeled with that class in the second
annotation)}}
\label{tbl:snt:agrmnt-init}
\end{table*}
As we can see from the table, the inter-rater reliability of
\markable{sentiment}s strongly correlates with the inter-annotator
agreement on \markable{target}s and \markable{source}s, setting an
upper bound for these elements in the binary-$\kappa$ case. With the
proportional metric, however, both \markable{sentiment}s and
\markable{target}s show worse results than \markable{source}s:
$31.21\%$ and $26.85\%$ versus $33.75\%$. We explain this difference
by the fact that \markable{sentiment}s and \markable{target}s are
typically represented by syntactic or discourse-level constituents
(noun phrases or clauses) and, even though the experts agreed on the
presence of these elements more often (as suggested by the
binary-$\kappa$ metric), reaching a consensus about the exact
boundaries of these elements was still a challenging task for them
despite an explicit clarification of this problem in the annotation
guidelines; \markable{source}s, on the other hand, are usually
expressed by pronouns, which rarely take syntactic modifiers, so
that their boundaries were easier to determine. Nevertheless, even
with the binary metric, the agreement on all sentiment-level elements
remains below the $40\%$ threshold, which means at most a fair
reliability according to the \citeauthor{Landis:77} scale
\cite{Landis:77}.
A different situation is observed for \markable{polar terms} and
\markable{intensifiers}. The inter-annotator agreement on these
elements is above 50\% for both $\kappa$-measures. Obviously,
defining these entities as lexical units has significantly eased the
detection of their boundaries. This effect becomes even more evident
if we look at \markable{diminisher}s and \markable{negation}s, where
the $A$ and $M$ scores are absolutely identical for both metrics. This
means that both annotators always agreed on the boundaries of these
elements when they agreed on their presence. Unfortunately, due to the
rather small number of these tags in the corpus (with only 3 matching
tokens for \markable{diminisher}s and 21 for \markable{negation}s), the
overall agreement on these labels is relatively small too, amounting
to $33.32\%$ and $28.69\%$, respectively.
\subsection{Adjudication Step}\label{subsec:eval-adjudication-step}
Since these scores were unacceptable for running further experiments,
we decided to revise diverging annotations by letting our experts
recheck each other's decisions.
%% To this end, we automatically determined conflicting labelings and
%% highlighted them in the annotated \texttt{MMAX2} files.
%% Afterwards, the coders had to decide whether to ignore the
%% highlighted discrepancies or to change ther own decisions.
\begin{table*}[htb!]
\begin{center}
\bgroup \setlength\tabcolsep{0.7\tabcolsep} \scriptsize
\begin{tabular}{p{0.155\textwidth} % first columm
*{10}{>{\centering\arraybackslash}p{0.065\textwidth}}} % next ten columns
\toprule
\multirow{2}{0.2\textwidth}{\bfseries Element} &
\multicolumn{5}{c}{\bfseries Binary $\kappa$} & %
\multicolumn{5}{c}{\bfseries Proportional $\kappa$}\\
\cmidrule(r){2-6}\cmidrule(l){7-11}
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$ %
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$\\
\midrule
Sentiment & 8,198 & 8,530 & 8,260 & 14,034 & \textbf{67.92} &
7,435 & 8,243 & 7,435 & 13,714 & \textbf{61.94}\\
Target & 3,088 & 3,407 & 2,814 & 5,303 & \textbf{65.66} &
2,554 & 3,326 & 2,554 & 5,212 & \textbf{57.27}\\
Source & 573 & 690 & 545 & 837 & \textbf{72.91} &
539 & 676 & 539 & 833 & \textbf{71.12}\\
Polar Term & 3,164 & 3,298 & 3,261 & 4,134 & \textbf{85.68} &
3,097 & 3,290 & 3,097 & 4,121 & \textbf{82.64}\\
Intensifier & 111 & 219 & 113 & 180 & \textbf{56.01} &
111 & 219 & 111 & 180 & \textbf{55.51}\\
Diminisher & 9 & 16 & 10 & 16 & \textbf{59.37} &
9 & 16 & 9 & 15 & \textbf{58.05}\\
Negation & 68 & 84 & 67 & 140 & \textbf{60.21} &
67 & 83 & 67 & 140 & \textbf{60.03}\\\bottomrule
\end{tabular}
\egroup
\end{center}
\captionof{table}[Inter-annotator agreement after the adjudication
step]{Inter-annotator agreement after the adjudication step\\
{\small ($M1$ -- number of tokens with matching labels in the
first annotation, $A1$ -- total number of labeled tokens in the
first annotation, $M2$ -- number of tokens with matching labels
in the second annotation, $A2$ -- total number of labeled tokens
in the second annotation)}}
\label{tbl:snt:agrmnt-adjud}
\end{table*}
As we can see from the results in Table~\ref{tbl:snt:agrmnt-adjud},
this procedure has significantly improved the inter-rater reliability
of all annotated elements: the binary scores of \markable{sentiment}s
and \markable{target}s increased by $29.87\%$ and $30.18\%$,
respectively. An even greater improvement is observed for
\markable{source}s, whose binary kappa improved by a remarkable
$38.38\%$. A similar tendency applies to the proportional metric,
where the agreement of \markable{sentiment}s gained $30.73\%$,
reaching $61.94\%$. Likewise, the reliability of opinion targets and
holders improved by $30.42\%$ and $37.37\%$, running up to $57.27\%$
and $71.12\%$.
%% In general, however, we can see that the second annotator labeled
%% almost twice as many \markable{sentiment}s and \markable{target}s as
%% the first expert: the proportional $A_1$ scores of these two entity
%% types amount to 8,243 and 3,326 tokens, whereas the corresponding
%% $A_2$ counts run up to 13,714 and 5,212 words.
%% A better consistency in this regard is achieved by sources, where
%% the number of the labeled items in both annotations differs only by
%% a factor of 1.2.
As in the previous step, the highest agreement scores are attained by
\markable{polar term}s, whose reliability notably surpasses the 80\%
benchmark, which means an almost perfect agreement. Interestingly
enough, only 193 of the 3,290 polar-term tokens labeled by the first
expert did not match the annotations of the second one. Another notable
observation is that the difference between the binary and proportional
scores of \markable{polar terms} only amounts to 3.04\%, which implies
that the assistants could reliably determine the boundaries
of these elements in most of the cases.
Somewhat surprisingly, the agreement of \markable{intensifier}s
improved notably less. A closer look at the annotated cases revealed
that the majority of their disagreements stemmed from different
treatments of exclamation marks: the first expert ignored these
punctuation marks, whereas the second annotator considered them valid
intensifying elements. Nevertheless, despite these diverging
interpretations, the reliability of \markable{intensifier}s is above
$55\%$, which corresponds to a moderate level of agreement.
\subsection{Final Annotation Stage}\label{subsec:eval-final-annotation}
After ensuring that our annotators could reach an acceptable quality
of annotation, we eventually let them label the remaining part of the
data. The agreement results of the final stage computed on the files
annotated by both experts are given in
Table~\ref{tbl:snt:agrmnt-final}.
\begin{table*}[thb!]
\begin{center}
\bgroup \setlength\tabcolsep{0.7\tabcolsep} \scriptsize
\begin{tabular}{p{0.155\textwidth} % first columm
*{10}{>{\centering\arraybackslash}p{0.065\textwidth}}} % next ten columns
\toprule
\multirow{2}{0.2\textwidth}{\bfseries Element} &
\multicolumn{5}{c}{\bfseries Binary $\kappa$} & %
\multicolumn{5}{c}{\bfseries Proportional $\kappa$}\\
\cmidrule(r){2-6}\cmidrule(l){7-11}
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$ %
& $M_1$ & $A_1$ & $M_2$ & $A_2$ & $\mathbf{\kappa}$\\
\midrule
Sentiment & 14,748 & 15,929 & 14,969 & 26,047 & \textbf{65.03} &
13,316 & 15,375 & 13,316 & 25,352 & \textbf{58.82}\\
Target & 5,765 & 6,629 & 5,292 & 9,852 & \textbf{64.76} &
4,789 & 6,462 & 4,789 & 9,659 & \textbf{56.61}\\
Source & 966 & 1,207 & 910 & 1,619 & \textbf{65.99} &
898 & 1,180 & 898 & 1,604 & \textbf{64.1}\\
Polar Term & 5,574 & 5,989 & 5,659 & 7,419 & \textbf{82.83} &
5,441 & 5,977 & 5,441 & 7,395 & \textbf{80.29}\\
Intensifier & 192 & 432 & 194 & 338 & \textbf{49.97} & 192 &
432 & 192 & 338 & \textbf{49.71}\\
Diminisher & 16 & 30 & 17 & 34 & \textbf{51.55} & 16 & 30 &
16 & 33 & \textbf{50.78}\\
Negation & 111 & 132 & 110 & 243 & \textbf{58.87} & 110 &
131 & 110 & 242 & \textbf{58.92}\\\bottomrule
\end{tabular}
\egroup
\end{center}
\captionof{table}[Inter-annotator agreement of the final
corpus]{Inter-annotator agreement of the final corpus\\ {\small
($M1$ -- number of tokens with matching labels in the first
annotation, $A1$ -- total number of labeled tokens in the first
annotation, $M2$ -- number of tokens with matching labels in the
second annotation, $A2$ -- total number of labeled tokens in the
second annotation)}}
\label{tbl:snt:agrmnt-final}
\end{table*}
This time, we can observe a slight decrease in the results: the
proportional score for \markable{sentiment}s dropped by $3.12\%$,
whereas the agreement on \markable{target}s was more stable and
lost only $0.66\%$, going down to $56.61\%$. The most dramatic
changes occurred for \markable{source}s, whose proportional value
deteriorated by a notable $7.02\%$, sinking to $64.1\%$. Nonetheless,
the average proportional agreement of all these elements is around
$60.5\%$, which is almost twice as high as the mean reliability
achieved in the first stage.
As before, the scores of \markable{polar term}s remain in the range of
almost perfect agreement. Their modifying elements, however, show a
decrease: the agreement on \markable{intensifier}s deteriorated by
5.8\%, sinking to 49.71\% proportional kappa. A similar situation is
observed for \markable{diminisher}s, whose kappa worsened from
$58.05\%$ to $50.78\%$. The most stable results in this regard are shown
by \markable{negation}s, where the score dropped by only $1.11\%$,
which can be considered a very good result, given the small
number of these elements in the corpus.
In general, we can see that the reliability of all elements in the
final dataset is at least moderate, with \markable{polar term}s being
the most reliably annotated elements ($\kappa_{\textrm{p}}=80.29\%$),
and \markable{intensifier}s setting a lower bound on the agreement
($\kappa_{\textrm{p}}=49.71\%$).
\subsection{Qualitative Analysis}\label{subsec:eval-qualitative-analysis}
In order to understand the reasons for the remaining conflicts, we decided
to have a closer look at the diverging cases. A sample sentence with
different analyses of \markable{sentiment}s is shown in
Example~\ref{snt:exmp:sent-disagr}:
\begin{example}\label{snt:exmp:sent-disagr}
\textcolor{red3}{\textbf{Annotation 1:}}\\ \upshape{}@TinaPannes
immerhin ist die \#afd nicht dabei \smiley{}\\[0.8em]\itshape
\noindent\textcolor{darkslateblue}{\textbf{\itshape Annotation
2:}}\\ \upshape{}@TinaPannes
\sentiment{\textcolor{red}{\target{immerhin ist die \#afd nicht
dabei} \smiley{}}}\\[0.8em]
\noindent\itshape{}@TinaPannes
\upshape\sentiment{\textcolor{red}{\itshape{}\target{anyway the
\#afd is not there} \smiley{}}\upshape{}}
\end{example}
In this tweet, the first annotator obviously overlooked the emoticon
\smiley{} at the end of the message, whereas the second expert
correctly recognized it as an evaluation of the previous sentence.
Because the first assistant did not label any \markable{sentiment} at
all, she also automatically disagreed on the \markable{target} of this
opinion.
%% At this point, we should note that it also was legitimate to consider
%% the noun phrase ``die \#afd'' (\emph{the \#afd}) as the object of the
%% evaluation in this message. We, however, advised the assistants to be
%% as specific as possible when determining the target
%% elements. Therefore, when a sentiment was related to a particular
%% action performed by an agent (which, in this case, was the fact of
%% afd's being somewhere) rather than the agent herself, they better had
%% to label the complete verb phrase and not only its acting subject.
%% With this rule, we hoped to distinguish targets in sentences like
%% ``die Partei hat das Gesetz verabschiedet \smiley{}'' (\emph{the party
%% has adopted this law \smiley{}}) from the objects of evaluations in
%% microblogs like ``die Partei hat das Gesetz abgelehnt \smiley{}''
%% (\emph{the party has rejected this law \smiley{}}), which were clearly
%% describing two completely different events so that labeling similar
%% targets, \eg{} ``die Partei'' (\emph{the party}), in both of these
%% messages would be unequivocally wrong in that case.
A much rarer case of diverging \markable{target} annotations occurred when
both experts actually marked a \markable{sentiment} span. An example
of such a situation is shown in the following message:
\begin{example}\label{snt:exmp:targt-disagr}
\textcolor{red3}{\textbf{Annotation
1:}}\\
\upshape{}\sentiment{Koalition wirft der SPD
\target{\textcolor{red}{Blockadehaltung}} vor}\\[0.5em]
\noindent\itshape{}\sentiment{Coalition accuses the SPD of
\target{\textcolor{red}{blocking politics}}}\\[0.6em]\itshape
\noindent\textcolor{darkslateblue}{\textbf{\itshape Annotation
2:}}\\
\upshape{}\sentiment{Koalition wirft \target{\textcolor{red}{der SPD}}
Blockadehaltung vor}\\[0.5em]
\noindent\itshape{}\sentiment{Coalition accuses
\target{\textcolor{red}{the SPD}} of blocking politics}
\end{example}
In this sentence, the first expert considered \emph{blocking politics}
as the main object of criticism, whereas the second annotator regarded
the political party accused of such behavior as the sentiment's target.
In our opinion, both of these interpretations are correct and,
ideally, two \markable{sentiment}s should have been labeled in this message:
one with the target ``Blockadehaltung'' (\emph{blocking politics}) and
another one with the target ``die SPD'' (\emph{the SPD}).
Although our annotators were much more consistent about the analysis
of \markable{polar term}s, we still decided to have a look at
disagreeing labels of these elements. A sample case of differently
annotated \markable{polar term}s is given in
Example~\ref{snt:exmp:emo-disagr}:
\begin{example}\label{snt:exmp:emo-disagr}
\textcolor{red3}{\textbf{Annotation 1:}}\\ \upshape{}Syrien vor dem
Angriff---bringen diese Bomben den Frieden?\\[0.3em]\itshape
\noindent\itshape{}Syria facing an attack---will these bombs bring
peace?\\