% FILE: sentiment_lexica.tex Version 0.01
% AUTHOR: Uladzimir Sidarenka
% This is a modified version of the file main.tex developed by the
% University Duisburg-Essen, Duisburg, AG Prof. Dr. Günter Törner
% Verena Gondek, Andy Braune, Henning Kerstan Fachbereich Mathematik
% Lotharstr. 65., 47057 Duisburg entstanden im Rahmen des
% DFG-Projektes DissOnlineTutor in Zusammenarbeit mit der
% Humboldt-Universitaet zu Berlin AG Elektronisches Publizieren Joanna
% Rycko und der DNB - Deutsche Nationalbibliothek
\chapter{Sentiment Lexicons}\label{chap:snt:lex}
The first avenue that we are going to explore using the obtained data
is automatic prediction of polar terms.
% To this end, we will first present an updated version of our dataset
% in Subsection~\ref{sec:snt-lex:data} in which our experts revised
% the annotations of words and idioms that were present in the existing
% German sentiment lexicons (GSL), but were not marked as emo-expressions
% in our data and, vice versa, were annotated as polar terms in the
% corpus, but absent in the analyzed polarity lists.
For this purpose, we will first evaluate existing German sentiment
lexicons on our corpus. Since most of these resources, however, were
created semi-automatically by translating English polarity lists and
then manually post-editing and expanding these translations, we will
also examine whether the methods that were used to produce the original
English lexicons would yield comparable results when applied to German data
directly. Finally, we will analyze whether one of the most popular
areas of research in contemporary computational linguistics,
distributed vector representations of words~\cite{Mikolov:13}, can
produce better polarity lists than previous approaches. In the
concluding step, we will investigate the effect of different
hyper-parameters and seed sets on these systems, summarizing and
concluding our findings in the last part of this chapter.
\section{Data}\label{sec:snt-lex:data}
As \emph{development set} for our experiments, we will use \emph{400
randomly selected tweets} annotated by the first expert. As gold
\emph{test set} for evaluating the lexicons, we will utilize the
\emph{complete corpus labeled by the second linguist}. These test
data comprise a total of 6,040 positive and 3,055 negative terms.
However, because many of these expressions represent emoticons, which, on the
one hand, are a priori absent in common lexical taxonomies such as
\textsc{WordNet}~\cite{Miller:95,Miller:07} or
\textsc{GermaNet}~\cite{Hamp:97} and therefore not amenable to methods
that rely on these resources, but on the other hand, can be easily
captured by regular expressions, we decided to exclude non-alphabetic
smileys altogether from our study.  This left us with a set of 3,459
positively and 2,755 negatively labeled terms (1,738 and 1,943 unique
expressions, respectively), whose $\kappa$-agreement ran up to 59\%.
\section{Evaluation Metrics}\label{sec:snt-lex:eval-metrics}
An important question that needs to be addressed before we proceed
with the experiments is which evaluation metrics we should use to
measure the quality of sentiment lexicons. Usually, this quality is
estimated either \textit{intrinsically} (by taking a lexicon in
isolation and immediately assessing its accuracy) or
\textit{extrinsically} (by considering the lexicon within the scope of
a bigger application, \eg{} a supervised classifier that uses
the lexicon's entries as features).
Traditionally, intrinsic evaluation of English polarity lists amounts
to comparing these resources with the General Inquirer
lexicon~\cite[GI; ][]{Stone:66}, a manually compiled list of 11,895
words annotated with their semantic categories. For this purpose,
researchers usually take the intersection of the two sets and estimate
the percentage of matches in which automatically induced polar terms
have the same polarity as corresponding GI entries. This evaluation,
however, is somewhat problematic: First of all, it is not easily
transferable to other languages because even a manual translation of
GI is not guaranteed to cover all language- and domain-specific polar
expressions. Second, since it only considers the intersection of the
two sets, it does not penalize for low recall, so that a polarity list
that consists of just two terms \textit{good}$^+$ and \textit{bad}$^-$
will always have the highest possible score, often surpassing other
lexicons with a greater number of entries.  Finally, such a comparison
does not account for polysemy.  As a result, an ambiguous word, only one
of whose (possibly rare) senses is subjective, will always be ranked
the same as an obviously polar term.
Unfortunately, an extrinsic evaluation does not always provide a
remedy in this case because different extrinsic applications might
yield different results, and a polarity list that performs best with
one system can produce fairly low scores with another application.
Instead of using these methods, we decided to evaluate sentiment
lexicons directly on our corpus by comparing their entries with the
annotated polar terms, since such an approach would allow us to solve at
least three of the aforementioned problems, namely,
\begin{inparaenum}[(i)]
\item it would account for recall,
\item it would distinguish between different senses of polysemous
words,\footnote{The annotators of our corpus were asked to label a
polar term iff the actual sense of this term in the given
context was polar.} and
\item it would preclude intermediate modules that could artificially
improve or worsen the results.
\end{inparaenum}
In particular, in order to evaluate a lexicon on our dataset, we
represent this polarity list as a case-insensitive
trie~\cite[pp. 492--512]{Knuth:98} and compare this trie with the
original and lemmatized\footnote{All lemmatizations in our experiments
were performed using the \textsc{TreeTagger} of \citet{Schmid:95}.}
corpus tokens, successively matching them from left to right. We
consider a match as correct if a lexicon term completely agrees with
the (original or lemmatized) tokens of an annotated \markable{polar
term} and has the same polarity as the labeled element. All corpus
tokens that are not marked as \markable{polar term}s in the corpus are
considered as gold neutral words; similarly, all terms that are absent
from the lexicon, but present in the corpus are assumed to have a
predicted neutral polarity.
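To make this procedure more concrete, the following minimal Python
sketch (a simplified illustration with toy data, not the exact
implementation used in our experiments) shows how a case-insensitive
trie can be matched greedily from left to right against the original
and lemmatized tokens:
\begin{verbatim}
# Minimal sketch of trie-based lexicon matching (toy data; hypothetical
# helper structures, not the exact implementation used in this work).
def build_trie(lexicon):
    """Build a case-insensitive trie from {term: polarity} entries."""
    root = {}
    for term, polarity in lexicon.items():
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node["$pol"] = polarity            # end-of-term marker
    return root

def match(trie, tokens, lemmas):
    """Greedily match tokens/lemmas left to right, longest match first."""
    i = 0
    while i < len(tokens):
        node, end, pol = trie, None, None
        for j in range(i, len(tokens)):
            tok, lem = tokens[j].lower(), lemmas[j].lower()
            if tok in node:
                node = node[tok]
            elif lem in node:
                node = node[lem]
            else:
                break
            if "$pol" in node:             # remember the longest match so far
                end, pol = j + 1, node["$pol"]
        if end is None:
            i += 1                         # no match: predicted neutral token
        else:
            yield (i, end, pol)
            i = end

trie = build_trie({"sehr gut": "positive", "schlecht": "negative"})
tokens = ["Das", "war", "sehr", "gut", "!"]
lemmas = ["das", "sein", "sehr", "gut", "!"]
print(list(match(trie, tokens, lemmas)))   # [(2, 4, 'positive')]
\end{verbatim}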
\section{Semi-Automatic Lexicons}
Using this metric, we first estimated the quality of existing German
polarity lists:
\begin{itemize}
\item\textbf{German Polarity Clues}~\cite[GPC;][]{Waltinger:10}, which
contains 10,141 polar terms from the English sentiment lexicons
Subjectivity Clues~\cite{Wilson:05} and SentiSpin~\cite{Takamura:05}
that were automatically translated into German and then manually
revised by the author. Apart from that, \citeauthor{Waltinger:10}
also manually enriched these translations with their frequent
synonyms and 290 negated phrases;\footnote{In our experiments, we
excluded the auxiliary words ``aus'' (\emph{from}), ``der''
(\emph{the}), ``keine'' (\emph{no}), ``nicht'' (\emph{not}),
``sein'' (\emph{to be}), ``was'' (\emph{what}), and ``wer''
(\emph{who}) with their inflection forms from the German Polarity
Clues, because these entries significantly worsened the evaluation
results.}
\item\textbf{SentiWS}~\cite[SWS;][]{Remus:10}, which includes 1,818
positively and 1,650 negatively connoted terms along with their
part-of-speech tags and inflections, which results in a total of
32,734 word forms. As in the previous case, the authors obtained
the initial entries for their resource by translating an English
polarity list (the General Inquirer lexicon) and then manually
correcting these translations. In addition to this, they expanded
the translated set with words and phrases that frequently
co-occurred with positive and negative seed terms in a corpus of
10,200 customer reviews or in the German Collocation
Dictionary~\cite{Quasthoff:10};\footnote{Unfortunately, the authors
do not provide a breakdown of how many terms were obtained through
translation and how many of them were added during the expansion.}
\item and, finally, the only lexicon that was not obtained through
translation---the \textbf{Zurich Polarity
List}~\cite[ZPL;][]{Clematide:10}, which features 8,000 subjective
entries extracted from \textsc{GermaNet} synsets~\cite{Hamp:97}.
These synsets had been manually annotated by human experts with
their prior polarities. Since the authors, however, found the
number of polar adjectives obtained this way to be insufficient for
their classification experiments, they automatically enriched this
lexicon with more attributive terms, using the collocation method of
\citet{Hatzivassi:97}.
\end{itemize}
% Since all of these lexicons were created semi-automatically by either
% automatically translating English polarity lists and then manually
% revising these translations (\eg{} GPC and SWS) or by manually
% labeling an existing lexical resource and then automatically expanding
% this set (\eg{} ZPL), their results should give us an upper bound on
% the fully automated approaches which we are going to test in the
% remaining parts of this section.
For our evaluation, we tested each of the three lexicons in isolation,
and also evaluated their union and intersection in order to check for
possible ``synergy'' effects. The results of this computation are
shown in Table~\ref{snt-lex:tbl:gsl-res}.
\begin{table}[h]
\begin{center}
\bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
\begin{tabular}{p{0.167\columnwidth} % first columm
*{9}{>{\centering\arraybackslash}p{0.074\columnwidth}} % next nine columns
*{2}{>{\centering\arraybackslash}p{0.068\columnwidth}}} % last two columns
\toprule
\multirow{2}*{\bfseries Lexicon} & %
\multicolumn{3}{c}{\bfseries Positive Expressions} & %
\multicolumn{3}{c}{\bfseries Negative Expressions} & %
\multicolumn{3}{c}{\bfseries Neutral Terms} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
\cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}
& Precision & Recall & \F{} & %
Precision & Recall & \F{} & %
Precision & Recall & \F{} & & \\\midrule
%% \multicolumn{9}{|c|}{\cellcolor{cellcolor}Existing Lexicons}\\\hline
% Class Precision Recall F-score
% positive 0.209155 0.534630 0.300680
% negative 0.194531 0.466468 0.274561
% neutral 0.982806 0.923144 0.952041
% Macro-average 0.462164 0.641414 0.509094
% Micro-average 0.906173 0.906591 0.906382
GPC & 0.209 & 0.535 & 0.301 & %
0.195 & 0.466 & 0.275 & %
0.983 & 0.923 & 0.952 & %
0.509 & 0.906 \\
% Class Precision Recall F-score
% positive 0.335225 0.435308 0.378767
% negative 0.484006 0.343890 0.402091
% neutral 0.976617 0.975014 0.975815
% Macro-average 0.598616 0.584737 0.585557
% Micro-average 0.952082 0.952045 0.952064
SWS & 0.335 & 0.435 & 0.379 & %
0.484 & 0.344 & \textbf{0.402} & %
0.977 & 0.975 & 0.976 & %
0.586 & 0.952\\
% Class Precision Recall F-score
% positive 0.410806 0.423519 0.417066
% negative 0.380378 0.352459 0.365887
% neutral 0.976709 0.978684 0.977696
% Macro-average 0.589298 0.584887 0.586883
% Micro-average 0.954178 0.955459 0.954818
ZPL & 0.411 & 0.424 & 0.417 & %
0.380 & 0.352 & 0.366 & %
0.977 & 0.979 & 0.978 & %
0.587 & 0.955 \\
% Intersection
% Class Precision Recall F-score
% positive 0.527372 0.371942 0.436225
% negative 0.617702 0.244411 0.350240
% neutral 0.973299 0.990414 0.981782
% Macro-average 0.706124 0.535589 0.589416
% Micro-average 0.963883 0.963695 0.963789
GPC $\cap$ SWS $\cap$ ZPL & \textbf{0.527} & 0.372 & \textbf{0.436} & %
\textbf{0.618} & 0.244 & 0.350 & %
0.973 & \textbf{0.990} & \textbf{0.982} & %
\textbf{0.589} & \textbf{0.964} \\
% Union
% Class Precision Recall F-score
% positive 0.201544 0.561745 0.296654
% negative 0.195185 0.531669 0.285543
% neutral 0.984751 0.916952 0.949643
% Macro-average 0.460493 0.670122 0.510613
% Micro-average 0.899292 0.902381 0.900834
GPC $\cup$ SWS $\cup$ ZPL & 0.202 & \textbf{0.562} & 0.297 & %
0.195 & \textbf{0.532} & 0.286 & %
\textbf{0.985} & 0.917 & 0.950 & %
0.510 & 0.901 \\\bottomrule
\end{tabular}
\egroup{}
\caption[Evaluation of semi-automatic German sentiment lexicons]{
Evaluation of semi-automatic German sentiment lexicons\\ {\small
GPC --- German Polarity Clues, SWS --- SentiWS, ZPL --- Zurich
Polarity List}}\label{snt-lex:tbl:gsl-res}
\end{center}
\end{table}
As we can see from the table, the intersection of all three polarity
lists achieves the best results for the positive and neutral classes,
and also attains the highest macro- and micro-averaged \F{}-scores.
One of the main reasons for this success is the relatively high
precision of this set for all polarities except neutral, where the
intersection is outperformed by the union of the three lexicons.  The
latter fact is not surprising either, as the union has the highest
recall of positive and negative terms.  Among all compared lexicons,
the results of the Zurich Polarity List come closest to the scores of
the intersected set: its macro-\F{} is lower by 0.002, and its
micro-average is lower by 0.009.  The second-best lexicon is SentiWS,
which reaches the highest \F-score for the negative class, but has a
lower precision of positive entries. Finally, German Polarity Clues
is the least reliable sentiment resource, which is also mainly due to
the low precision of its polar terms.
\section{Automatic Lexicons}
A natural question that arises upon evaluation of existing
semi-automatic lexicons is how well fully automatic methods can
perform in comparison with these resources. According to
\citet[p. 79]{Liu:12}, most automatic sentiment lexicon generation
(SLG) algorithms can be grouped into two main classes: dictionary- and
corpus-based ones. The former systems induce polarity lists from
monolingual thesauri or lexical databases such as the Macquarie
Dictionary~\cite{Bernard:86} or \textsc{WordNet}~\cite{Miller:95}. A
clear advantage of these methods is their relatively high precision,
as they operate on manually annotated, carefully verified data. At
the same time, this precision might come at the price of reduced
recall, especially in domains whose language changes very rapidly and
where new terms are coined in a flash. In contrast to this,
corpus-based systems operate directly on unlabeled in-domain texts
and, consequently, have access to all neologisms; but the downside of
these approaches is that they often have to deal with extremely noisy
input and might therefore have low accuracy. Since it was unclear to
us which of these pros and cons would have a stronger influence on the
net results, we decided to reimplement the most popular algorithms
from both of these groups and evaluate them on our corpus.
\subsection{Dictionary-Based Methods}
Presumably the first SLG system that inferred a sentiment lexicon from
a lexical database was proposed by \citet{Hu:04}. In their work on
automatic classification of customer reviews, the authors
automatically compiled a list of polar adjectives (which were supposed
to be the most relevant part of speech for mining people's opinions)
by taking a set of seed terms with known semantic orientations and
propagating the polarity scores of these seeds to their
\textsc{WordNet} synonyms. A similar procedure was also applied to
antonyms, but the polarity values were reversed in this case. This
expansion continued until no more adjectives could be reached via
synonymy-antonymy links.
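A minimal Python sketch of this expansion procedure may look as
follows (the functions standing in for the synonym and antonym
lookups, as well as the toy relations, are purely hypothetical):
\begin{verbatim}
# Sketch of seed expansion over synonymy/antonymy links in the spirit of
# Hu and Liu (2004).  The functions synonyms() and antonyms() are
# hypothetical stand-ins for the actual WordNet/GermaNet lookup.
from collections import deque

def expand(seeds, synonyms, antonyms):
    """seeds: {word: +1/-1}; returns the expanded {word: polarity} map."""
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for syn in synonyms(word):
            if syn not in polarity:
                polarity[syn] = polarity[word]    # same orientation
                queue.append(syn)
        for ant in antonyms(word):
            if ant not in polarity:
                polarity[ant] = -polarity[word]   # reversed orientation
                queue.append(ant)
    return polarity

# toy relations for illustration only
SYN = {"gut": ["schoen"], "schlecht": ["uebel"]}
ANT = {"gut": ["schlecht"], "schlecht": ["gut"]}
print(expand({"gut": 1},
             lambda w: SYN.get(w, []),
             lambda w: ANT.get(w, [])))
# {'gut': 1, 'schoen': 1, 'schlecht': -1, 'uebel': -1}
\end{verbatim}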
% Unfortunately, no intrinsic evaluation of the resulting lexicon was
% performed in this work---the authors only report their results on
% recognizing subjective sentences and classifying their polarity, where
% they attain average \F-scores of 0.667 and 0.842 respectively.
\citet{Blair-Goldensohn:08} refined this approach by considering
polarity scores of all \textsc{WordNet} terms as a single vector
$\vec{v}$; the values of all negative seeds in this vector were set to
$-1$, and the scores of all positive seed terms were fixed to $+1$.
To derive their polarity list, the authors multiplied $\vec{v}$ with
an adjacency matrix $A$. Each cell $a_{ij}$ in this matrix was set to
$\lambda=0.2$, if there was a synonymy link between synsets $i$ and
$j$, and to $-\lambda$, if these synsets were antonymous to each
other. By performing this multiplication multiple times and storing
the results of the previous iterations in the $\vec{v}$ vector, the
authors ensured that all polarity scores were propagated transitively
through the network, decaying by a constant factor ($\lambda$) as the
length of the paths starting from the original seeds increased.
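As an illustration, the propagation step can be written in a few lines
of \texttt{numpy} (a schematic sketch over a toy graph, not the exact
configuration of \citeauthor{Blair-Goldensohn:08}):
\begin{verbatim}
# Schematic numpy sketch of score propagation over a signed adjacency
# matrix (toy graph; not the original configuration).
import numpy as np

terms = ["gut", "schoen", "schlecht", "uebel"]
v = np.array([1.0, 0.0, -1.0, 0.0])   # seeds: gut = +1, schlecht = -1

lam = 0.2
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = lam                # synonymy: gut -- schoen
A[2, 3] = A[3, 2] = lam                # synonymy: schlecht -- uebel
A[0, 2] = A[2, 0] = -lam               # antonymy: gut -- schlecht
np.fill_diagonal(A, 1.0)               # keep the scores of previous rounds

for _ in range(5):                     # a fixed number of multiplications
    v = A @ v

print(dict(zip(terms, np.sign(v))))
# gut and schoen end up positive, schlecht and uebel negative
\end{verbatim}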
% This method again was evaluated only extrinsically---the authors
% tested their complete sentiment summarization system, which used the
% sentiment scores for individual words as features for a
% maximum-entropy classifier.
With various modifications, the core idea of \citet{Hu:04} was
adopted by almost all dictionary-based works: For example,
\citet{Kim:04,Kim:06} estimated the probability of word $w$ belonging
to polarity class $c \in \{\textrm{positive, negative, neutral}\}$ as:
\begin{equation*}
P(c|w) = P(c)P(w|c) = P(c)\frac{\sum\limits_{i=1}^{n}count(syn_i,
c)}{count(c)},
\end{equation*}
where $P(c)$ is the prior probability of that class (estimated as the
number of words belonging to class $c$ divided by the total number of
words); $count(syn_i, c)$ denotes the number of times a seed term with
polarity $c$ appeared in a synset of $w$; and $count(c)$ means the
total number of synsets that contain seeds with this polarity. Using
this formula, the authors successively expanded their initial set of
34 adjectives and 44 verbs to a list of 18,192 polar terms.
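A back-of-the-envelope example may help to clarify this formula (all
counts below are invented purely for illustration):
\begin{verbatim}
# Toy illustration of the class probability P(c|w) = P(c) * P(w|c);
# all counts are invented solely for this example.
syn_counts   = {"positive": 3, "negative": 1, "neutral": 0}     # sum of count(syn_i, c)
class_counts = {"positive": 40, "negative": 50, "neutral": 110}  # count(c)
priors       = {"positive": 0.3, "negative": 0.3, "neutral": 0.4}

scores = {c: priors[c] * syn_counts[c] / class_counts[c] for c in priors}
print(max(scores, key=scores.get), scores)
# 'positive' wins: 0.3 * 3/40 = 0.0225 vs. 0.3 * 1/50 = 0.006 vs. 0.0
\end{verbatim}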
% and evaluated it on a manually labeled collection of 462 adjectives
% and 502 verbs taken from the TOEFL test and analyzed by two human
% experts. The reported average accuracy for this method run up to
% 68.48\% for adjectives and 74.28\% for verbs with their recall being
% equal to 93.07\% and 83.27\% respectively. It should, however, be
% noted that \citet{Kim:04} used a lenient metric for their
% computation by considering neutral and positive terms as the same
% class which could significantly boost the results.
% An alternative way of bootstrapping polarities for adjectives was
% proposed by \citet{Kamps:04}. The authors estimated the orientation
% of the given term by computing the difference between the shortest
% path lengths of this word to the prototypic positive and negative
% lexemes---``good'' and ``bad''. For example, the polarity score of
% the adjective ``honest'' was calculated as
% \begin{equation*}
% POL(honest) = \frac{d(\textrm{honest}, \textrm{bad}) - d(\textrm{honest}, \textrm{good})}%
% {d(\textrm{bad}, \textrm{good})} = \frac{6 - 2}{4} = 1,
% \end{equation*}
% where $d(w_1, w_2)$ means the geodesic (shortest-path) distance
% between the words $w_1$ and $w_2$ in the \textsc{WordNet} graph. The
% respective orientation of this term was then correspondingly set to
% \texttt{positive} according to the sign of the obtained
% $POL$-value. \citet{Kamps:04} evaluated the accuracy of their method
% on the General Inquirer lexicon \cite{Stone:66} by comparing the terms
% with non-zero scores to the entries from this resource, getting
% 68.19\% of correct predictions on a set of 349 adjectives.
Another popular dictionary-based resource, \textsc{SentiWordNet}, was
created by \citet{Esuli:06c}, who enriched a small set of positive and
negative seed adjectives with their \textsc{WordNet} synonyms and
antonyms in $k \in \{0, 2, 4, 6\}$ iterations, considering the rest of
the terms as neutral if they did not have a subjective tag in the
General Inquirer lexicon. In each of these $k$ steps, the authors
optimized two ternary classifiers (Rocchio and SVM) that used
tf-idf--vectors of synset glosses as features. Afterwards, they
predicted polarity scores for all \textsc{WordNet} synsets using an
ensemble of all trained classifiers.
% This time, the evaluation was run on both the intersection with the
% GI~lexicon~\cite{Stone:66} and a manually annotated subset of
% \textsc{WordNet} synsets, yielding 66\% accuracy for the former
% metric.\footnote{Note that different publications on
% \textsc{SentiWordNet} report different configuration settings,
% cf. \citet{Esuli:05}, \citet{Esuli:06a}, \citet{Esuli:06b}, and
% \citet{Esuli:06c}. In our experiments, we will rely on the setup
% described in last paper as the most recent description of this
% approach.}
Graph-based SLG algorithms were proposed by \citet{Rao:09}, who
experimented with three different methods:
\begin{itemize}
\item\emph{deterministic min-cut}, in which the authors propagated the
polarity values of seeds to their \textsc{WordNet} synonyms and
hypernyms and then determined a minimum cut between the polarity
clusters using the algorithm of~\citet{Blum:01};
\item since this approach, however, always partitioned the graph in
the same way even if there were multiple possible splits with the
same cost, the authors also proposed a \emph{randomized} version of
this method, in which they randomly perturbed edge weights;
\item finally, they compared both min-cut systems with the \emph{label
propagation algorithm} of~\citet{Zhu:02}, which can be considered as
a probabilistic variant of \citeauthor{Blair-Goldensohn:08}'s
approach (a minimal sketch of this algorithm is given after the
list).
\end{itemize}
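The label-propagation variant can be sketched as follows (a toy graph
with invented links, not the exact setup of \citeauthor{Rao:09}):
\begin{verbatim}
# Minimal sketch of label propagation over a term graph (toy graph;
# not the exact setup of Rao and Ravichandran).
import numpy as np

terms = ["gut", "schoen", "herrlich", "schlecht", "uebel"]
edges = [(0, 1), (1, 2), (3, 4)]       # undirected synonymy links
seeds = {0: 1.0, 3: -1.0}              # gut = +1, schlecht = -1

n = len(terms)
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

scores = np.zeros(n)
for idx, val in seeds.items():
    scores[idx] = val

for _ in range(300):                   # fixed number of iterations
    for i in range(n):
        if i in seeds:                 # seed scores stay clamped
            continue
        if W[i].sum() > 0:             # average over the neighbours
            scores[i] = W[i] @ scores / W[i].sum()

print(dict(zip(terms, np.round(scores, 2))))
# schoen and herrlich converge towards +1, uebel towards -1
\end{verbatim}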
Further notable contributions to dictionary-based methods were made
by~\citet{Mohammad:09}, who compiled an initial set of polar terms by
using antonymous morphological patterns (\eg{} \emph{logical} ---
\emph{illogical}, \emph{honest} --- \emph{dishonest}, \emph{happy} ---
\emph{unhappy}) and then expanded this set with the help of the
Macquarie Thesaurus~\cite{Bernard:86}; \citet{Awadallah:10}, who
adopted a random-walk approach, estimating a word's polarity as the
difference between the average number of steps a random walker had to
make in order to reach a seed term from the positive set and from the
negative set;
and \citet{Dragut:10}, who computed words' polarities using manually
specified inference rules.
% Since almost all of the presented approaches used \textsc{WordNet}---a
% large lexical database with more than 117,000 synsets---and evaluated
% their results in vitro (using the General Inquirer lexicon
% \cite{Stone:66}), it remains unclear how these methods would work for
% languages with smaller lexical resources and whether they would
% perform equally well in vivo (when tested on a real-life corpus).
% Moreover, because General Inquirer is a generic standard-language
% dictionary, it is also not obvious whether the systems that perform
% best on this list would be also applicable to more colloquial domains.
For our experiments, we reimplemented the approaches of~\citet{Hu:04},
\citet{Blair-Goldensohn:08}, \citet{Kim:04,Kim:06}, \citet{Esuli:06c},
\citet{Rao:09}, and \citet{Awadallah:10}, and applied these methods to
\textsc{GermaNet}\footnote{Throughout our experiments, we will use
\textsc{GermaNet} Version 9.}~\cite{Hamp:97}, the German equivalent
of the \textsc{WordNet} taxonomy.
In order to make this comparison fairer, we used the same set of
initial seeds for all tested methods. For this purpose, we translated
the list of 14 polar English adjectives proposed by \citet{Turney:03}
(\emph{good}$^+$, \emph{nice}$^+$, \emph{excellent}$^+$,
\emph{positive}$^+$, \emph{fortunate}$^+$, \emph{correct}$^+$,
\emph{superior}$^+$, \emph{bad}$^-$, \emph{nasty}$^-$,
\emph{poor}$^-$, \emph{negative}$^-$, \emph{unfortunate}$^-$,
\emph{wrong}$^-$, and \emph{inferior}$^-$) into German, getting a
total of 20 terms (10 positive and 10 negative adjectives) due to
multiple possible translations of the same words. Furthermore, to
settle the differences between binary and ternary approaches (\ie{}
methods that only distinguished between positive and negative terms
and systems that could also predict the neutral class), we extended
the translated seeds with 10 neutral adjectives (\emph{neutral}$^0$,
\emph{objective}$^0$, \emph{technical}$^0$, \emph{chemical}$^0$,
\emph{physical}$^0$, \emph{material}$^0$, \emph{bodily}$^0$,
\emph{financial}$^0$, \emph{theoretical}$^0$, and
\emph{practical}$^0$), letting all classifiers work in the ternary
mode.  Finally, since several algorithms treated synonymy relations
differently (\eg{} \citeauthor{Hu:04} only considered two
words as synonyms if they appeared in the same synset, whereas
\citeauthor{Esuli:06c}, \citeauthor{Rao:09}, and
\citeauthor{Awadallah:10} also considered hypernyms and hyponyms as
valid links for polarity propagation), we decided to unify this aspect
as well. To this end, we established an edge between any two terms
that appeared in the same synset, and also linked all words whose
synsets were connected via \texttt{has\_participle},
\texttt{has\_pertainym}, \texttt{has\_hyponym}, \texttt{entails}, or
\texttt{is\_entailed\_by} relations. We intentionally ignored
relations \texttt{has\_hypernym} and \texttt{is\_related\_to}, because
hypernyms were not guaranteed to preserve the polarity of their
children (\eg{} ``bewertungsspezifisch'' [\emph{appraisal-specific}]
is a neutral term in contrast to its immediate hyponyms ``gut''
[\emph{good}] and ``schlecht'' [\emph{bad}]), and
\texttt{is\_related\_to} could connect both synonyms and antonyms of
the same term (\eg{} this relation holds between words ``Form''
[\emph{shape}] and ``unf\"ormig'' [\emph{misshapen}], but at the same
time, it also connects noun ``Dame'' [\emph{lady}] to its derived
adjective ``damenhaft'' [\emph{ladylike}]).
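Schematically, this unified graph can be assembled as follows (a
sketch over a hypothetical, pre-parsed representation of
\textsc{GermaNet} synsets and relations, not the actual
\textsc{GermaNet} API):
\begin{verbatim}
# Sketch of the unified propagation graph (toy data; a hypothetical,
# pre-parsed representation of GermaNet, not the actual GermaNet API).
from itertools import combinations

ALLOWED = {"has_participle", "has_pertainym", "has_hyponym",
           "entails", "is_entailed_by"}

def build_graph(synsets, relations):
    """synsets: {synset_id: [lexical units]};
       relations: [(synset_id_1, relation, synset_id_2)]."""
    edges = set()
    # link all terms that share a synset
    for members in synsets.values():
        for w1, w2 in combinations(sorted(members), 2):
            edges.add((w1, w2))
    # link terms whose synsets are connected via an allowed relation
    for s1, rel, s2 in relations:
        if rel in ALLOWED:
            for w1 in synsets[s1]:
                for w2 in synsets[s2]:
                    edges.add(tuple(sorted((w1, w2))))
    return edges

synsets = {"s1": ["gut", "fein"], "s2": ["Guete"],
           "s3": ["bewertungsspezifisch"]}
relations = [("s1", "has_pertainym", "s2"),   # kept
             ("s3", "has_hypernym", "s1")]    # intentionally ignored
print(build_graph(synsets, relations))
# three edges: fein--gut, Guete--gut, Guete--fein
\end{verbatim}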
We fine-tuned the hyper-parameters of all approaches by using grid
search and optimizing the macro-averaged \F{}-score on the development
set. In particular, instead of waiting for the full convergence of
the eigenvector in the approach of \citet{Blair-Goldensohn:08}, we
constrained the maximum number of multiplications to five. Our
experiments showed that this limitation had a crucial impact on the
quality of the resulting polarity list (\eg{} after five
multiplications, the average precision of its positive terms amounted
to 0.499, reaching an average \F{}-score of 0.26 for this class; after
ten more iterations though, this precision decreased dramatically to
0.043, pulling the \F{}-score down to 0.078). Furthermore, we limited
the maximum number of iterations in the label-propagation method of
\citet{Rao:09} to 300, although the effect of this setting was much
weaker than in the previous case (by comparison, the scores achieved
after 30 runs differed only by a few hundredths from the results
obtained after 300 iterations). Finally, in the method of
\citet{Awadallah:10}, we allowed for seven simultaneous walkers with a
maximum of 17 steps each, considering a word as polar if more
than half of these walkers agreed on the same polarity class.
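The tuning procedure itself is a plain grid search over the
respective hyper-parameter ranges; a generic sketch is given below
(\texttt{induce\_lexicon} and \texttt{macro\_f1} are hypothetical
stand-ins for the method under study and our evaluation routine):
\begin{verbatim}
# Generic grid-search sketch for tuning the hyper-parameters of a lexicon
# induction method on the development set; induce_lexicon() and macro_f1()
# are hypothetical stand-ins for the method under study and our metric.
from itertools import product

def grid_search(induce_lexicon, macro_f1, dev_data, grid):
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = macro_f1(induce_lexicon(**params), dev_data)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# e.g. for the random-walk method: number of walkers and maximum walk length
grid = {"n_walkers": [3, 5, 7, 9], "max_steps": [5, 11, 17, 23]}
\end{verbatim}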
\begin{table}[h]
\begin{center}
\bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
\begin{tabular}{p{0.146\columnwidth} % first columm
>{\centering\arraybackslash}p{0.06\columnwidth} % second columm
*{9}{>{\centering\arraybackslash}p{0.072\columnwidth}} % next nine columns
*{2}{>{\centering\arraybackslash}p{0.058\columnwidth}}} % last two columns
\toprule
\multirow{2}*{\bfseries Lexicon} & %
\multirow{2}{0.06\columnwidth}{\bfseries\centering \# of\newline{} Terms} & %
\multicolumn{3}{c}{\bfseries Positive Expressions} & %
\multicolumn{3}{c}{\bfseries Negative Expressions} & %
\multicolumn{3}{c}{\bfseries Neutral Terms} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
\cmidrule(lr){3-5}\cmidrule(lr){6-8}\cmidrule(lr){9-11}
& & Precision & Recall & \F{} & %
Precision & Recall & \F{} & %
Precision & Recall & \F{} & & \\\midrule
%% \multicolumn{9}{|c|}{\cellcolor{cellcolor}Existing Lexicons}\\\hline
% Class Precision Recall F-score
% positive 0.770601 0.101975 0.180115
% negative 0.567901 0.017139 0.033273
% neutral 0.963176 0.999227 0.980870
% Macro-average 0.767226 0.372780 0.398086
% Micro-average 0.962404 0.962216 0.962310
\textsc{Seed Set} & 20 & \textbf{0.771} & 0.102 & 0.180 & %
0.568 & 0.017 & 0.033 & %
0.963 & \textbf{0.999} & \textbf{0.981} & %
0.398 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.160648 0.266136 0.200355
% negative 0.199554 0.133383 0.159893
% neutral 0.969132 0.960190 0.964640
% Macro-average 0.443111 0.453236 0.441629
% Micro-average 0.930521 0.930387 0.930454
HL & 5,745 & 0.161 & 0.266 & 0.200 & %
0.200 & 0.133 & 0.160 & %
0.969 & 0.960 & 0.965 & %
0.442 & 0.930\\
% Class Precision Recall F-score
% positive 0.502551 0.232243 0.317678
% negative 0.284571 0.092772 0.139927
% neutral 0.967533 0.991262 0.979254
% Macro-average 0.584885 0.438759 0.478953
% Micro-average 0.958888 0.958769 0.958828
BG & 1,895 & 0.503 & 0.232 & \textbf{0.318} & %
0.285 & 0.093 & 0.140 & %
0.968 & 0.991 & 0.979 & %
\textbf{0.479} & 0.959\\
% Class Precision Recall F-score
% positive 0.715608 0.159446 0.260786
% negative 0.269406 0.043964 0.075593
% neutral 0.964973 0.996744 0.980601
% Macro-average 0.649996 0.400051 0.438993
% Micro-average 0.961759 0.961571 0.961665
KH & 356 & 0.716 & 0.159 & 0.261 & %
0.269 & 0.044 & 0.076 & %
0.965 & 0.997 & \textbf{0.981} & %
0.439 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.041632 0.564397 0.077544
% negative 0.033042 0.255216 0.058510
% neutral 0.981022 0.689113 0.809557
% Macro-average 0.351899 0.502909 0.315204
% Micro-average 0.612283 0.678788 0.643823
ES & 39,181 & 0.042 & \textbf{0.564} & 0.078 & %
0.033 & \textbf{0.255} & 0.059 & %
\textbf{0.981} & 0.689 & 0.810 & %
0.315 & 0.644\\
% Class Precision Recall F-score
% positive 0.070618 0.422045 0.120992
% negative 0.215708 0.072653 0.108696
% neutral 0.972028 0.873448 0.920105
% Macro-average 0.419451 0.456049 0.383264
% Micro-average 0.848630 0.849470 0.849050
RR$_{\textrm{mincut}}$ & 8,060 & 0.070 & 0.422 & 0.120 & %
0.216 & 0.073 & 0.109 & %
0.972 & 0.873 & 0.920 & %
0.383 & 0.849\\
% Class Precision Recall F-score
% positive 0.566825 0.176245 0.268885
% negative 0.571429 0.046200 0.085488
% neutral 0.965423 0.996716 0.980820
% Macro-average 0.701225 0.406387 0.445064
% Micro-average 0.962125 0.961956 0.962040
RR$_{\textrm{lbl-prop}}$ & 1,105 & 0.567 & 0.176 & 0.269 & %
\textbf{0.571} & 0.046 & 0.085 & %
0.965 & 0.997 & \textbf{0.981} & %
0.445 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.768182 0.099617 0.176363
% negative 0.567901 0.017139 0.033273
% neutral 0.963126 0.999233 0.980847
% Macro-average 0.766403 0.371996 0.396828
% Micro-average 0.962358 0.962170 0.962264
AR & 23 & 0.768 & 0.100 & 0.176 & %
0.568 & 0.017 & 0.033 & %
0.963 & \textbf{0.999} & \textbf{0.981} & %
0.397 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.600858 0.165046 0.258960
% negative 0.567442 0.045455 0.084167
% neutral 0.965096 0.997212 0.980891
% Macro-average 0.711132 0.402571 0.441339
% Micro-average 0.962327 0.962170 0.962249
HL $\cap$ BG $\cap$ RR$_{\textrm{lbl}}$ & 752 & 0.601 & 0.165 & 0.259 & %
0.567 & 0.045 & 0.084 & %
0.965 & 0.997 & \textbf{0.981} & %
0.441 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.165676 0.287651 0.210254
% negative 0.191198 0.145678 0.165363
% neutral 0.969910 0.957599 0.963716
% Macro-average 0.442262 0.463643 0.446444
% Micro-average 0.928663 0.928590 0.928626
HL $\cup$ BG $\cup$ RR$_{\textrm{lbl}}$ & 6,258 & 0.166 & 0.288 & 0.210 & %
0.191 & 0.146 & \textbf{0.165} & %
0.970 & 0.958 & 0.964 & %
0.446 & 0.929\\\bottomrule
\end{tabular}
\egroup{}
\caption[Results of dictionary-based approaches]{Results of
dictionary-based approaches\\ {\small HL --- \citet{Hu:04}, BG
--- \citet{Blair-Goldensohn:08}, KH --- \citet{Kim:04}, ES ---
\citet{Esuli:06c}, RR --- \citet{Rao:09}, AR ---
\citet{Awadallah:10}}}\label{snt-lex:tbl:lex-res}
\end{center}
\end{table}
As we can see from the results in Table~\ref{snt-lex:tbl:lex-res}, the
scores of all automatic systems are significantly lower than the
values achieved by semi-automatic lexicons. The best macro-averaged
\F{}-result for all three classes (0.479) is attained by the method of
\citet{Blair-Goldensohn:08}, which is still 0.11 points below the
highest score obtained by the intersection of GPC, SentiWS, and the
Zurich Polarity List. Moreover, in general, the situation with
dictionary-based lexicons is more complicated than in the case of
manually curated polarity lists, as each system achieves the best
score on only one or two metrics, but fails to convincingly outperform
its competitors on several (let alone all) of them.  Nevertheless, we
can still notice at least the following main trends:
\begin{itemize}
\item the method of \citet{Esuli:06c} achieves the highest recall
of positive and negative terms, but these entries have a very low
precision;
\item five approaches simultaneously attain the same best \F{}-results
for the neutral class, which, in turn, leads to the best
micro-averaged \F{}-scores for these systems;
\item and, finally, the solution of \citet{Blair-Goldensohn:08}
achieves the highest macro-averaged \F{} despite a rather low recall
of negative expressions.
\end{itemize}
% Seed Sets:
% Hu-Liu were using 30 adjectives, but they only provided some
% examples: great, fantastic, nice, cool, bad, and dull
% Blair-Goldensohn: do not report (In our experiments, the original
% seed set contained 20 negative and 47 positive words that were
% selected by hand to maximize domain coverage, as well as 293 neutral
% words that largely consist of stop words.)
% Kim-Hovy (2004): To start the seed lists we selected verbs (23
% positive and 21 negative) and adjectives (15 positive and 19
% negative), adding nouns later. But they, again, do not report
% specific examples.
% Kim-Hovy (2006): We described a word classification system to de-
% tect opinion-bearing words in Section 2.1. To ex- amine its
% effectiveness, we annotated 2011 verbs and 1860 adjectives, which
% served as a gold stan- dard 7 . These words were randomly selected
% from a collection of 8011 English verbs and 19748 English
% adjectives. We use training data as seed words for the WordNet
% expansion part of our algorithm.
% Esuli/Sebastiani: Lp and Ln are two small sets, which we have
% defined by manually selecting the intended synsets4 for 14
% "paradigmatic" Positive and Negative terms (\eg{} the Positive term
% nice, the Negative term nasty) which were used as seed terms in
% (Turney and Littman, 2003). The Lo set is treated differently from
% Lp and Ln, because of the inherently "complementary" nature of the
% Objective category (an Objective term can be defined as a term that
% does not have either Positive or Negative characteristics). We have
% heuristically defined Lo as the set of synsets that (a) do not
% belong to either T rK p or T rK n , and (b) contain terms not marked
% as either Positive or Negative in the General Inquirer lexicon
% (Stone et al., 1966); this lexicon was chosen since it is, to our
% knowledge, the largest manually annotated lexicon in which terms are
% tagged according to the Positive or Negative categories.
% Rao-Ravichandran: All experiments reported in Sections 4.1 to 4.5
% use the data described above with a 50-50 split so that the first half
% is used as seeds and the sec- ond half is used for test.
% Awdallah: After (Turney, 2002), we use our method to predict
% semantic orientation of words in the General Inquirer lexicon (Stone
% et al., 1966) using only 14 seed words.
% seed sets: (Turney and Littman, 2003); SentiWS (Remus, 2010)
\subsection{Corpus-Based Methods}\label{subsec:snt-lex:corpus-based}
An alternative way of generating polarity lists is provided by
corpus-based approaches. In contrast to dictionary-based methods,
these systems operate directly on raw texts and are therefore
virtually independent of any manually annotated resources.
A pioneering work on these algorithms was done by
\citet{Hatzivassi:97}. Assuming that coordinately conjoined
attributes would typically have the same semantic orientation, these
authors trained a supervised logistic classifier that predicted the
degree of dissimilarity between two co-occurring adjectives.
Afterwards, they constructed a word collocation graph, drawing a link
between any two adjectives that appeared in the same coordinate pair,
and using the predicted dissimilarity score between these words as the
respective edge weight. In the final stage,
\citet{Hatzivassi:97} partitioned this graph into two clusters
and assigned the positive label to the bigger part.
% This method achieved an overall accuracy of 82.05\%
% on predicting the polarity of a subset of manually annotated
% adjectives when trained on the rest of these hand-labeled data.
An attempt to unite dictionary- and corpus-based methods was made by
\citet{Takamura:05}, who adopted the Ising spin model from statistical
mechanics, considering words found in \textsc{WordNet}, the Wall
Street Journal, and the Brown corpus as electrons in a ferromagnetic
lattice. The authors established a link between any two electrons
whose terms appeared in the same \textsc{WordNet} synset or
coordinately conjoined pair in the corpora. In the final step, they
approximated the most probable orientation of all spins in this graph,
considering these orientations as polarity scores of the respective
terms.
% reaching 91.5\% accuracy at predicting polarity of the manually
% labeled subjective terms from the General Inquirer lexicon
% \cite{Stone:66}.
Another way of creating a sentiment lexicon was proposed
by~\citet{Turney:03}, who induced a list of polar terms by computing
the difference between their point-wise mutual information (PMI) with
the positive and negative seeds. In particular, the authors estimated
the polarity score of word $w$ as:
\begin{equation*}
\textrm{SO-A}(w) = \sum_{w_p\in\mathcal{P}}PMI(w, w_p) - \sum_{w_n\in\mathcal{N}}PMI(w, w_n),
\end{equation*}
where $\mathcal{P}$ represents the set of all positive seeds;
$\mathcal{N}$ denotes the collection of known negative words; and
$PMI$ is computed as a log-ratio $PMI(w, w_x) = \log_2\frac{p(w,
w_x)}{p(w)p(w_x)}$. The joint probability $p(w, w_x)$ in the last
term was calculated as the number of hits returned by the AltaVista
search engine for the query ``$w\textrm{ NEAR }w_x$'' divided by the
total number of documents in the search index.
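The following sketch shows the same computation with co-occurrence
counts from a local corpus instead of search-engine hits (a simplified
illustration with toy counts, not \citeauthor{Turney:03}'s original
setup):
\begin{verbatim}
# SO-A sketch with local co-occurrence counts instead of search-engine
# hits (toy counts; add-one smoothing avoids log(0)).
import math

N = 1_000_000                      # total number of contexts
count = {"super": 5_000, "gut": 20_000, "schlecht": 15_000}
cooc = {("super", "gut"): 900, ("super", "schlecht"): 40}

def pmi(w, s):
    p_joint = (cooc.get((w, s), 0) + 1) / N
    return math.log2(p_joint / ((count[w] / N) * (count[s] / N)))

def so_a(w, pos_seeds, neg_seeds):
    return (sum(pmi(w, p) for p in pos_seeds)
            - sum(pmi(w, n) for n in neg_seeds))

print(so_a("super", ["gut"], ["schlecht"]))   # > 0: 'super' is positive
\end{verbatim}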
This method was later successfully adapted to Twitter by
\citet{Kiritchenko:14}, who harnessed the corpus of \citet{Go:09} and
an additional set of 775,000 tweets to create two sentiment lexicons,
Sentiment140 and Hashtag Sentiment Base, using frequent emoticons as
seeds for the first lexicon and taking common emotional hashtags such
as ``\#joy'', ``\#excitement'', and ``\#fear'' as seed terms for the
second list.
Another Twitter-specific approach, which also relied on the corpus of
\citet{Go:09}, was presented by \citet{Severyn:15a}. To derive their
lexicon, the authors trained an SVM classifier that used token n-grams
as features and then included n-grams with the greatest learned
feature weights into their final polarity list.
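In essence, this procedure boils down to fitting a linear model on
weakly labeled messages and reading off the most strongly weighted
n-grams, as in the following \texttt{scikit-learn} sketch (toy data,
not the original configuration of \citeauthor{Severyn:15a}):
\begin{verbatim}
# Sketch: derive a polarity list from the weights of a linear classifier
# trained on weakly labeled tweets (toy data, scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

tweets = ["tolles spiel :)", "super tag :)", "schlechter tag :(",
          "furchtbares spiel :(", "toller film :)", "furchtbarer film :("]
labels = [1, 1, 0, 0, 1, 0]            # 1 = positive, 0 = negative

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(tweets)
clf = LinearSVC(C=1.0).fit(X, labels)

ranked = sorted(zip(clf.coef_[0], vec.get_feature_names_out()),
                reverse=True)
print("most positive n-grams:", [f for _, f in ranked[:3]])
print("most negative n-grams:", [f for _, f in ranked[-3:]])
\end{verbatim}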
Graphical methods for corpus-based SLG were advocated by
\citet{Velikovich:10} and \citet{Feng:11}. The former work adapted
the label-propagation algorithm of \citet{Rao:09} by replacing the
average of all incident scores for a potential subjective term with
their maximum value. The latter approach induced a sentiment lexicon
using two popular techniques from information retrieval,
PageRank~\cite{Brin:98} and HITS~\cite{Kleinberg:99}.
For our experiments, we reimplemented the approaches
of~\citet{Takamura:05}, \citet{Velikovich:10}, \citet{Kiritchenko:14}
and~\citet{Severyn:15}, and applied these methods to the German
Twitter Snapshot~\cite{Scheffler:14}, a collection of 24~M German
microblogs, which we previously used for sampling one part of our
sentiment corpus.
We normalized all messages of this snapshot with the rule-based
normalization pipeline of~\citet{Sidarenka:13}, which will be
described in more detail in the next chapter, and lemmatized all
tokens with the \textsc{TreeTagger} of~\citet{Schmid:95}. Afterwards,
we constructed a collocation graph from all normalized lemmas that
appeared at least four times in the snapshot. For the method of
\citet{Takamura:05}, we additionally used \textsc{GermaNet} in order
to add more links between electrons. As in the previous experiments,
all hyper-parameters (including the size of the lexicons) were
fine-tuned on the development set by maximizing the macro-averaged
\F{}-score on these data.
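The construction of the collocation graph can be sketched as follows
(toy data; simple co-occurrence within a message serves here as a
stand-in for the actual collocation criterion):
\begin{verbatim}
# Sketch: collocation graph over frequent lemmas (toy data; co-occurrence
# within a message stands in for the actual collocation criterion).
from collections import Counter
from itertools import combinations

tweets = [["heute", "super", "tag"], ["super", "spiel", "heute"],
          ["schlechter", "tag", "heute"], ["heute", "super", "stimmung"],
          ["super", "tag"], ["tag", "heute"]]

MIN_FREQ = 4
freq = Counter(lemma for tweet in tweets for lemma in tweet)
vocab = {w for w, c in freq.items() if c >= MIN_FREQ}

edges = Counter()
for tweet in tweets:
    kept = sorted(set(tweet) & vocab)
    for w1, w2 in combinations(kept, 2):
        edges[(w1, w2)] += 1

print(sorted(vocab))   # ['heute', 'super', 'tag']
print(edges)           # (heute, super): 3, (heute, tag): 3, (super, tag): 2
\end{verbatim}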
The results of this evaluation are presented in
Table~\ref{snt-lex:tbl:corp-meth}.
% \cite{Lau:11} \citet{Bross:13} \citet{Tai:13} \citet{Yang:14}
% \citet{Bravo-Marquez:15}
% \begin{figure}[hbtp!]
% {
% \centering
% \includegraphics[width=\linewidth]{img/ising-energy-magnetization.png}
% }
% \caption{Energy (E) and magnetization (M) if the Ising spin model with
% respect to the hyper-parameter $\beta$.}\label{snt:fig:ising-spin-em}
% \end{figure}
\begin{table}[h]
\begin{center}
\bgroup\setlength\tabcolsep{0.1\tabcolsep}\scriptsize
\begin{tabular}{p{0.167\columnwidth} % first columm
>{\centering\arraybackslash}p{0.057\columnwidth} % second columm
*{9}{>{\centering\arraybackslash}p{0.07\columnwidth}} % next nine columns
*{2}{>{\centering\arraybackslash}p{0.055\columnwidth}}} % last two columns
\toprule
\multirow{2}*{\bfseries Lexicon} & %
\multirow{2}{0.06\columnwidth}{\bfseries \# of Terms} & %
\multicolumn{3}{c}{\bfseries Positive Expressions} & %
\multicolumn{3}{c}{\bfseries Negative Expressions} & %
\multicolumn{3}{c}{\bfseries Neutral Terms} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Macro\newline \F{}} & %
\multirow{2}{0.068\columnwidth}{\bfseries\centering Micro\newline \F{}}\\
\cmidrule(lr){3-5}\cmidrule(lr){6-8}\cmidrule(lr){9-11}
& & Precision & Recall & \F{} & %
Precision & Recall & \F{} & %
Precision & Recall & \F{} & & \\\midrule
%% \multicolumn{9}{|c|}{\cellcolor{cellcolor}Existing Lexicons}\\\hline
% Class Precision Recall F-score
% positive 0.770601 0.101975 0.180115
% negative 0.567901 0.017139 0.033273
% neutral 0.963176 0.999227 0.980870
% Macro-average 0.767226 0.372780 0.398086
% Micro-average 0.962404 0.962216 0.962310
\textsc{Seed Set} & 20 & \textbf{0.771} & 0.102 & 0.180 & %
\textbf{0.568} & 0.017 & 0.033 & %
0.963 & \textbf{0.999} & \textbf{0.981} & %
0.398 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.646220 0.133510 0.221299
% negative 0.565217 0.029061 0.055280
% neutral 0.964071 0.998134 0.980807
% Macro-average 0.725169 0.386902 0.419129
% Micro-average 0.962261 0.962072 0.962167
TKM & 920 & 0.646 & \textbf{0.134} & \textbf{0.221} & %
0.565 & \textbf{0.029} & \textbf{0.055} & %
\textbf{0.964} & 0.998 & \textbf{0.981} & %
\textbf{0.419} & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.764317 0.102269 0.180400
% negative 0.567901 0.017139 0.033273
% neutral 0.963181 0.999199 0.980860
% Macro-average 0.765133 0.372869 0.398178
% Micro-average 0.962384 0.962196 0.962290
VEL & 60 & 0.764 & 0.102 & 0.180 & %
\textbf{0.568} & 0.017 & 0.033 & %
0.963 & 0.999 & 0.980 & %
0.398 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.386437 0.105806 0.166127
% negative 0.567901 0.017139 0.033273
% neutral 0.963178 0.996092 0.979359
% Macro-average 0.639172 0.373012 0.392920
% Micro-average 0.959478 0.959290 0.959384
KIR & 320 & 0.386 & 0.106 & 0.166 & %
\textbf{0.568} & 0.017 & 0.033 & %
0.963 & 0.996 & 0.979 & %
0.393 & 0.959\\
% Class Precision Recall F-score
% positive 0.679764 0.101975 0.177345
% negative 0.567901 0.017139 0.033273
% neutral 0.963162 0.998820 0.980667
% Macro-average 0.736942 0.372644 0.397095
% Micro-average 0.962013 0.961825 0.961919
SEV & 60 & 0.680 & 0.102 & 0.177 & %
\textbf{0.568} & 0.017 & 0.033 & %
0.963 & \textbf{0.999} & \textbf{0.981} & %
0.397 & \textbf{0.962}\\
TKM $\cap$ VEL $\cap$ SEV & 20 & \textbf{0.771} & 0.102 & 0.180 & %
\textbf{0.568} & 0.017 & 0.033 & %
0.963 & \textbf{0.999} & \textbf{0.981} & %
0.398 & \textbf{0.962}\\
% Class Precision Recall F-score
% positive 0.592689 0.133805 0.218322
% negative 0.565217 0.029061 0.055280
% neutral 0.964063 0.997700 0.980593
% Macro-average 0.707323 0.386855 0.418065
% Micro-average 0.961850 0.961662 0.961756
TKM $\cup$ VEL $\cup$ SEV & 1,020 & 0.593 & \textbf{0.134} & 0.218 & %
0.565 & \textbf{0.029} & \textbf{0.055} & %
\textbf{0.964} & 0.998 & 0.980 & %
0.418 & \textbf{0.962}\\\bottomrule
\end{tabular}
\egroup{}
\caption[Results of corpus-based approaches]{Results of
corpus-based approaches\\ {\small TKM --- \citet{Takamura:05},
VEL --- \citet{Velikovich:10}, KIR --- \citet{Kiritchenko:14},
SEV --- \citet{Severyn:15}}}\label{snt-lex:tbl:corp-meth}
\end{center}
\end{table}
This time, we can observe a clear superiority of the system
of~\citet{Takamura:05}, which not only achieves the best recall and
\F{} for the positive and negative classes but also yields the highest
micro- and macro-averaged results for all three polarities.
% As expected, the best precision in recognizing polar terms is achieved
% by the manually compiled seed set, whose micro-averaged \F{}-result,
% however, is still identical to the one shown by the method of
% \citet{Takamura:05}.
The sizes and the scores of the other lexicons, however, are much
smaller than those of \citeauthor{Takamura:05}'s polarity
list.  Moreover, these lexicons
can hardly outperform the original seed set on the negative class.
Because the last result was somewhat unexpected, we decided to
investigate the reasons for the problems of these systems.  A
closer look at their learning curves revealed that the macro-averaged
\F{}-values on the development data decreased rapidly from the very
beginning of the induction process.  Since we treated the lexicon size
as one of the hyper-parameters, we stopped populating these lists
early.  As a consequence, only a few of the highest-ranked terms (all
of which were positive) were included in the final resource.  As it
turned out, the main reason for this degradation was the ambiguity of
the seed terms:
While adapting the original seed list of~\citet{Turney:03} to German,
we translated the English word ``correct'' as ``richtig.'' This
German word, however, also has another reading---\emph{real} (as in
``ein richtiges Spiel'' [\emph{a real game}] or ``ein richtiger
Rennwagen'' [\emph{a real sports car}]), which was much more frequent
in the analyzed snapshot and typically appeared in a negative context,
\eg{} ``ein richtiger Bombenanschlag'' (\emph{a real bomb attack}) or
``ein richtiger Terrorist'' (\emph{a real terrorist}). As a
consequence, methods that relied on weak supervision had to deal with
extremely unbalanced training data (716,210 positive instances versus
92,592 negative ones) and got stuck in a local optimum from the very
beginning of their training.
\subsection{NWE-Based Methods}\label{subsec:snt:lex:nwe}
Finally, the last group of methods that we are going to explore in
this chapter comprises algorithms that operate on distributed vector
representations of words (neural word embeddings [NWEs]). First
introduced by~\citet{Bengio:03} and significantly improved by
\citet{Collobert:11} and \citet{Mikolov:13}, NWEs had a great
``tsunami''-like effect on many downstream NLP
applications~\cite{Manning:15}. Unfortunately, these advances have
largely bypassed the generation of sentiment lexicons, apart from a few
exceptions such as the works of~\citet{Tang:14a} and
\citet{Vo:16}. In the former approach, the authors used a large
collection of weakly labeled tweets in order to learn hybrid word
embeddings. In contrast to standard word2vec
vectors~\cite{Mikolov:13} and purely task-specific
representations~\cite{Collobert:11}, such embeddings were optimized