# Text Analysis
```{r setup, include = FALSE}
library(tidyverse)
library(tidytext)
beer_data <- read_csv("data/ba_2002.csv")
```
At this point we've gone over the fundamentals of data wrangling and analysis in R. We really haven't touched on statistics and modeling, as that is outside the scope of this workshop, but I will be glad to chat about those with anyone outside of the main workshop if you have questions.
Now we're going to turn to our key focus: text analysis in R. We are going to focus on 3 broad areas:
1. Importing, cleaning, and wrangling text data and dealing with `character` type data.
2. Basic text modeling using the TF-IDF framework
3. Application: Sentiment Analysis
It's not a bad idea to restart your R session here. Make sure to save your work, but a clean `Environment` is great when we're shifting topics.
You can accomplish this by going to `Session > Restart R` in the menu.
Then, we want to make sure to re-load our packages and import our data.
```{r making sure that we have loaded all packages and data}
# The packages we're using
library(tidyverse)
library(tidytext)
# The dataset
beer_data <- read_csv("data/ba_2002.csv")
```
## Text Analysis Pt. I: Getting your text data
There are many sources of text data, which is one of the strengths of this approach. Some quick and obvious examples include:
1. Open-comment responses on a survey (sensory study in lab, remote survey, etc)
2. Comment cards or customer responses
3. Qualitative data (transcripts from interviews and focus groups)
4. Compiled/database from previous studies
5. Scraped data from websites and/or social media
The last (#5) is often the most interesting data source, and there are a number of papers in recent years that have discussed the use of this sort of data. I am not going to discuss how to obtain text data in this presentation, because it is its own, rich topic. However, you can [review my presentation and tutorial from Eurosense 2020](https://github.com/jlahne/text-mining-eurosense-2020) if you want a quick crash course on web scraping for sensory evaluation.
### Basic import and cleaning with text data
So far we've focused our attention on the various (`numeric`) rating variables in the `beer_data` data set. But the data in the `reviews` column is clearly much richer.
```{r getting a viewable example}
set.seed(2)
reviews_example <-
beer_data %>%
slice_sample(n = 5)
reviews_example %>%
select(beer_name, reviews) %>%
gt::gt()
```
<p>
However, this has some classic features of scraped text data--notice the `\"`, for example, which is a leftover placeholder for a quotation mark from the `HTML` source. Other common problems you will see with text data include "parsing errors" between different character encodings--most commonly, mismatches between the `ASCII` and `UTF-8` standards. You will of course also encounter typos, and perhaps idiosyncratic formatting characteristic of different text sources: for example, the use of `@usernames` and `#hashtags` in Twitter data.
We will use the `textclean` package to do some basic clean up, but this kind of deep data cleaning may take more steps and will change from project to project.
```{r cleaning our example data}
library(textclean)
cleaned_reviews_example <-
reviews_example %>%
mutate(cleaned_review = reviews %>%
# Fix some of the HTML placeholders
replace_html(symbol = FALSE) %>%
# Replace sequences of "..."
replace_incomplete(replacement = " ") %>%
# Replace less-common or misformatted HTML
str_replace_all("#.{1,4};", " ") %>%
# Replace some common non-A-Z, 1-9 symbols
replace_symbol() %>%
# Remove non-word text like ":)"
replace_emoticon() %>%
# Remove numbers from the reviews, as they are not useful
str_remove_all("[0-9]"))
cleaned_reviews_example %>%
select(beer_name, cleaned_review) %>%
gt::gt()
```
<p>
This will take slightly longer to run on the full data set, but should improve our overall data quality.
```{r cleaning the full data set}
beer_data <-
beer_data %>%
mutate(cleaned_review = reviews %>%
replace_html(symbol = FALSE) %>%
replace_incomplete(replacement = " ") %>%
str_replace_all("#.{1,4};", " ") %>%
replace_symbol() %>%
replace_emoticon() %>%
str_remove_all("[0-9]"))
```
Again, this is not an exhaustive cleaning step--rather, these are some basic steps I took based on common parsing errors I observed. Each text dataset will come with its own challenges and will require some time to develop an appropriate cleaning pipeline.
### Where is the data in `character`? A *tidy* approach.
As humans who understand English, we can see that meaning is easily found in these reviews. For example, we can guess that the rating of `r cleaned_reviews_example$beer_name[2]` will be a low number (negative), while the rating of `r cleaned_reviews_example$beer_name[3]` will be much more positive. And, indeed, this is the case:
```{r the numeric part of our data}
cleaned_reviews_example %>%
select(beer_name, appearance:rating)
```
But what part of the structure of the `reviews` text actually tells us this? This is a complicated topic that is well beyond the scope of this workshop--we are going to propose a single approach based on **tokenization** that is effective, but is certainly not exhaustive.
If you are interested in a broader view of approaches to text analysis, I recommend the seminal textbook from [Jurafsky and Martin, *Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/), especially chapters 2-6. The draft version is freely available on the web as of the date of this workshop.
In the first half of this workshop we reviewed the `tidyverse` approach to data, which emphasizes that:
> * Each variable is a column
> * Each observation is a row
> * Each type of observational unit is a table
This type of approach can be applied to text data **if we can specify a way to standardize the "observations" within the text**. We will be applying and demonstrating the approach defined and promoted by [Silge and Robinson in *Text Mining with R*](https://www.tidytextmining.com/index.html), in which we will focus on the following syllogism to create tidy text data:
<center>
`observation == token`
</center>
A **token** is a *meaningful* unit of text, usually but not always a single word, which will be the observation for us. In our example data, let's identify some possible tokens:
```{r what does text data look like?}
cleaned_reviews_example$cleaned_review[1]
```
We will start by examining only single words, also called "**unigrams**" in linguistics jargon. Thus, `Medium`, `straw`, and `cloudy` are the first tokens in this dataset. We will mostly ignore punctuation and spacing, which means we are giving up some meaning for convenience.
We could figure out how to manually break this up into tokens with enough work, using functions like `str_split()`, but happily there are a number of competing `R` tools for tokenization, for example:
```{r a viewable tokenization example}
cleaned_reviews_example$cleaned_review[1] %>%
tokenizers::tokenize_words(simplify = TRUE)
```
Note that this function also (by default) turns every word into its lowercase version (e.g., `Medium` becomes `medium`) and strips out punctuation. This is because, to a program like `R`, upper- and lowercase versions of the same string are *not* equivalent. If we have reason to believe preserving this difference is important, we might want to rethink allowing this behavior.
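If case does matter for your application, both tools let you skip the lowercasing. A quick sketch (assuming the standard `lowercase` and `to_lower` arguments of these functions):
```{r keeping case while tokenizing}
# A sketch only: tokenizers::tokenize_words() can be asked not to lowercase,
# and tidytext::unnest_tokens() has the analogous to_lower = FALSE argument
cleaned_reviews_example$cleaned_review[1] %>%
  tokenizers::tokenize_words(lowercase = FALSE, simplify = TRUE)
```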
Now, we have 46 observed tokens for our first review in our example dataset. But of course this is not super interesting--we need to apply this kind of transformation to every single one of our ~20,000 reviews. With some fiddling around with `mutate()` we might be able to come up with a solution, but happily this is where we start using the `tidytext` package. The `unnest_tokens()` function built into that package will allow us to transform our text directly into the format we want, with very little effort.
```{r tokenizing our entire data set into words}
beer_data_tokenized <-
beer_data %>%
# We may want to keep track of a unique ID for each review
mutate(review_id = row_number()) %>%
unnest_tokens(output = token,
input = cleaned_review,
token = "words") %>%
# Here we do a bit of extra cleaning
mutate(token = str_remove_all(token, "\\.|_"))
beer_data_tokenized %>%
# We show just a few of the columns for printing's sake
select(beer_name, rating, token)
```
The `unnest_tokens()` function actually applies the `tokenizers::tokenize_words()` function in an efficient, easy way: it takes a column of raw `character` data and then tokenizes it as specified, outputting a tidy format of 1-token-per-row. Now we have our observations (tokens) each on its own line, ready for further analysis.
In order to make what we're doing easier to follow, let's also take a look at our example data.
```{r what does tidy tokenized data look like}
cleaned_reviews_tokenized <-
cleaned_reviews_example %>%
unnest_tokens(input = cleaned_review, output = token) %>%
# Here we do a bit of extra cleaning
mutate(token = str_remove_all(token, "\\.|_"))
cleaned_reviews_tokenized %>%
filter(beer_name == "Kozel") %>%
# Let's select a few variables for printing
select(beer_name, rating, token)
```
When we use `unnest_tokens()`, all of the non-text data gets treated as information about each token--so `beer_name`, `rating`, etc. are duplicated for each token that now has its own row. This will allow us to perform a number of types of text analysis.
### A quick note: saving data
We've talked about wanting to clear our workspace from `R` in between sessions to make sure our work and code are reproducible. But at this point we've done a lot of work, and we might want to save this work for later. Or, we might want to share our work with others. The `tidyverse` has a good utility for this when we're working with data frames and tibbles: the `write_csv()` function is the opposite of the `read_csv()` function we've gotten comfortable with: it takes a data frame and writes it into a `.csv` file that is easily shared with others or reloaded in a later session.
```{r example of writing csv files, eval=FALSE}
# Let's store our cleaned and tokenized reviews
write_csv(x = beer_data_tokenized, file = "data/tokenized_beer_reviews.csv")
```
## Text Analysis Pt. II: Basic text analysis approaches using tokens
We've now seen how to import and clean text, and to transform text data into one useful format for analysis: a tidy table of tokens. (There are certainly other useful formats, but we will not be covering them here.)
Now we get to the actually interesting part. We're going to tackle some basic but powerful approaches for parsing the meaning of large volumes of text.
### Word counts
A common approach for getting a quick, analytic summary of the tokens (observations) in our reviews is to look at the frequency of word use across reviews: this is pretty closely equivalent to a Check-All-That-Apply (CATA) framework. We will look at how often each term is used, and will then extend this to categories of interest to our analysis. So, for example, we might have hypotheses like the following that we could hope to investigate using word frequencies:
1. Overall, flavor words are the most frequently used words in reviews of beer online.
2. The frequency of flavor-related words will be different for major categories of beer, like "hop"-related words for IPAs and "chocolate"/"coffee" terms for stouts.
3. The top flavor words will be qualitatively more positive for reviews associated with the top 10% of ratings, and negative for reviews associated with the bottom 10%.
I will not be exploring each of these in this section of the workshop because of time constraints, but hopefully you'll be able to use the tools I will demonstrate to solve these problems on your own.
Let's start by combining our tidy data wrangling skills from the `tidyverse` with our newly tidied text data from `unnest_tokens()` to test the first hypothesis.
```{r what are the most frequent words in our beer data}
beer_data_tokenized %>%
# The count() function gives the number of rows that contain unique values
count(token) %>%
# get the 20 most frequently occurring words
slice_max(order_by = n, n = 20) %>%
# Here we do some wrangling for visualization
mutate(token = as.factor(token) %>% fct_reorder(n)) %>%
# let's visualize this just for fun
ggplot(aes(x = token, y = n)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(x = NULL, y = NULL, title = "The 20 most frequent words are not that useful.")
```
Unfortunately for our first hypothesis, we have quickly encountered a common problem in token-based text analysis. The most frequent words in most languages are "helper" words that are necessary for linguistic structure and communication, but are not unique to a particular topic or context. For example, in English the articles `a` and `the` tend to be the most frequent tokens in any text because they are so ubiquitous.
It takes us until the 15th most frequent word, `head`, to find a word that might have sensory or product implications.
So, in order to find meaning in our text, we need to find some way to look past these terms, whose frequency vastly outnumbers words we are actually interested in.
#### "stop words"
In computational linguistics, this kind of word is often called a **stop word**: a word that is functionally important but does not tell us much about the meaning of the text. There are many common lists of such stop words, and in fact the `tidytext` package provides one such list in the `stop_words` tibble:
```{r stop words look like:}
stop_words
```
The most basic approach to dealing with stop words is just to outright remove them from our data. This is easy when we've got a tidy, one-token-per-row structure.
We could use some version of `filter()` to remove stop words from our `beer_data_tokenized` tibble, but this is exactly what the `anti_join()` function (familiar to those of you who have used SQL) will do: with two tibbles, `X` and `Y`, `anti_join(X, Y)` will remove all rows in `X` that match rows in `Y`.
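As a toy illustration of that behavior (using small made-up tibbles, not our beer data):
```{r a toy anti_join illustration}
# Toy tibbles invented purely to illustrate anti_join() semantics
X <- tibble(word = c("the", "hop", "malt", "a"))
Y <- tibble(word = c("the", "a"))
# Keeps only the rows of X with no match in Y: "hop" and "malt"
anti_join(X, Y, by = "word")
```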
```{r what are the most frequent NON stop words in beer data}
beer_data_tokenized %>%
# "by = " tells what column to look for in X and Y
anti_join(y = stop_words, by = c("token" = "word")) %>%
count(token) %>%
# get the 20 most frequently occurring words
slice_max(order_by = n, n = 20) %>%
# Here we do some wrangling for visualization
mutate(token = as.factor(token) %>% fct_reorder(n)) %>%
# let's visualize this just for fun
ggplot(aes(x = token, y = n)) +
geom_col() +
coord_flip() +
theme_bw() +
labs(x = NULL, y = NULL, title = "These tokens are much more relevant.", subtitle = "We removed ~1200 stop words.")
```
Now we are able to address our very first hypothesis: while the most common term (`beer`) might in fact be a stop word for this data set, the next 11 top terms all seem to have quite a lot to do with flavor.
We can extend this approach just a bit to provide the start of one answer to our second question: are there different most-used flavor terms for different beer styles? We can start to tackle this by first defining a couple of categories (since there are a lot of different styles in this dataset):
```{r what beer styles are in our beer data}
beer_data %>%
count(style)
```
Let's just look at the categories I suggested, plus "lager". We will do this by using `mutate()` to create a `simple_style` column:
```{r what words are associated with some basic beer styles}
# First, we'll make sure our approach works
beer_data %>%
# First we will set our style to lower case to make matching easier
mutate(style = tolower(style),
# Then we will use a conditional match to create our new style
simple_style = case_when(str_detect(style, "lager") ~ "lager",
str_detect(style, "ipa") ~ "IPA",
str_detect(style, "stout|porter") ~ "dark",
TRUE ~ "other")) %>%
count(simple_style)
# Then we'll implement the approach for our tokenized data
beer_data_tokenized <-
beer_data_tokenized %>%
mutate(style = tolower(style),
# Then we will use a conditional match to create our new style
simple_style = case_when(str_detect(style, "lager") ~ "lager",
str_detect(style, "ipa") ~ "IPA",
str_detect(style, "stout|porter") ~ "dark",
TRUE ~ "other"))
# Finally, we'll plot the most frequent terms for each simple_style
beer_data_tokenized %>%
# filter out stop words
anti_join(stop_words, by = c("token" = "word")) %>%
# This time we count tokens WITHIN simple_style
count(simple_style, token) %>%
# Then we will group_by the simple_style
group_by(simple_style) %>%
slice_max(order_by = n, n = 20) %>%
# Removing the group_by is necessary for some steps
ungroup() %>%
# A bit of wrangling for plotting
mutate(token = as.factor(token) %>% reorder_within(by = n, within = simple_style)) %>%
ggplot(aes(x = token, y = n, fill = simple_style)) +
geom_col(show.legend = FALSE) +
facet_wrap(~simple_style, scales = "free", ncol = 4) +
scale_x_reordered() +
theme_bw() +
coord_flip() +
labs(x = NULL, y = NULL, title = "Different (sensory) words tend to be used with different styles.") +
theme(axis.text.x = element_text(angle = 90))
```
### Term Frequency-Inverse Document Frequency (TF-IDF)
We have seen that removing stop words dramatically increases the quality of our analysis with regard to our specific questions. However, you may be asking yourself: "how do I know what *counts as* a stop word?" For example, we see that "beer" is near the top of the list of most frequent tokens for all 4 categories we defined--it is not useful to us--but it is not part of the `stop_words` tibble. We could, of course, manually remove it (something like `filter(token != "beer")`). But removing each stop word manually requires us to make *a priori* judgments about our data, which is not ideal.
One approach that takes a statistical approach to identifying and filtering stop words without the need to define them *a priori* is the **tf-idf** statistic, which stands for **Term Frequency-Inverse Document Frequency**.
tf-idf requires not just defining tokens, as we have done already using the `unnest_tokens()` function, but defining "documents" for your specific dataset. The term "document" is somewhat misleading, as it is merely a categorical variable that tells us distinct units in which we want to compare the frequency of our tokens. So in our dataset, `simple_style` is one such document categorization (representing 4 distinct "documents"); we could also use `style`, representing 99 distinct categories ("documents"), or any other categorical variable.
The reason that we need to define a "document" variable for our dataset is that tf-idf will answer, in general, the following question:
> Given a set of "documents", what tokens are most frequent *within* each unique document, but are not common *between* all documents?
Defined this way, it is clear that tf-idf has an empirical kinship with all kinds of within/between statistics that we use on a daily basis. In practice, tf-idf is a simple product of two empirical properties of each token in the dataset.
#### Term Frequency
The **tf** is defined simply for each combination of document and token as the raw count of that token in that document, divided by the sum of raw counts of all tokens in that document.
<center>
$\text{tf}(t, d) = \frac{count_{t}}{\sum_{t' \in d}{count_{t'}}}$
</center>
This quantity may be modified (for example, by using smoothing with $\log{(1 + count_{t})}$). Exact implementations can vary.
#### Inverse Document Frequency
The **idf** is an empirical estimate of how good a particular token is at distinguishing one "document" from another. It is defined as the logarithm of the total number of documents, divided by the number of documents that contain the term in question.
<center>
$\text{idf}(t, D) = \log{\frac{N_D}{|\{d \in D \, : \, t \in d\}|}}$
</center>
Where $D$ is the set of documents in the dataset.
Remember that we obtain the tf-idf by *multiplying* the two terms, which explains the inverse fraction--if we imagined a "tf/df" with division, the definition of the document frequency might be more intuitive but less numerically tractable.
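Putting the two together, $\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$. To make these quantities concrete, here is a minimal hand-rolled sketch on a toy count table (the `toy_counts` tibble is invented for illustration); it should agree with what `bind_tf_idf()` computes for the same table.
```{r hand-rolling tf-idf on a toy table}
# A toy table of token counts in two "documents", made up for illustration
toy_counts <- tribble(
  ~doc, ~token,   ~n,
  "A",  "beer",   10,
  "A",  "hoppy",   3,
  "B",  "beer",    8,
  "B",  "roasty",  5
)
n_docs <- n_distinct(toy_counts$doc)
toy_counts %>%
  # tf: the share of each document's tokens made up by this token
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%
  # idf: log of (total documents / documents containing the token)
  group_by(token) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)
# Compare with: bind_tf_idf(toy_counts, term = token, document = doc, n = n)
```
Note that `beer`, which appears in both toy documents, gets a tf-idf of exactly 0--the same behavior we will see for stop words below.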
#### Applying tf-idf in `tidytext`
Overall, the **tf** part of tf-idf is just a scaled version of what we've been doing so far. But the **idf** part provides a measure of validation. For the very frequent terms that we have been removing as stop words, the tf might be high (they occur a lot) but the idf will be extremely small--all documents use tokens like `the` and `is`, so they provide little discriminatory power between documents. Ideally, then, tf-idf will give us a way to drop out stop words without the need to specify them *a priori*.
Happily, we don't have to figure out a function for calculating tf-idf ourselves; `tidytext` provides the `bind_tf_idf()` function. The only requirement is that we have a data frame that provides a per-"document" count for each token, which we have already done above:
```{r tf-idf example}
# Let's first experiment with our "simple_styles"
beer_styles_tf_idf <-
beer_data_tokenized %>%
count(simple_style, token) %>%
# And now we can directly add on tf-idf
bind_tf_idf(term = token, document = simple_style, n = n)
beer_styles_tf_idf %>%
# let's look at some stop words
filter(token %in% c("is", "a", "the", "beer")) %>%
gt::gt()
```
We can see that, despite very high raw counts for all of these terms, the tf-idf is 0! They do not discriminate across our document categories at all, and so if we start looking for terms with high tf-idf, these will drop out completely.
Let's take a look at the same visualization we made before with raw counts, but now with tf-idf.
```{r what words are most important for our simple styles by tf-idf}
beer_styles_tf_idf %>%
# We still group by simple_style
group_by(simple_style) %>%
# Now we want tf-idf, not raw count
slice_max(order_by = tf_idf, n = 20) %>%
ungroup() %>%
# A bit of wrangling for plotting
mutate(token = as.factor(token) %>% reorder_within(by = tf_idf, within = simple_style)) %>%
ggplot(aes(x = token, y = tf_idf, fill = simple_style)) +
geom_col(show.legend = FALSE) +
facet_wrap(~simple_style, scales = "free", ncol = 4) +
scale_x_reordered() +
theme_bw() +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "With tf-idf we get much more specific terms.",
subtitle = "For example, 'oatmeal' for stouts, 'grapefruity' for IPAs, and so on.") +
theme(axis.text.x = element_blank())
```
Thus, tf-idf gives us a flexible, empirical (data-based) model that will surface unique terms for us directly from the data. We can apply the same approach, for example, with a bit of extra wrangling, to see what terms are most associated with beers from the bottom and top deciles by rating (starting to address hypothesis #3):
```{r what words are associated with very good or bad beers by tf-idf}
beer_data_tokenized %>%
# First we get deciles of rating
mutate(rating_decile = ntile(rating, n = 10)) %>%
# And we'll select just the top 2 and bottom 2 deciles %>%
filter(rating_decile %in% c(1, 2, 9, 10)) %>%
# Then we follow the exact same pipeline to get tf-idf
count(rating_decile, token) %>%
bind_tf_idf(term = token, document = rating_decile, n = n) %>%
group_by(rating_decile) %>%
# Since we have more groups, we'll just look at 10 tokens
slice_max(order_by = tf_idf, n = 10) %>%
ungroup() %>%
# A bit of wrangling for plotting
mutate(token = as.factor(token) %>% reorder_within(by = tf_idf, within = rating_decile)) %>%
ggplot(aes(x = token, y = tf_idf, fill = rating_decile)) +
geom_col(show.legend = FALSE) +
facet_wrap(~rating_decile, scales = "free", ncol = 4) +
scale_x_reordered() +
scale_fill_viridis_c() +
theme_bw() +
coord_flip() +
labs(x = NULL,
y = NULL,
subtitle = "When we compare the top 2 and bottom 2 deciles, we see much more affective language.") +
theme(axis.text.x = element_blank())
```
An important point to keep in mind about tf-idf is that it is **data-based**: the numbers are only meaningful within the specific set of tokens and documents. Thus, if I repeat the above pipeline but include all 10 deciles, we will not see the same words (and we will see that the model immediately struggles to find meaningful terms at all). This is because I am no longer calculating with 4 documents, but 10. These statistics are best thought of as descriptive, especially when we are repeatedly using them to explore a dataset.
```{r tf-idf is data-based so it will change with "document" choice, echo = FALSE}
beer_data_tokenized %>%
# First we get deciles of rating
mutate(rating_decile = ntile(rating, n = 10)) %>%
# Then we follow the exact same pipeline to get tf-idf
count(rating_decile, token) %>%
bind_tf_idf(term = token, document = rating_decile, n = n) %>%
group_by(rating_decile) %>%
# Since we have more groups, we'll just look at 10 tokens
slice_max(order_by = tf_idf, n = 10) %>%
ungroup() %>%
# A bit of wrangling for plotting
mutate(token = as.factor(token) %>% reorder_within(by = tf_idf, within = rating_decile)) %>%
ggplot(aes(x = token, y = tf_idf, fill = rating_decile)) +
geom_col(show.legend = FALSE) +
facet_wrap(~rating_decile, scales = "free", ncol = 5) +
scale_x_reordered() +
scale_fill_viridis_c() +
theme_bw() +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "tf-idf will change based on the documents being compared") +
theme(axis.text.x = element_blank())
```
We've now practiced some basic tools for first wrangling text data into a form we can work with, and then applying some simple, empirical statistics to start to extract meaning. Next, we'll discuss **sentiment analysis**, a broad suite of tools that attempts to impute some sort of emotional or affective weight to words, and then produce scores for texts based on these weights.
## Text Analysis Pt. III: Sentiment analysis
In recent years, sentiment analysis has gotten a lot of attention in consumer sciences, and it's one of the topics that has penetrated the furthest into sensory science, because its goals are easily understood in terms of our typical goals: quantifying consumer affective responses to a set of products based on consumption experiences.
In **sentiment analysis**, we attempt to replicate the human inferential process, in which we, as readers, are able to infer--without necessarily being able to explicitly describe *how* we can tell--the emotional tone of a piece of writing. We understand implicitly whether the author is trying to convey various emotional states, like disgust, appreciation, joy, etc.
In recent years, there have been a number of approaches to sentiment analysis. The current state of the art is to use some kind of machine learning, such as random forests ("shallow learning") or pre-trained neural networks ("deep learning"; usually convolutional, but sometimes recursive) to learn about a large batch of similar texts and then to process texts of interest. Not to plug myself too much, but we have a poster on such an approach at this conference, which may be of interest to some of you.
<center>
![Machine learning: just throw algebra at the problem! (via [XKCD](https://xkcd.com/1838))](https://imgs.xkcd.com/comics/machine_learning.png)
</center>
While these state-of-the-art approaches are outside of the scope of this workshop, older techniques that use pre-defined dictionaries of sentiment words are easy to implement with a tidy text format, and can be very useful for the basic exploration of text data. These have been used in recent publications to great effect, such as in the recent paper from [Luc et al. (2020)](https://doi.org/10.1016/j.foodqual.2019.103751).
The `tidytext` package includes the `sentiments` tibble by default, which is a version of that published by [Hu & Liu (2004)](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
```{r the included lexicons for sentiment in tidytext}
sentiments
count(sentiments, sentiment)
```
This is a very commonly used lexicon, which has been shown to perform reasonably well on online-review-based texts. The `tidytext` package also gives easy access to a few other lexicons through the `get_sentiments()` function, which may prompt you to download some of the lexicons in order to avoid copyright issues. For the purpose of this workshop, we'll just work with the `sentiments` tibble, which can also be accessed using `get_sentiments("bing")`.
The structure of tidy data with tokens makes it very easy to use these dictionary-based sentiment analysis approaches. To do so, all we need to do is use a `left_join()` function, which is similar to the `anti_join()` function we used for stop words. In this case, for data tables `X` and `Y`, `left_join(X, Y)` finds rows in `Y` that match `X`, and imports all the columns from Y for those matches.
```{r using simple data wrangling to add sentiments to our beer data}
beer_sentiments <-
beer_data_tokenized %>%
left_join(sentiments, by = c("token" = "word"))
beer_sentiments %>%
select(review_id, beer_name, style, rating, token, sentiment)
```
It is immediately apparent how sparse the sentiment lexicons are compared to our actual data. This is one key problem with dictionary-based approaches--we often don't have application- or domain-specific lexicons, and the lexicons that exist are often not well calibrated to our data.
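A quick way to quantify that sparsity is to check what share of tokens found no match in the lexicon (the exact proportion will depend on your data and lexicon):
```{r how sparse is the sentiment matching}
# Proportion of tokens with no sentiment assignment after the left_join()
beer_sentiments %>%
  summarize(prop_unmatched = mean(is.na(sentiment)))
```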
We can perform a simple `count()` and `group_by()` operation to get a rough sentiment score.
```{r our split-apply-combine approach for sentiment analysis}
sentiment_ratings <-
beer_sentiments %>%
count(review_id, sentiment) %>%
drop_na() %>%
group_by(review_id) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
sentiment_ratings
```
In this case, we're being *very* rough: we're not taking into account any of the non-sentiment words, for example. Nevertheless, let's take a look at what we've gotten:
```{r visualizing basic sentiment against rating, warning = FALSE, message = FALSE}
beer_data %>%
mutate(review_id = row_number()) %>%
left_join(sentiment_ratings) %>%
ggplot(aes(x = sentiment, y = rating)) +
geom_jitter(alpha = 0.2, size = 1) +
geom_smooth(method = "lm") +
coord_cartesian(xlim = c(-11, 30), ylim = c(1, 5)) +
theme_bw()
```
It does appear that there is a positive, probably non-linear relationship between sentiment and rating. We could do some reshaping of the sentiment scores (normalizing, and possibly exponentiating or otherwise transforming them) to get a clearer relationship.
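As one hedged sketch of such a transformation (my own illustration, not a recommendation), we could min-max rescale both variables to [0, 1] so that they are at least on comparable scales:
```{r a sketch of rescaling sentiment and rating}
# Rescale both sentiment and rating to [0, 1] for a rough comparison
beer_data %>%
  mutate(review_id = row_number()) %>%
  left_join(sentiment_ratings, by = "review_id") %>%
  mutate(sentiment_scaled = (sentiment - min(sentiment, na.rm = TRUE)) /
           (max(sentiment, na.rm = TRUE) - min(sentiment, na.rm = TRUE)),
         rating_scaled = (rating - min(rating, na.rm = TRUE)) /
           (max(rating, na.rm = TRUE) - min(rating, na.rm = TRUE))) %>%
  select(beer_name, rating_scaled, sentiment_scaled)
```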
```{r we can improve by doing some normalization}
beer_data <-
beer_data %>%
mutate(review_id = row_number()) %>%
left_join(sentiment_ratings)
beer_data %>%
select(beer_name, style, rating, sentiment) %>%
pivot_longer(cols = c(rating, sentiment), names_to = "scale", values_to = "value") %>%
group_by(style, scale) %>%
summarize(mean_value = mean(value, na.rm = TRUE),
se_value = sd(value, na.rm = TRUE) / sqrt(n())) %>%
group_by(scale) %>%
slice_max(order_by = mean_value, n = 10) %>%
ungroup() %>%
mutate(style = factor(style) %>% reorder_within(by = mean_value, within = scale)) %>%
ggplot(aes(x = style, y = mean_value, fill = scale)) +
geom_col(position = "dodge", show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~scale, scales = "free")
```
We can see that we get quite different rankings from sentiment scores and from ratings. It might behoove us to examine some of those mismatches in order to identify where they originate.
```{r where are the disagreements between rating and sentiment}
# Here are the reviews where the ratings most disagree with the sentiment
beer_data %>%
# normalize sentiment and ratings so we can find the largest mismatch
mutate(rating_norm = rating / max(rating, na.rm = TRUE),
sentiment_norm = sentiment / max(sentiment, na.rm = TRUE),
diff = rating_norm - sentiment_norm) %>%
select(review_id, beer_name, sentiment_norm, rating_norm, diff, cleaned_review) %>%
slice_max(order_by = diff, n = 2) %>%
gt::gt()
# And here are the reviews where the sentiment most disagreed with the rating
beer_data %>%
# normalize sentiment and ratings so we can find the largest mismatch
mutate(rating_norm = rating / max(rating, na.rm = TRUE),
sentiment_norm = sentiment / max(sentiment, na.rm = TRUE),
diff = sentiment_norm - rating_norm) %>%
select(review_id, beer_name, sentiment_norm, rating_norm, diff, cleaned_review) %>%
slice_max(order_by = diff, n = 2) %>%
gt::gt()
```
Interestingly, we see here some evidence that some kind of normalization for review length might be necessary. Let's take one more look into this area:
```{r review length seems to affect rating and sentiment to different degrees, message = FALSE, warning = FALSE}
beer_data %>%
mutate(word_count = tokenizers::count_words(cleaned_review)) %>%
select(review_id, rating, sentiment, word_count) %>%
pivot_longer(cols = c(rating, sentiment)) %>%
ggplot(aes(x = word_count, y = value, color = name)) +
geom_jitter() +
geom_smooth(method = "lm", color = "black") +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
```
It appears that both rating and sentiment are positively correlated with word count, but that (unsurprisingly) sentiment is more strongly correlated. Going forward, we might want to consider some way to normalize for word count in our sentiment scores.
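One simple way to do that (a sketch of my own, not part of the original analysis) is to divide the net sentiment by the number of words in the review and look at that per-word score against rating:
```{r a sketch of length-normalized sentiment, message = FALSE, warning = FALSE}
# Divide net sentiment by word count to get a per-word sentiment score
beer_data %>%
  mutate(word_count = tokenizers::count_words(cleaned_review),
         sentiment_per_word = sentiment / word_count) %>%
  ggplot(aes(x = sentiment_per_word, y = rating)) +
  geom_jitter(alpha = 0.2, size = 1) +
  geom_smooth(method = "lm") +
  theme_bw()
```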
## Text Analysis Pt. IV: *n*-gram models
An obvious and valid critique of sentiment analysis as implemented above is the focus on single words, or **unigrams**. It is very clear that English (and most languages) have levels of meaning that go beyond the single word, so focusing our analysis on single words as our tokens/observations will lose some (or a lot!) of that meaning. It is common in modern text analysis to look at units beyond the single word: bi-, tri-, or *n*-grams, for example, or so-called "skipgrams" for the popular and powerful `word2vec` family of algorithms. Tools like convolutional and recursive neural networks will look at larger chunks of texts, combined in different ways.
Let's take a look at *bigrams* in our data-set: tokens that are made up of 2 adjacent words. We can get them using the same `unnest_tokens()` function, but with the request to get `token = "ngrams"` and the number of tokens set to `n = 2`.
```{r tokenizing for bigrams}
beer_bigrams <-
beer_data %>%
unnest_tokens(output = bigram, input = cleaned_review, token = "ngrams", n = 2)
beer_bigrams %>%
filter(review_id == 1) %>%
select(beer_name, bigram)
```
We can see that, in this first review, we already have a bigram with strong semantic content: `not sure` should tell us that looking only at the unigram `sure` will give us misleading results.
We can use some more `tidyverse` helpers to get a better idea of the scale of the problem. The `separate()` function breaks one character column into multiple columns, splitting on whatever separating characters you specify.
```{r what are the most common bigrams with "not"?}
beer_bigrams %>%
separate(col = bigram, into = c("word1", "word2"), sep = " ") %>%
# Now we will first filter for bigrams starting with "not"
filter(word1 == "not") %>%
# And then we'll count up the most frequent pairs of negated biterms
count(word1, word2) %>%
arrange(-n)
```
We can see that the 4th most frequent pair is "not bad", which means that `bad`, classified as a `negative` token in the `sentiments` tibble, is actually being *overcounted* as negative in our simple unigram analysis.
We can actually look more closely at this with just a little bit more wrangling:
```{r what are our most important SENTIMENT bigrams with "not"}
beer_bigrams %>%
separate(col = bigram, into = c("word1", "word2"), sep = " ") %>%
filter(word1 == "not") %>%
inner_join(sentiments, by = c("word2" = "word")) %>%
count(word1, word2, sort = TRUE)
```
We could use such an approach to try to get a handle on the problem of context in our sentiment analysis, by inverting the sentiment score of any word that is part of a "not bigram". Of course, such negations can be expressed over entire sentences, or even flavor the entire text (as in the use of sarcasm). There are entire analysis workflows built on this kind of context flow.
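As a rough sketch of that inversion idea (my own illustration, not the original workshop code), we can score each matched sentiment word as +1/-1 and flip the sign whenever it is preceded by "not":
```{r a sketch of flipping sentiment for negated bigrams}
# Score the second word of each bigram with the unigram lexicon,
# then invert the score when the preceding word is a negation
beer_bigrams %>%
  separate(col = bigram, into = c("word1", "word2"), sep = " ") %>%
  inner_join(sentiments, by = c("word2" = "word")) %>%
  mutate(score = if_else(sentiment == "positive", 1, -1),
         score = if_else(word1 == "not", -score, score)) %>%
  group_by(review_id) %>%
  summarize(adjusted_sentiment = sum(score))
```
This only handles the single negator "not", so it is a crude approximation of true context handling.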
For example, the [`sentimentr` package](https://github.com/trinker/sentimentr), which was used in the recent [Luc et al. (2020) paper on free JAR analysis](https://doi.org/10.1016/j.foodqual.2019.103751), builds just such an approach into a complete workflow. We can quickly experiment with this package as a demonstration.
```{r more sophisticated lexicon-based sentiment analysis with sentimentr}
library(sentimentr)
polarity_sentiment <-
cleaned_reviews_example %>%
select(beer_name, rating, cleaned_review) %>%
get_sentences() %>%
sentiment_by(by = "beer_name")
polarity_sentiment
```
We can visualize how the `sentimentr` algorithm "*sees*" the sentences by using the `highlight(polarity_sentiment)` call, but since this outputs `HTML` I will embed it as an image here.
<center>
![`sentimentr` sentence-based sentiment analysis, using the dictionary-based, polarity-shifting algorithm.](img/sentimentr-highlight.png)
</center>
As a note, this is only one example of such an algorithm, and it may or may not be the best one. While Luc et al. (2020) found good results, we (again plugging our poster) found that `sentimentr` didn't outperform simpler algorithms. However, this may have to do with the setup of lexicons, stop-word lists, etc.