---
title: "AS06_Visualizing-Text-Data_ref"
author: "曾子軒 Teaching Assistant"
date: "2021/04/27"
output:
  html_document:
    number_sections: no
    theme: united
    highlight: tango
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: no
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'hold', comment = '#>', error = TRUE)
```
## Assignment Goal: Data Visualization (02) Text

This assignment is meant to get you comfortable with Chinese text processing: you will run some basic text analysis and present the results as charts. Along the way you will use the document-level and word-level text analysis, regular expressions, and text mining covered in the past few weeks' videos.

The assignment uses the inaugural addresses of the presidents of the Republic of China, as collected on Wikisource. Because an inaugural address frames each presidential term and carries real weight, media at home and abroad often mine its text for reporting. For 2020, see CNA's [蔡總統關心什麼 文字會說話](https://www.cna.com.tw/project/20200520-inauguraladdress/) and readr's [少了「年輕人」多了「防疫」:臺灣歷屆民選總統就職演說字詞分析](https://www.readr.tw/post/2433). For examples from abroad, see ["I Have The Best Words." Here's How Trump's First SOTU Compares To All The Others.](https://www.buzzfeednews.com/article/peteraldhous/trump-state-of-the-union-words#.tbr8QJvA7) by BuzzFeed, [Word Analysis of 2016 Presidential debates - Clinton vs. Trump](http://mkweb.bcgsc.ca/debates2016/) by Martin Krzywinski, and [Trump used words like 'invasion' and 'killer' to discuss immigrants at rallies 500 times: USA TODAY analysis](https://www.usatoday.com/story/news/politics/elections/2019/08/08/trump-immigrants-rhetoric-criticized-el-paso-dayton-shootings/1936742001/) by USA TODAY.

A small caveat: using the [raw data](https://zh.wikisource.org/wiki/%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E7%AC%AC%E5%8D%81%E5%9B%9B%E4%BB%BB%E7%B8%BD%E7%B5%B1%E5%B0%B1%E8%81%B7%E6%BC%94%E8%AA%AA) or the segmentation output as-is (台灣 vs. 臺灣) can lead you astray!
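
A minimal sketch of that pitfall (the toy strings are made up for illustration): the two variant spellings count as different words unless you normalize them first.

```{r}
library(stringr)
x <- c("台灣的未來", "臺灣的未來")   # same word, two variant spellings
str_count(x, "臺灣")                 # 0 1 -- the 台灣 variant is missed
x_norm <- str_replace_all(x, "台灣|臺灣", "臺灣")
str_count(x_norm, "臺灣")            # 1 1 -- normalize before counting
```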
## Assignment: Data Visualization (02) Text
```{r message=FALSE, warning=FALSE}
### do not modify this part
library(tidyverse)
library(jiebaR)
library(tidytext)
df_speech <- read_csv("data/AS06/df_speech.csv")

### a quick look at what the data looks like
df_speech %>% glimpse()
```
### 0. Word segmentation:
Use `library(jiebaR)` to segment the text, and keep a column of part-of-speech (POS) tags along the way.
```{r message=FALSE, warning=FALSE}
### your code
### segmentation: a POS-tagging worker with a stop-word list
cutter <- worker("tag", stop_word = "data/segment/df_stopword.txt")
vector_word <- c("中華民國", "蔡英文", "李登輝", "蔣中正", "蔣經國", "陳水扁", "馬英九")
### new_user_word() expects a character vector of words, so read the
### dictionary files in first (assuming one word per line)
new_user_word(cutter, words = readLines("data/segment/dict_jieba.txt"))
new_user_word(cutter, words = readLines("data/segment/hand.txt"))
new_user_word(cutter, words = readLines("data/segment/news.txt"))
new_user_word(cutter, words = vector_word)
### decode the URL-encoded full-width space (U+3000) so it can be removed below
reg_space <- "%E3%80%80" %>% curl::curl_unescape()

### text part: normalize 台灣/臺灣, strip whitespace and punctuation noise
### plus Latin/digit runs, then segment and keep the POS tags
df_speech_seg <-
  df_speech %>%
  mutate(text = str_replace_all(text, "台灣|臺灣", "臺灣")) %>%
  mutate(text = str_remove_all(text, "\\n|\\r|\\t|:| | ")) %>%
  mutate(text = str_remove_all(text, reg_space)) %>%
  mutate(text = str_remove_all(text, "[a-zA-Z0-9]+")) %>%
  mutate(text_segment = purrr::map(text, function(x) segment(x, cutter))) %>%
  mutate(text_POS = purrr::map(text_segment, function(x) names(x)))

# df_speech_seg %>% write_rds("data/AS06/df_speech_seg.rds")
```
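
For reference, `segment()` with a `"tag"` worker returns a character vector of words whose *names* are the POS tags, which is why `names()` recovers the POS column above. A minimal sketch using only jiebaR's built-in dictionary (the sample sentence is made up):

```{r}
library(jiebaR)
demo_cutter <- worker("tag")    # POS-tagging worker, default dictionaries
seg <- segment("總統在臺北發表就職演說", demo_cutter)
seg          # the words themselves
names(seg)   # the POS tag of each word, e.g. "n", "ns", "v"
```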
### 1. Overall top terms:
First find the 10 terms that appear most often across all presidential addresses, then count how often each of those terms appears in each address, and plot the result. Segmentation output depends on the dictionaries you load, so treat the sample figure below as a reference only, and try to surface **meaningful** terms!
```{r message=FALSE, warning=FALSE}
### your code
# df_speech_seg <- read_rds("data/AS06/df_speech_seg.rds")
### one row per (speech, word): unnest the word and POS list-columns
df_speech_seg_unnest <- df_speech_seg %>%
  unnest(c(text_segment, text_POS))

### per-speech term counts, keeping only terms longer than one character
df_term_seg_count <- df_speech_seg_unnest %>%
  count(id, term, year, text_segment, text_POS) %>%
  filter(str_length(text_segment) > 1)

### overall top 10 terms across all speeches
df_seg_count_top <- df_term_seg_count %>%
  group_by(text_segment, text_POS) %>% summarise(n = sum(n)) %>%
  arrange(desc(n)) %>% ungroup() %>%
  filter(!text_segment %in% c("一個")) %>%
  slice(1:10) %>% select(text_segment)

df_term_seg_count %>%
  inner_join(df_seg_count_top, by = "text_segment") %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = year, y = text_segment, fill = n)) + geom_tile() +
  theme_bw() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(x = "年份", y = "詞彙", title = "歷屆總統演說使用熱詞", fill = "次數") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 18),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"),
        text = element_text(family = "Noto Sans CJK TC Medium"))

### your result should be
# just draw it yourself!
```
```{r, echo=FALSE}
knitr::include_graphics('data/AS06/plot01.png')
```
### 2. Top terms per president:
For each president, find the 10 terms that appear most often across their addresses, and plot the result. Segmentation output depends on the dictionaries you load, so treat the sample figure below as a reference only, and try to surface **meaningful** terms!
```{r message=FALSE, warning=FALSE}
### your code
### per-president term counts
df_president_seg_count <- df_speech_seg_unnest %>%
  count(president, text_segment, text_POS) %>%
  filter(str_length(text_segment) > 1)

### top 10 terms within each president, presidents in chronological order
df_president_seg_count_top <- df_president_seg_count %>% group_by(president) %>%
  arrange(president, desc(n)) %>% mutate(rn = row_number()) %>%
  filter(rn <= 10) %>% ungroup() %>%
  mutate(president = fct_relevel(as.factor(president), "蔣中正", "蔣經國", "李登輝", "陳水扁", "馬英九", "蔡英文"))

df_president_seg_count_top %>%
  mutate(text_segment = reorder_within(text_segment, n, president)) %>%
  ggplot(aes(x = text_segment, y = n)) + geom_col() +
  facet_wrap(president ~ ., scales = "free") +
  coord_flip() +
  theme_bw() +
  scale_x_reordered() +
  labs(x = "詞彙", y = "次數", title = "歷屆總統演說使用熱詞") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 18),
        text = element_text(family = "Noto Sans CJK TC Medium"))

### your result should be
# just draw it yourself!
```
```{r, echo=FALSE}
knitr::include_graphics('data/AS06/plot02.png')
```
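
A side note on the ordering trick used above: `reorder_within()` sorts terms *within* each facet by pasting the facet label onto every term (with a `___` separator), and `scale_x_reordered()` strips that suffix back off the axis labels. A toy illustration (the data frame is made up):

```{r}
library(tidytext)
d <- tibble(term = c("a", "b", "a", "b"),
            grp  = c("g1", "g1", "g2", "g2"),
            n    = c(1, 2, 4, 3))
reorder_within(d$term, d$n, d$grp)
# a factor with values like "a___g1", "b___g2", ordered by n within each grp;
# scale_x_reordered() later removes the "___grp" suffix from the labels
```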
### 3. TF-IDF:
First drop terms that appear fewer than 5 times in a given president's addresses, then compute TF-IDF (watch the instructor's video if you don't know what that is!), and plot the result. Segmentation output depends on the dictionaries you load, so treat the sample figure below as a reference only, and try to surface **meaningful** terms!
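
As a quick reminder of the definition (this matches what `tidytext::bind_tf_idf()` computes, using the natural logarithm, with each president's collected addresses treated as one document):

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\left|\{ d : t \in d \}\right|}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
$$

where $N$ is the number of documents and $n_{t,d}$ is the count of term $t$ in document $d$.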
```{r message=FALSE, warning=FALSE}
### your code
### drop terms a president uses fewer than 5 times, score with TF-IDF
### (documents = presidents), and keep each president's top 10
df_president_tfidf <- df_president_seg_count %>% filter(n >= 5) %>%
  bind_tf_idf(text_segment, president, n) %>%
  group_by(president) %>% arrange(-tf_idf) %>%
  slice(1:10) %>% ungroup() %>%
  mutate(president = fct_relevel(as.factor(president), "蔣中正", "蔣經國", "李登輝", "陳水扁", "馬英九", "蔡英文"))

df_president_tfidf %>%
  mutate(text_segment = reorder_within(text_segment, tf_idf, president)) %>%
  ggplot(aes(x = text_segment, y = tf_idf)) + geom_col() +
  facet_wrap(president ~ ., scales = "free") +
  coord_flip() +
  theme_bw() +
  scale_x_reordered() +
  labs(x = "詞彙", y = "tf-idf", title = "歷屆總統演說代表性詞彙") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 18),
        text = element_text(family = "Noto Sans CJK TC Medium"))

### your result should be
# just draw it yourself!
```
```{r, echo=FALSE}
knitr::include_graphics('data/AS06/plot03.png')
```
### 4. Head-to-head:
Keep only the terms used by Tsai Ing-wen (蔡英文) and Ma Ying-jeou (馬英九), find the 10 terms with the largest usage-count gap in each direction, and plot the result. Segmentation output depends on the dictionaries you load, so treat the sample figure below as a reference only, and try to surface **meaningful** terms!
```{r message=FALSE, warning=FALSE}
### your code
### term counts for the two presidents only
df_ying_seg_count <- df_speech_seg_unnest %>%
  filter(president %in% c("馬英九", "蔡英文")) %>%
  count(president, text_segment) %>%
  filter(str_length(text_segment) > 1)

### one row per term, counts side by side (0 when a president never used it),
### plus the count gap in each direction
df_ying_seg_diff <- df_ying_seg_count %>%
  pivot_wider(names_from = president, values_from = n, values_fill = list(n = 0)) %>%
  mutate(diff_tsai = `蔡英文` - `馬英九`, diff_ma = -diff_tsai)

### top 10 gaps for each side, drawn as a diverging bar chart
df_ying_seg_diff %>% arrange(desc(diff_tsai)) %>% slice(1:10) %>%
  select(text_segment, diff = diff_tsai) %>% mutate(president = "蔡英文") %>%
  bind_rows(
    df_ying_seg_diff %>% arrange(desc(diff_ma)) %>% slice(1:10) %>%
      select(text_segment, diff = diff_ma) %>% mutate(president = "馬英九")
  ) %>%
  mutate(diff2 = if_else(president == "馬英九", -diff, diff)) %>%
  mutate(text_segment = reorder(text_segment, diff2)) %>%
  ggplot(aes(x = diff2, y = text_segment, fill = president)) + geom_col() +
  theme_bw() +
  scale_x_continuous(limits = c(-50, 50)) +
  scale_fill_manual(values = c("#1B9431", "#000095")) +
  labs(x = "次數", y = "詞彙", title = "雙英對決:馬英九與蔡英文使用次數差異最大詞彙", fill = "總統") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "black")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(family = "Noto Sans CJK TC Medium", face = "plain", size = 18),
        legend.text = element_text(family = "Noto Sans CJK TC Medium", face = "plain"),
        text = element_text(family = "Noto Sans CJK TC Medium"))

### your result should be
# just draw it yourself!
```
```{r, echo=FALSE}
knitr::include_graphics('data/AS06/plot04.png')
```
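
The diverging bars above lean on one detail: `pivot_wider()` with `values_fill` puts a 0 (rather than `NA`) wherever one president never used a term, so the subtraction stays well defined. A toy illustration (made-up counts):

```{r}
tibble(president = c("甲", "甲", "乙"),
       text_segment = c("民主", "經濟", "民主"),
       n = c(3, 1, 5)) %>%
  pivot_wider(names_from = president, values_from = n,
              values_fill = list(n = 0))
# "經濟" gets 0 (not NA) in the 乙 column, so the counts can be subtracted directly
```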
### Closing remarks
https://collabin.netlify.app/don/tongxinglian-in-samesex-marriage-corpora-2/

This is the worked example the instructor recommended yesterday; I think its ending puts it well:

"I recently came across this line in a course reading:

Computer-assisted text analysis [is] an activity best employed not in the service of a heightened critical objectivity, but as one that embraces the possibilities of that deepened subjectivity upon which critical insight depends.

This is what Stephen Ramsay discusses in his article "Toward an algorithmic criticism" (2003, Literary and Linguistic Computing, 18(2): 167-174): what role computers and algorithms play in discourse analysis and text analysis. He argues that algorithmic rigor and correctness are not the end point of analysis, and that objectivity and positivist truth are not the goal either; what matters more is whether the algorithm lets the analyst see what could not be seen before, and thereby discuss the text more deeply. (He later expanded this into a book: Reading Machines: Toward an Algorithmic Criticism.)

While practicing analyzing texts with different tools, it really did feel like glimpsing some version of the world that the tools see."

"Lets the analyst see what could not be seen before, and thereby discuss the text more deeply" — well said!