---
title: "Querying Twitter API Data in Spark"
subtitle: "System Investigation Project - Part 4"
author: "Hannah Luebbering"
date: "June 07, 2022"
output:
  xaringan::moon_reader:
    css: "assets/main3.css"
# knit: pagedown::chrome_print
knit: (function(inputFile, encoding) {rmarkdown::render(inputFile, encoding = encoding, output_dir = "docs") })
bibliography: "assets/references.bib"
nocite: "@*"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center")
library(dplyr)
library(twitteR)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(scales)
library(tidyr)
library(ggplot2)
library(tidytext)
library(quanteda)
library(hrbrthemes)
library(httr)
library(devtools)
library(plyr)
library(readr)
library(plotly)
library(rtweet)
library(syuzhet)
library(textfeatures)
library(gridExtra)
library(patchwork)
library(ggpubr)
library(pagedown)
source("data.R")
```
<script src="min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
## Introduction.
For the final project, I implement several queries over the selected data set on the Spark system and walk through the implementation and code.
[Link to Video](https://drive.google.com/file/d/1w6Dsk1ibFuVRmypqrdXuXlqmqpObAz5L/view?usp=sharing)
<!-- This demonstration includes a recorded video (5 minutes) and a set of slides to accompany the video. The slides document the project's code, data storage and implementation. The goal is to communicate what the project accomplished. -->
### System Overview
<div class = "datasource2">
<code>Apache Spark</code> is an open-source engine for large-scale parallel data processing, known for its speed, ease of use, and advanced analytics. It provides high-level APIs in general-purpose programming languages such as Scala, Python, and R, along with an optimized execution engine that supports standard data analysis methods. It also ships with built-in libraries covering common needs, including Spark SQL for DataFrames, the pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
</div>
<div class = "datasource2">
<code>Azure Databricks</code> is an analytics platform built on Microsoft Azure cloud services, providing the latest versions of Apache Spark and consistent integration with open-source libraries. Built around Spark, Azure Databricks offers a cloud platform with interactive workspaces and fully managed Spark clusters. These capabilities let users create and configure clusters in seconds, gain instant access to Apache Spark, and build solutions quickly on Azure.
</div>
---
### Data Description
<div class = "roundedlist22">
<span class="myhighlight">Twitter (Elon Musk 2015-2022): </span> Dataset of Elon Musk’s Tweets from 2015 to 2022, stored in CSV format, where each row represents a separate tweet object. All Tweets are collected, parsed, and plotted using the Twitter API and the <code>rtweet</code> package in R. In total, the dataset contains more than ten thousand tweets, including retweets, replies, and quotes. All objects are stored in a single database.
</div>
```{r, out.width="100%"}
kable(dfvar, escape = FALSE, col.names = NULL) %>% kable_styling(font_size = 15, bootstrap_options = c("striped", "hover"), html_font = "Roboto Condensed") %>% column_spec(c(1, 3, 5, 7), extra_css = c("font-weight: 700; font-family: Roboto; font-size: 16px;"), width = "1cm") %>%
add_header_above(c("Data Set Variables" = 8), align = "l", extra_css = c("text-transform: uppercase;"), font_size = 15)
```
---
<span class="myhighlight2">Data Set Preview: </span>
```{r}
tweetDF2 <- tweetDF %>% select(-status_id, -symbols)
kable(tweetDF2, escape = F) %>%
kable_styling(font_size = 13, html_font = "Roboto Condensed", bootstrap_options = c("striped")) %>%
column_spec(c(3:4), width_max = "2.5cm", extra_css = c("font-size: 11px;")) %>%
column_spec(1, extra_css = c("font-size: 11.25px;"), width_min = "2cm") %>%
row_spec(0, extra_css = c("font-weight: 700; font-size: 11px; padding: 3.5pt; text-transform: lowercase;"))
```
---
## Project Description
For this project, I set out to explore and answer the following questions and ideas about the selected Twitter data set.
<div class = "datasource2">
<ol>
<li><code>(path finding)</code> Display the thread (mentions) of tweets (the tweet text, time, id, the mentioned user's id, and the user name with screen name) posted by Elon Musk, in the order in which they were posted.</li>
<li><code>(hashtags)</code> Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?</li>
<li><code>(topics)</code> What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets? Products include Falcon 9, Starlink Satellites, Model 3 cars, etc. </li>
<li><code>(trending)</code> Are there any trends of what Musk tweets about the company?</li>
<li><code>(nature of engagement)</code> What is the percentage of different types of tweets (simple tweet, reply, retweet, quoted tweet) to their overall number of tweets?</li>
</ol>
</div>
---
## Part 4. Implementation
<span class="myhighlight2">Building a Databricks workspace using an Apache Spark cluster. </span>
First, we create a Databricks workspace from the Azure portal and then launch the workspace, which redirects us to the interactive Databricks portal. We create a Spark cluster from the Databricks interactive workspace and configure a notebook on the cluster. In the notebook, we use <code>PySpark</code> to read data from a dataset into a Spark DataFrame. Using the Spark DataFrame, we can run a Spark SQL job to query the data.
<div class = "datasource">
```{r, out.width="50%", fig.show='hold', fig.align='default'}
knitr::include_graphics("assets/workspace.jpeg")
knitr::include_graphics("assets/myCluster.jpg")
```
</div>
```python
# Read the Twitter CSV from DBFS into a Spark DataFrame
twitter_df = spark.read.csv(path='dbfs:/FileStore/dfclean.csv',
                            header="true", multiLine="true")
# Register table so it is accessible via SQL Context
twitter_df.createOrReplaceTempView('twitter_df')
```
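With the temp view registered, Spark SQL statements run directly against it. A quick sanity-check sketch (column names taken from the data description above):

```python
# Sketch: confirm the view is queryable before running the project queries
spark.sql("SELECT created_at, text FROM twitter_df LIMIT 5").show(truncate=False)
```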
---
## Query 1. <code>(path finding)</code>
<div class = "roundedlist">
Display the thread (mentions) of tweets (the tweet text, time, id, the mentioned user's id, and the user name with screen name) posted by Elon Musk, in the order in which they were posted.
</div>
```python
from pyspark.sql.functions import col

# Select the thread fields: who tweeted, when, the text, and who was mentioned
mentionDF = spark.sql(
    "SELECT screen_name, \
            created_at, \
            text, \
            mentions_user_id, \
            mentions_screen_name \
     FROM twitter_df"
)
# Keep only tweets that mention someone, then persist the result as a table
mentionDF.filter(col("mentions_user_id") != "NA") \
    .write \
    .format('csv') \
    .mode('overwrite') \
    .option("overwriteSchema", "true") \
    .saveAsTable("mentionDF")
mentionDF.createOrReplaceTempView('mentionDF')
```
---
```python
display(spark.read.table("mentionDF"))
```
```{r}
reply.data <- tweets.clean %>% dplyr::filter(mentions_user_id != 'NA') %>% dplyr::select(screen_name, text, created_at, mentions_user_id, mentions_screen_name) %>% head(4)
kable(reply.data, escape = FALSE) %>% kable_styling(font_size = 14, bootstrap_options = c("striped", "hover"), html_font = "Roboto Condensed")
```
<span class="myhighlight2">SQL: </span>
```sql
SELECT mentions_screen_name,
       COUNT(*) AS n
FROM mentionDF
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
ORDER BY n DESC;
```
---
### Graphic Report 1.
```{r, fig.show='hold', out.width="100%"}
dfMent <- data.frame(dfMentions) %>% head(12)
colnames(dfMent) <- c("mentions_screen_name", "Freq")
plotMentions <- dfMent %>%
ggplot(aes(y = mentions_screen_name, x = Freq, fill = Freq)) +
geom_bar(stat = "identity", color = "black", cex = 0.25) +
xlab(NULL) + ylab(NULL) +
theme_ipsum_rc(base_size = 9, plot_margin = margin(20,8,20,8),
plot_title_size = 10) +
theme(legend.position = "none")
stable.p <- ggtexttable(dfMent, rows = NULL,
theme = ttheme("mBlue", base_size = 8))
ggarr1 <- ggarrange(plotMentions, stable.p, widths = c(2, 1), heights = c(1, 1), labels = c("Top 12 Most Mentioned Twitter Users"), font.label = list(size = 10, family = "Roboto"))
ragg::agg_png(filename = "assets/report1.png", width = 7087, height = 4895, units = "px", res = 900)
ggarr1
invisible(dev.off())
knitr::include_graphics("assets/report1.png")
```
---
## Query 2. <code>(hashtags)</code>
<div class = "roundedlist">
Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?
</div>
```python
# Count tweets per hashtag
hashtagsDF = spark.sql("SELECT \
                            hashtags, \
                            COUNT(*) AS n \
                        FROM twitter_df \
                        GROUP BY hashtags")
# Drop the no-hashtag rows and persist the counts as a table
hashtagsDF.filter(col("hashtags") != "NA") \
    .write \
    .format('csv') \
    .mode('overwrite') \
    .option("overwriteSchema", "true") \
    .saveAsTable("hashtagsDF")
display(spark.read.table("hashtagsDF"))
hashtagsDF.createOrReplaceTempView('hashtagsDF')
```
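Because a tweet can carry several hashtags in one field, an alternative is to explode the field into individual tags before counting. A hedged sketch, assuming the tags are stored space-separated in a single string column:

```python
import pyspark.sql.functions as f

# Sketch: split the hashtags field into one row per tag, then count.
# Assumes multiple tags are stored space-separated in one string column.
(twitter_df
    .filter(f.col("hashtags") != "NA")
    .withColumn("hashtag", f.explode(f.split(f.col("hashtags"), " ")))
    .groupBy("hashtag")
    .count()
    .orderBy(f.col("count").desc())
    .show())
```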
---
### Graphic Report 2.
```{r, fig.show='hold', out.width="100%"}
dfHashtags <- data.frame(dfHashtags) %>% head(12)
colnames(dfHashtags) <- c("Hashtag", "Freq")
plotHashtags <- dfHashtags %>%
ggplot(aes(y = Hashtag, x = Freq, fill = Freq)) +
geom_bar(stat = "identity", color = "black", cex = 0.25) +
xlab(NULL) + ylab(NULL) +
theme_ipsum_rc(base_size = 9, plot_margin = margin(20,8,20,8),
plot_title_size = 10) +
theme(legend.position = "none")
table.htags <- ggtexttable(dfHashtags, rows = NULL, theme = ttheme("mBlue", base_size = 8))
ggarr2 <- ggarrange(plotHashtags, table.htags, widths = c(2, 1), heights = c(1, 1), labels = c("Top 12 Most Used Hashtags"), font.label = list(size = 10, family = "Roboto"))
ragg::agg_png(filename = "assets/report2.png", width = 7087, height = 4595, units = "px", res = 900)
ggarr2
invisible(dev.off())
knitr::include_graphics("assets/report2.png")
```
---
## Query 3. <code>(topics)</code>
<div class = "roundedlist">
What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets?
</div>
```python
import pyspark.sql.functions as f

data_df = twitter_df.select('text', 'status_id', 'created_at')
# Split the text on spaces, explode into one row per word,
# then count and sort word frequencies
data_df.withColumn('word', f.explode(f.split(f.col('text'), ' '))) \
    .groupBy('word') \
    .count() \
    .sort('count', ascending=False) \
    .show()
```
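Splitting on spaces alone counts "Tesla" and "tesla," as different words; a sketch that normalizes case and strips punctuation before counting (the regex is an assumption, not part of the original pipeline):

```python
import pyspark.sql.functions as f

# Sketch: lowercase and strip punctuation so word variants collapse.
# The kept character class ([a-z0-9@#]) is an illustrative choice.
words = data_df \
    .withColumn('word', f.explode(f.split(f.lower(f.col('text')), '\\s+'))) \
    .withColumn('word', f.regexp_replace('word', '[^a-z0-9@#]', '')) \
    .filter(f.col('word') != '')
words.groupBy('word').count().sort('count', ascending=False).show()
```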
---
### Graphic Report 3.
```{r, out.width="100%"}
dfSentWords <- data.frame(df.words) %>% head(15)
plotWords <- dfSentWords %>%
ggplot(aes(y = reorder(word, -n), x = n, fill = n)) +
geom_bar(stat = "identity", color = "black", cex = 0.25) +
xlab(NULL) + ylab(NULL) +
theme_ipsum_rc(base_size = 9, plot_margin = margin(20,8,20,8),
plot_title_size = 10) +
theme(legend.position = "none")
table.words <- ggtexttable(dfSentWords, rows = NULL, theme = ttheme("mBlue", base_size = 8))
ggarr3 <- ggarrange(plotWords, table.words, widths = c(2, 1), heights = c(1, 1), labels = c("Top 15 Most Used Words"), font.label = list(size = 10, family = "Roboto"))
ragg::agg_png(filename = "assets/report3.png", width = 7087, height = 4595, units = "px", res = 900)
ggarr3
invisible(dev.off())
knitr::include_graphics("assets/report3.png")
```
---
```{r, out.width="100%"}
ggplotly(tt_sent)
```
---
## Query 4. <code>(tweet trends)</code>
<div class = "roundedlist">
Are there any trends of when Elon Musk tweets?
</div>
```python
from pyspark.sql.types import DateType

# Cast the timestamp column to a date; the assignment is needed because
# DataFrames are immutable, and the view must be re-registered to pick it up
twitter_df = twitter_df.withColumn("created_at", twitter_df.created_at.cast(DateType()))
twitter_df.createOrReplaceTempView('twitter_df')
```
```sql
SELECT WEEKDAY(created_at) as created_weekday,
COUNT(*) as n
FROM twitter_df
GROUP BY created_weekday
ORDER BY created_weekday DESC
```
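The equivalent weekday count in the DataFrame API; note that <code>dayofweek</code> numbers days Sunday=1 through Saturday=7, whereas SQL's <code>WEEKDAY</code> uses Monday=0 through Sunday=6:

```python
import pyspark.sql.functions as f

# Sketch: weekday counts via the DataFrame API.
# dayofweek(): Sunday=1 .. Saturday=7 (differs from SQL WEEKDAY()).
(twitter_df
    .withColumn("created_weekday", f.dayofweek("created_at"))
    .groupBy("created_weekday")
    .count()
    .orderBy("created_weekday")
    .show())
```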
```{r, out.width="70%"}
plot.weekday <- ggplot(dftweets, aes(x = week)) + geom_bar(stat = "count", width = 0.75, fill = "cornflowerblue", color = "navy") + theme_ipsum_rc(base_size = 8)
ragg::agg_png(filename = "assets/report4.png", width = 6587, height = 3595, units = "px", res = 900)
plot.weekday
invisible(dev.off())
knitr::include_graphics("assets/report4.png")
```
---
## Conclusion
With Azure, we can run Spark jobs in a Databricks workspace created from the Azure portal. In short, by configuring a notebook in Databricks, we can load data into a Spark DataFrame and then query it with Spark SQL.
The analytics platform offers easy setup, a streamlined workflow, and an integrated Azure workspace, letting users develop and schedule Spark code in a single, easy-to-use environment.
An Azure Databricks table is a collection of structured data, and a Databricks database (also known as a schema) is a collection of these tables. We can store, filter, and apply any operation supported by Apache Spark DataFrames to a table, and query the table using either the Spark API or Spark SQL, as in the sketch below.
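A small sketch of both access paths (table names reused from the queries above):

```python
# Sketch: the persisted tables can be listed and queried from either API.
spark.sql("SHOW TABLES").show()                           # Spark SQL
spark.sql("SELECT COUNT(*) AS n FROM hashtagsDF").show()  # Spark SQL query
spark.table("hashtagsDF").limit(5).show()                 # DataFrame API
```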
We use the Apache Spark DataFrame API in our Azure Databricks workspace to restructure and clean data. The API provides many functions for manipulating data, such as selecting, filtering, joining, and aggregating, and it integrates seamlessly with custom Python, SQL, Scala, and R code. In a Databricks workspace, we can load a data file, create a DataFrame, run SQL queries, and visualize the results.
---
## References
```{r}
library(RefManageR)
BibOptions(check.entries = FALSE,
           bib.style = "authoryear",  # bibliography style
           max.names = 3,             # max author names displayed
           sorting = "nyt",           # name, year, title sorting
           cite.style = "authoryear", # citation style
           style = "markdown",
           hyperlink = FALSE,
           dashed = FALSE)
myBib <- ReadBib("assets/references.bib", check = FALSE)
```
```{r refs, echo=FALSE, results="asis"}
NoCite(myBib)
PrintBibliography(myBib)
```