generated from Cal-Poly-Advanced-R/Lab_2
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlab2.Rmd
274 lines (220 loc) · 16.2 KB
/
lab2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
---
title: "Lab 2 Advanced Data Visualization"
author: "Gavin Martinez, Brandon Kim, Nils Berzins, Shreya Ravilla"
date: "2023-04-13"
output: html_document
---
### Lab 2 Part 1 - First Data Visualization Improvement
**1. While there are certainly issues with this image, do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?**
The graph seems to show a plot of proportions of people who believe vaccines are safe per country. And each the plot seems to be separated by region, with certain countries being labeled. For each plot, there is a line within the plots, showing the median per each region.
**2. List the variables that appear to be displayed in this visualization.**
The variables in this visualization are countries, region of that country, and % of people who believe vaccines are safe per country.
On the x-axis, there is the % of the country's population that believes vaccines are safe. On the y-axis, it appears to be the index of the country in that region with 1 being the smallest % of believing vaccines are safe.
**3. Now that you're versed in the grammar of graphics (ggplot), list the aesthetics used and which variables are specified for each.**
The different regions are separated by the color of the points. Some of the countries are labeled. And the median of % of people who believe vaccines are safe per region is represented by a line within the plots.
**4. What type of graph would you call this?**
This wouldn't really be a scatter plot, as scatter plots take quantitative variables for both x and y. I would call this a segmented bar plot with points instead of bars. Each point represents a quantitative value for each country, which is what a bar plot should do. An argument could also be made that this visualization is multiple box plots, where we are creating box plots for each region. As the median is represented like a boxplot, and we have a decent tell of the spread. The only issue is, there are no boxes.
**5. List all of the problems or things you would improve about this graph.**
* No need for axis labels on both the top and bottom.
* y-axis really doesn't represent anything, so there's no need for increasing heights within each region.
* Really weird choice of graph in general. Something like this would be better represented by a box plot.
* Double labeling of colors; no need of a legend.
* Grid lines look off. They should be connecting between each section.
* The stacking leads to the graph being too tall to fit in one screen.
* The height of each corresponding region is associate purely with the number of countries in that region
* (Just Nils) The Europe/Former Soviet Union distinction bothers me. Being a proud Latvian, I'm disappointed to see that Wellcome Global Monitor is still identifying my proud country by its former oppressor. This form of categorization (however convenient for the analyst) is antiquated and insensitive to those who have since claimed their own independence from the Soviet Union.
* There is no mention of what the verticality of the dots means, it is eventually inferrable to be the country index.
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
```{r libaries}
library(tidyverse)
library(readxl)
library(ggthemes)
library(tidygeocoder)
library(leaflet)
library(readxl)
library(gganimate)
```
```{r readxl}
df_world <- read_xlsx(here::here("wgm2018-dataset-crosstabs-all-countries.xlsx"), sheet = 2)
```
```{r head}
head(df_world)
```
```{r dplyr}
plt <- df_world %>%
filter(Q23 == 1) %>%
group_by(WP5) %>%
summarize(vacc_agree = sum(Q25 %in% c(1, 2)),
total = n(),
Regions_Report) %>%
filter(row_number() == 1,
Regions_Report != 0) %>%
mutate(prop = vacc_agree/total * 100,
region = case_when(
Regions_Report %in% c(1, 2, 4, 5) ~ "Sub-Saharan Africa",
Regions_Report %in% c(6:8) ~ "Americas",
Regions_Report %in% c(9:12) ~ "Asia",
Regions_Report %in% c(3, 13) ~ "Middle East and North Africa",
Regions_Report %in% c(14:18) ~ "Europe and Oceania")
)
```
```{r ggplot}
ggplot(plt, aes(x = prop, y = region, fill = region)) +
geom_boxplot() +
geom_text(subset(plt, WP5 == 29),
mapping = aes(label = "Japan"),
vjust = 2,
size = 3) +
labs(title = "Boxplots of % of people who believe \nvaccines are safe, \nby global region",
x = "Percentages",
y = "") +
theme_economist() +
theme(legend.position = "none")
```
### Lab 2 Part 2: Second Data Visualization Improvement
**1. Select a data visualization in the report that you think could be improved. Be sure to cite both the page number and figure title. Do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?**
The plot that I think could be improved is th eon titled "People seeking science information by region and internet access" on page 37. The graph is trying to show the proportion of citizens in each country surveyed that has responded yes to whether they have internet access, sought science information and have internet access, or sought science information and does not have internet access. It seems the authors meant to highlight the difference in internet access and willingness to search the internet from country to country.
**2. List the variables that appear to be displayed in this visualization.**
The variables that are displayed are country, and percentage of citizens within each country that answered yes for each of the three respective questions.
**3. Now that you're versed in the grammar of graphics (ggplot), list the aesthetics used and which variables are specified for each.**
x = percentage, group = question, fill = question
**4. What type of graph would you call this?**
Bar plot
**5. List all of the problems or things you would improve about this graph.**
It is hard to see which country is overall better or has the highest percentage for each question at first glance.
**6. Improve the visualization above by either re-creating it with the issues you identified fixed OR by creating a new visualization that you believe tells the same story better.**
```{r libaries2}
library(tidyverse)
library(here)
library(gganimate)
```
```{r data}
questions <- data.frame(country = rep(c("World", "Eastern Africa", "Central Africa", "North Africa", "Southern Africa", "Western Africa", "Central America & Mexico", "Northern America", "South America", "Central Asia", "East Asia", "Southeast Asia", "South Asia", "Middle East", "Eastern Europe", "Northern Europe", "Southern Europe", "Western Europe", "AUS & NZ"), each = 3),
question = rep(c("Has internet access", "Sought health info & has internet access", "Sought health info & does not have internet access"), times = 19),
proportion = c(.54,.52,.28,.24,.56,.35,.27,.60,.40,.48,.47,.23,.46,.43,.34,.34,.46,.29,.55,.52,.34,.91,.74,.38,.62,.51,.33,.55,.42,.34,.67,.51,.3,.46,.46,.27,.25,.35,.22,.72,.58,.38,.74,.47,.33,.91,.59,.38,.82,.61,.47,.9,.63,.51,.93,.66,.31))
questions <- questions %>%
mutate(rank = rank(-proportion, ties.method = 'random'))
```
```{r}
p.static <- ggplot(questions, aes(rank, group = country,
fill = as.factor(country), color = as.factor(country))) +
geom_tile(aes(y = proportion/2,
height = proportion,
width = .9), alpha = 0.8, color = NA) +
geom_text(aes(y = 0, label = paste(country, " ")), vjust = 0.2, hjust = 1, size = 6) +
geom_text(aes(y = proportion, label = proportion, hjust = 0), size = 6) +
coord_flip(clip = "off", expand = FALSE) +
scale_y_continuous(labels = scales::comma) +
scale_x_reverse() +
guides(color = FALSE, fill = FALSE) +
theme(axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none",
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_line(size = 0.1, color = "black"),
panel.grid.minor.x = element_line(size = 0.1, color = "black"),
plot.title = element_text(size = 25),
plot.subtitle = element_text(size=18, face = "italic", color = "black", hjust = 0.5, vjust = -1),
plot.background = element_blank(),
plot.margin = margin(2, 2, 2, 6, "cm"))
anim <- p.static +
transition_states(question, transition_length = 4, state_length = 4) +
labs(title = 'Proportion of citizens that answered yes to: {closest_state}')
animate(anim, 200, fps =5, width = 1200, height = 1000)
```
### Lab 2 Part 2 - Third Data Visualization Improvement
1. Select a data visualization in the report that you think could be improved. Be sure to cite both the page number and figure title. Do your best to tell the story of this graph in words. That is, what is this graph telling you? What do you think the authors meant to convey with it?
*Chart 4.5: The combined view of how people feel about the benefits of science on a personal and country level, Page 84*
This chart is describing the four analyzable outcomes between one's views on the extent to which science benefits society and on the extent to which sciecne benefits people normally. The four outcomes are: The Excluded, The Enthusiasts, Skeptics, and The Included. The authors are using proportionately sized circles within these four categorize to in attempt to visually describe which of these four opinions are more popular than the others.
2. List the variables that appear to be displayed in this visualization.
2 Categorical Variables:
* Views on extent to which science benefits society (most, some, very few)
* Views on extent to which science benefits people normally (Yes, No, Don't Know, Refused)
The results are then displayed as a percent of all of those surveyed.
3. Now that you're versed in the grammar of graphics (ggplot), list the aesthetics used and which variables are specified for each.
* Aesthetics - "Y": Benefiting society, "X": Benefiting people
4. What type of graph would you call this?
This graph visually looks like a simple correlogram/bubble plot however represents density of the respondents who answered a certain way as opposed to correlative relationship between numeric variables. Functionally, the graph matches a treemap the most with the unsure/chose not to answer respondents removed.
5. List all of the problems or things you would improve about this graph.
* The categories implied by the axis are not clear
* The specialized naming scheme is confusing
* The survey questions and axis don't seem to mean the same thing
* Axis labels are not centered
* Key not center right
6. Improve the visualization above by either re-creating it with the issues you identified fixed OR by creating a new visualization that you believe tells the same story better.
```{r Data Cleaning}
countries <- c("United States", "Egypt", "Morocco", "Lebanon", "Saudi Arabia", "Jordan", "Turkey", "Pakistan", "Indonesia", "Bangladesh", "United Kingdom", "France", "Germany", "Netherlands", "Belgium", "Spain", "Italy", "Poland", "Hungary", "Czech Republic", "Romania", "Sweden", "Greece", "Denmark", "Iran", "Singapore", "Japan", "China", "India", "Venezuela", "Brazil", "Mexico", "Nigeria", "Kenya", "Tanzania", "Israel", "Palestinian Territories", "Ghana", "Uganda", "Benin", "Madagascar", "Malawi", "South Africa", "Canada", "Australia", "Philippines", "Sri Lanka", "Vietnam", "Thailand", "Cambodia", "Laos", "Myanmar", "New Zealand", "Botswana", "Ethiopia", "Mali", "Mauritania", "Mozambique", "Niger", "Rwanda", "Senegal", "Zambia", "South Korea", "Taiwan", "Afghanistan", "Belarus", "Georgia", "Kazakhstan", "Kyrgyzstan", "Moldova", "Russia", "Ukraine", "Burkina Faso", "Cameroon", "Sierra Leone", "Zimbabwe", "Costa Rica", "Albania", "Algeria", "Argentina", "Armenia", "Austria", "Azerbaijan", "Bolivia", "Bosnia and Herzegovina", "Bulgaria", "Burundi", "Chad", "Chile", "Colombia", "Comoros", "Republic of Congo", "Croatia", "Cyprus", "Dominican Republic", "Ecuador", "El Salvador", "Estonia", "Finland", "Gabon", "Guatemala", "Guinea", "Haiti", "Honduras", "Iceland", "Iraq", "Ireland", "Ivory Coast", "Kuwait", "Latvia", "Liberia", "Libya", "Lithuania", "Luxembourg", "Macedonia", "Malaysia", "Malta", "Mauritius", "Mongolia", "Montenegro", "Namibia", "Nepal", "Nicaragua", "Norway", "Panama", "Paraguay", "Peru", "Portugal", "Serbia", "Slovakia", "Slovenia", "Eswatini", "Switzerland", "Tajikistan", "The Gambia", "Togo", "Tunisia", "Turkmenistan", "United Arab Emirates", "Uruguay", "Uzbekistan", "Yemen", "Kosovo", "Northern Cyprus")
#Chat GPT took the list from excel and made it a comma separated list, all hail Chat GPT
data_og <- read_xlsx(here::here("wgm2018-dataset-crosstabs-all-countries.xlsx"), sheet = 2)
#Question 1: In general, do you think the work that scientists do benefits most, some, or very few people in this country?
#Question 2: In general, do you think the work that scientists do benefits people like you in this country?
data_clean <- data_og |> #Changing the Q1 and Q2 values
select(WP5, Q16, Q17) |>
mutate(Q16 = case_when(
Q16 == 1 ~ "Most",
Q16 == 2 ~ "Some",
Q16 == 3 ~ "Very Few",
Q16 == 98 ~ "DK",
Q16 == 99 ~ "R")
) |>
mutate(Q17 = case_when(
Q17 == 1 ~ "Yes",
Q17 == 2 ~ "No",
Q17 == 98 ~ "DK",
Q17 == 99 ~ "R")
) |>
filter(Q16 %in% c("Most", "Some", "Very Few"),
Q17 %in% c("Yes", "No")) |>
rename("Country" = WP5, "Q1" = Q16, "Q2" = Q17) |>
mutate(`Country` = countries[`Country`])
data_clean <- data_clean |> #importing country names and adding the combo column, case_when() wasn't working, sorry for ifelse spam
mutate (Combo = ifelse(`Q1` %in% "Most" & `Q2` %in% "Yes", "Benefits Them & Most", NA),
Combo = ifelse(`Q1` %in% "Very Few" & `Q2` %in% "Yes", "Benefits Them But Not Most", Combo),
Combo = ifelse((`Q1` %in% "Some" | `Q1` %in% "Most") & `Q2` %in% "No", "Doesn't Benefit Them But Benefits Most", Combo),
Combo = ifelse(`Q1` %in% "Very Few" & `Q2` %in% "No", "Doesn't Benefit Them Nor Most", Combo)) |>
na.omit()
```
```{r Applying Geocode to each country}
data_clean_geo <- data_clean |> #Getting Geocode data for each country
geocode(country = `Country`) |>
na.omit()
```
```{r Leaflet Map}
leaflet() |>
addTiles() |>
addCircles(lng = data_clean_geo$long,
lat = data_clean_geo$lat,
radius = length(data_clean_geo$Combo[data_clean_geo$Combo %in% "Benefits Them & Most"]),
popup = data_clean_geo$Combo[data_clean_geo$Combo %in% "Benefits Them & Most"],
color = "#FFCD27",
opacity = .5) |>
addCircles(lng = data_clean_geo$long,
lat = data_clean_geo$lat,
radius = length(data_clean_geo$Combo[data_clean_geo$Combo %in% "Benefits Them But Not Most"]),
popup = data_clean_geo$Combo[data_clean_geo$Combo %in% "Benefits Them But Not Most"],
color = "#61933F",
opacity = .5) |>
addCircles(lng = data_clean_geo$long,
lat = data_clean_geo$lat,
radius = length(data_clean_geo$Combo[data_clean_geo$Combo %in% "Doesn't Benefit Them But Benefits Most"]),
popup = data_clean_geo$Combo[data_clean_geo$Combo %in% "Doesn't Benefit Them But Benefits Most"],
color = "#2FC1D2",
opacity = .5) |>
addCircles(lng = data_clean_geo$long,
lat = data_clean_geo$lat,
radius = length(data_clean_geo$Combo[data_clean_geo$Combo %in% "Doesn't Benefit Them Nor Most"]) ,
popup = data_clean_geo$Combo[data_clean_geo$Combo %in% "Doesn't Benefit Them Nor Most"],
color = "#003D57",
opacity = .5) |>
addLegend(position = "topright", title = "The combined view of how people feel about the benefits
of science on a personal and country level per country", labels = c("Benefits Them & Most", "Benefits Them But Not Most", "Doesn't Benefit Them But Benefits Most", "Doesn't Benefit Them Nor Most"), colors = c("#FFCD27", "#61933F", "#2FC1D2", "#003D57"))
```