forked from datacarpentry/R-ecology-lesson
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-visualization-ggplot2.Rmd
413 lines (312 loc) · 13.9 KB
/
05-visualization-ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
title: "Data visualization with ggplot2"
minutes: 60
output: pdf_document
subtitle: Visualizing data in R with the ggplot2 package
layout: topic
---
```{r, echo=FALSE, purl=FALSE}
knitr::opts_chunk$set(fig.keep='last')
```
```{r setup, echo=FALSE, purl=FALSE}
source("setup.R")
```
Authors: **Mateusz Kuzak**, **Diana Marek**, **Hedi Peterson**, **Dmytro Fishman**
#### Disclaimer
We will be using the functions in the ggplot2 package. There are basic plotting
capabilities in basic R, but ggplot2 adds more powerful plotting capabilities.
> ### Learning Objectives
>
> - Visualize some of the [global temperature data](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
> from Figshare [temperature.csv]('https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4938964/temperature.csv')
> - Understand how to plot these data using R ggplot2 package. For more details
> on using ggplot2 see [official documentation](http://docs.ggplot2.org/current/).
> - Building step by step complex plots with the ggplot2 package, see
> [handy cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/08/ggplot2-cheatsheet.pdf)
Load required packages
```{r}
# plotting package
library(ggplot2)
# piping / chaining
library(magrittr)
# modern dataframe manipulations
library(dplyr)
```
```{r}
temperature_raw <- read.csv('data/temperature.csv')
```
## Data cleaning and preparing for plotting
Let's look at the summary
```{r}
summary(temperature_raw)
```
There are few things we need to clean in the dataset.
There are missing values for City and Country in some records. Let's remove those.
```{r}
temperature_complete <- temperature_raw %>%
filter(!is.na(City)) %>%
filter(!is.na(Country))
```
We saw in summary, there were NA's in AverageTemperatureFahr and
AverageTemperatureUncertaintyFahr Let's remove rows with missing values
in those fields. In fact, let's combine this with
removing empty City and Country, so we have one command and don't make lots
of intermediate variable names. This is where piping becomes really handy!
```{r}
temperature_complete <- temperature_raw %>%
filter(!is.na(City)) %>%
filter(!is.na(Country)) %>%
filter(!is.na(AverageTemperatureFahr)) %>%
filter(!is.na(AverageTemperatureUncertaintyFahr))
```
`day` field has no other values besides `1`, let's remove it from our data.
```{r}
# get rid of `day` column
temperature_complete <- select(temperature_complete, -(day))
head(temperature_complete)
```
We already saw how changing Fahrenheit to Celsius helps to understand
the data better, let's convert AverageTemperatureFahr and
AverageTemperatureUncertaintyFahr into Celsius.
```{r}
# Use mutate to convert Fahrenheit to Celsius and store values in
# separate columns
temperature_complete <- temperature_complete %>%
mutate(AverageTemperatureCelsius = (AverageTemperatureFahr-32)*(5/9)) %>%
mutate(AverageTemperatureUncertaintyCelsius = (AverageTemperatureUncertaintyFahr-32)*(5/9))
```
Now we can actually get rid of unnecessary columns with Fahrenheit information by
assigning `NULL` value to these columns.
```{r}
# Remove AverageTemperatureFahr and AverageTemperatureUncertaintyFahr
# by assigning them with NULL
temperature_complete$AverageTemperatureFahr <- NULL
temperature_complete$AverageTemperatureUncertaintyFahr <- NULL
head(temperature_complete)
```
Make simple scatter plot of `AverageTemperatureCelsius` as a function of
`year`, using basic R plotting capabilities.
```{r base-plot}
plot(x = temperature_complete$year, y = temperature_complete$AverageTemperatureCelsius)
```
## Plotting with ggplot2
We will make the same plot using the `ggplot2` package.
`ggplot2` is a plotting package that makes it simple to create complex plots
from data in a dataframe. It uses default settings, which help creating
publication quality plots with a minimal amount of settings and tweaking.
ggplot graphics are built step by step by adding new elements.
To build a ggplot we need to:
- bind the plot to a specific data frame using the `data` argument
```{r, eval=FALSE}
ggplot(data = temperature_complete)
```
- define aesthetics (`aes`), that maps variables in the data to axes on the plot
or to plotting size, shape color, etc.,
```{r, eval=FALSE}
ggplot(data = temperature_complete, aes(x = year, y = AverageTemperatureCelsius))
```
- add `geoms` -- graphical representation of the data in the plot (points,
lines, bars). To add a geom to the plot use `+` operator:
```{r first-ggplot}
ggplot(data = temperature_complete, aes(x = year, y = AverageTemperatureCelsius)) +
geom_point()
```
Notes:
- Anything you put in the `ggplot()` function can be seen by any geom layers that you add.
i.e. these are universal plot settings
- This includes the x and y axis you set up in `aes()`
## Modifying plots
- adding transparency (alpha)
```{r adding-transparency}
ggplot(data = temperature_complete, aes(x = year, y = AverageTemperatureCelsius)) +
geom_point(alpha = 0.1)
```
- adding colors
```{r adding-colors}
ggplot(data = temperature_complete, aes(x = year, y = AverageTemperatureCelsius)) +
geom_point(alpha = 0.1, color = "blue")
```
## Boxplot
Visualising the distribution of weight within each species.
```{r boxplot}
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureCelsius)) +
geom_boxplot()
```
By adding points to boxplot, we can see particular measurements and the
abundance of measurements.
```{r boxplot-with-points}
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureCelsius)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(alpha = 0)
```
Notice how the boxplot layer is on top of the jitter layer? Play around with
the order of geoms and adjust transparency to see how to build up your plot in
layers.
```{r, eval=FALSE, purl=FALSE}
## For example
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureCelsius)) +
geom_boxplot(alpha = 0.9, size = 0.9) +
geom_jitter(alpha = 0.1, color = "tomato")
```
> ### Challenges
Boxplots are useful summaries, but hide the *shape* of the distribution. For example, if
there is a bimodal distribution, this would not be observed with a boxplot. An alternative is
a violin plot (sometimes known as a beanplot), where the shape (of the density of points) is
drawn.
> - Replace the box plot with a violin plot; see `geom_violin()`
```{r, eval=FALSE, purl=FALSE}
## Answer
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureCelsius)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_violin(alpha = 0)
```
> - Create boxplot for `AverageTemperatureUncertaintyCelsius`, which is a 95%
> confidence interval from average temperature, it is always positive.
```{r, eval=FALSE, purl=FALSE}
## Answer
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureUncertaintyCelsius)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(alpha = 0)
```
In many types of data, it is important to consider the *scale* of the observations. For example, it may be
worth changing the scale of the axis to better distribute the observations in the space of the plot.
Changing the scale of the axes is done similarly to adding/modifying other components
(i.e., by incrementally adding commands).
> - Represent AverageTemperatureUncertaintyCelsius on the log10 scale; see `scale_y_log10()`
```{r, eval=FALSE, purl=FALSE}
## Answer
ggplot(data = temperature_complete, aes(x = country_id, y = AverageTemperatureUncertaintyCelsius)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(alpha = 0) +
scale_y_log10()
```
## Plotting time series data
Let's calculate average temperature per year per each country.
To do that we need to group data first by year and country
and summarise it by computing mean from every group.
```{r}
yearly_country_temp <- temperature_complete %>%
group_by(year, Country) %>%
summarise(countryAverage = mean(AverageTemperatureCelsius))
```
Timelapse data can be visualised as a line plot with years on x axis and temperature
on y axis.
```{r first-time-series}
ggplot(data = yearly_country_temp, aes(x = year, y = countryAverage)) +
geom_line()
```
Unfortunately this does not work, because we plot data for all the countries
together. We need to tell ggplot to split graphed data by `Country`
```{r time-series-by-countries}
ggplot(data = yearly_country_temp, aes(x = year, y = countryAverage, group = Country)) +
geom_line()
```
We will be able to distinguish countries even better in the plot if we add colors.
```{r time-series-with-colors}
ggplot(data = yearly_country_temp, aes(x = year, y = countryAverage, group = Country, color = Country)) +
geom_line()
```
## Faceting
ggplot has a special technique called *faceting* that allows to split one plot
into multiple plots based on some factor. We will use it to plot one time
series for each species separately.
```{r first-facet}
ggplot(data = yearly_country_temp, aes(x = year, y = countryAverage, color = Country)) +
geom_line() +
facet_wrap(~Country)
```
Now we would like to split line in each plot by city of each country.
To do that we need to make measurements in dataframe grouped by country and city.
> ### Challenges:
>
> - group by year, Country, City
```{r}
yearly_country_city_temp <- temperature_complete %>%
group_by(year, Country, City) %>%
summarise(cityAverage = mean(AverageTemperatureCelsius))
```
> - make the faceted plot splitting further by City (within single plot)
```{r facet-by-country-and-city}
ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = Country, group = City)) +
geom_line() +
facet_wrap(~ Country)
```
Usually plots with white background looks more readable in the publications.
We can set the background to white using the function `theme_bw()`.
```{r facet-by-country-and-city-white-bg}
ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = Country, group = City)) +
geom_line() +
facet_wrap(~ Country) +
theme_bw()
```
> We can improve the plot by coloring by city instead of country (countries are
> already in separate plots, so we don't need to distinguish them better)
```{r facet-by-country-and-city-colored}
ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = City, group = City)) +
geom_line() +
facet_wrap(~ Country) +
theme_bw()
```
So far our result looks quite good, but it is yet far from being publishable.
What are other ways one can improve it? Take a look at the ggplot2
cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/08/ggplot2-cheatsheet.pdf),
and write down at least three more ideas (can leave them as comments
to Etherpad).
Now, let's change names of axes to something more informative than 'year'
and 'cityAverage' and add title to this figure:
```{r number_species_year_with_right_labels}
ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = City, group = City)) +
geom_line() +
facet_wrap(~ Country) +
labs(title = 'Average temperature',
x = 'Year of observation',
y = 'Average temperature') +
theme_bw()
```
Now, thanks to our efforts, axes have much more informative names,
yet quite small so it could be hard to read them.
Let's change their size (and font just in sake of fun):
```{r number_species_year_with_right_labels_xfont_size}
ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = City, group = City)) +
geom_line() +
facet_wrap(~ Country) +
labs(title = 'Average temperature',
x = 'Year of observation',
y = 'Average temperature') +
theme_bw(base_size = 16) +
theme(plot.title=element_text(family="Times", face="bold", size=20))
```
After our manipulations we notice that the values on the x-axis are still not
properly readable. Let's change the orientation of the labels and adjust them
vertically and horizontally so they don’t overlap. You can use a 90 degree
angle, or experiment to find the appropriate angle for diagonally oriented
labels.
```{r number_species_year_with_right_labels_xfont_orientation}
my_perfect_plot <- ggplot(data = yearly_country_city_temp, aes(x = year, y = cityAverage, color = City, group = City)) +
geom_line() +
facet_wrap(~ Country) +
labs(title = 'Average temperature',
x = 'Year of observation',
y = 'Average temperature') +
theme_bw(base_size = 16) +
theme(plot.title=element_text(family="Times", face="bold", size=20),
axis.text.x = element_text(colour="grey20", size=12, angle=315, hjust=.5, vjust=.5),
axis.text.y = element_text(colour="grey20", size=12))
my_perfect_plot
```
Now, labels became bigger and readable, but there are still few things
that one could be improved. Please, take another five minutes and try to
add another one or two things, to make it look even more beautiful.
Use ggplot2 cheat sheet, which we linked earlier for inspiration.
Here are some ideas:
* See if you can change thickness of the lines.
* Can you find a way to change the name of the legend?
What about its labels?
* Use different color palette to improve the look
(http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/)
When you created your perfect plot. You can save it to the file in your favourite format.
You can easily change the size by specifying the width and height of your plot.
```{r number_species_year_with_right_labels_xfont_save, eval=FALSE}
ggsave(filename = "figures/observed_species_in_time.png", plot = my_perfect_plot, width=15, height=10)
```
Enjoy plotting with ggplot2!