---
title: "04 - Plotting"
output:
  html_notebook:
    toc: yes
    toc_float: yes
---
# Plotting is essential and can be done in base R
Every researcher knows that communicating findings is important, and we often do that with plots. We can create finely tuned plots in base R, without any additional packages.
Let's read in a dataset called `iris` and take a look at it.
```{r}
library(readr)
iris <- read_delim("datasets/iris.txt", delim="\t")
str(iris)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
```
This dataset has measurements on plants from 3 species of iris: the length and width of the petals and of the sepals (the green parts that often surround the flower).
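A quick sanity check on the dataset can be sketched as follows. This assumes R's built-in `iris` data frame as a stand-in for `datasets/iris.txt`; the name `iris_df` is ours, chosen so it doesn't overwrite the `iris` object we just read in.

```{r}
# Assumption: the built-in dataset has the same columns as datasets/iris.txt.
iris_df <- datasets::iris

# Count how many plants were measured for each species.
species_counts <- table(iris_df$Species)
species_counts

# Summarize the four numeric measurements.
summary(iris_df[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
```

Checking counts and ranges like this before plotting makes it easier to spot a column that was read in with the wrong type.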
Let's make three main kinds of plot using base R: a scatterplot, a histogram, and a boxplot. Then we'll make these same plots using ggplot2, an R package specifically designed for making plots and figures.
# Scatterplot
The basic plot function is `plot(x, y, ...)`, with `x` corresponding to your x-variable and `y` to your y-variable.
Let's plot petal length as a function of sepal length.
```{r}
plot(iris$Sepal.Length, iris$Petal.Length)
```
![scatterplot](plots/plot-1.png)
We see a scatterplot that shows a positive association between sepal and petal length. To add a linear regression line, you need two functions: `lm()` and `abline()`. `lm()` fits a linear model from a formula of the form `lm(y ~ x)`, while `abline()` draws the fitted line on the most recent plot. Let's try it out.
```{r}
plot(iris$Sepal.Length, iris$Petal.Length)
abline(lm(iris$Petal.Length ~ iris$Sepal.Length))
```
![Trendline](plots/plot-2.png)
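If you want to look at the fitted model as well as draw it, one option is to store the result of `lm()` first. This is a minimal sketch, assuming the built-in `iris` data frame as a stand-in for our file (`iris_df` and `fit` are names we made up for illustration).

```{r}
# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Fit petal length (y) as a linear function of sepal length (x).
fit <- lm(Petal.Length ~ Sepal.Length, data = iris_df)

# Intercept and slope of the fitted line.
coef(fit)

# Draw the scatterplot, then add the fitted line on top of it.
plot(iris_df$Sepal.Length, iris_df$Petal.Length)
abline(fit, col = "red", lwd = 2)
```

Using the `data` argument keeps the formula short, and the stored `fit` object can be passed to `summary()` for more detail.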
# Histogram
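`plot()` defaults to a scatterplot. Setting the `type` argument to `'h'` draws histogram-like vertical lines, one per observation in the order they appear in the data. Note that this does not bin the values; for a true histogram of binned counts, base R provides `hist()`.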
```{r}
plot(iris$Sepal.Length, type = 'h')
```
![histogram](plots/plot-3.png)
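For comparison, here is a sketch of a binned histogram with base R's `hist()`. It assumes the built-in `iris` data frame as a stand-in for our file; `iris_df` and `h` are illustrative names, and `breaks = 10` is only a suggestion that R may adjust to get tidy bin edges.

```{r}
# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# hist() bins the values and draws one bar per bin.
h <- hist(iris_df$Sepal.Length,
          breaks = 10,
          main = "Sepal length",
          xlab = "Sepal length (cm)")

# The returned object holds the bin edges and the count in each bin.
h$breaks
h$counts
```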
# Boxplot
To make a boxplot, you can use the function `boxplot(y ~ x, data = dataframe)`, where `y` is the numeric variable and `x` is the grouping variable. Let's plot sepal length as a function of species.
```{r}
boxplot(Sepal.Length ~ Species, data = iris)
```
![boxplot](plots/plot-4.png)
If you ever want to change the order in which the categories are displayed on the x-axis, you need to reorder the factor levels of that column.
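Reordering the levels might look like this. It is a sketch that assumes the built-in `iris` data frame as a stand-in for our file (`iris_df` is our own name), and the order chosen here is arbitrary.

```{r}
# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Set the factor levels in the order we want them on the x-axis.
iris_df$Species <- factor(iris_df$Species,
                          levels = c("virginica", "versicolor", "setosa"))

# The boxes now appear in the new level order.
boxplot(Sepal.Length ~ Species, data = iris_df)
```

The same approach works when the column is a character vector, as it is after `read_delim()`, because `factor()` converts it for you.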
Plotting in base R is flexible and you can do a lot with it, but many people find ggplot2 more user friendly and easier to learn. Let's move on and learn how to make these plots with the ggplot2 package. Whichever you decide to use, there is a lot of help online if you need it.
# EXERCISE
Pull up the help page for `plot`. What arguments would you use to change the x-axis and y-axis labels?
Change the axis labels for the first graph we made.
```{r}
plot(iris$Sepal.Length, iris$Petal.Length)
```
[Answer](exercises/04_Plotting_Answers.Rmd)
# ggplot2
I'm just going to jump in and make a scatter plot of petal length and width.
# EXERCISE
Copy and paste the code below and look at the plot.
```{r}
library(ggplot2)
ggplot(data = iris, aes(x=Petal.Length, y=Petal.Width)) + geom_point()
```
![First ggplot](plots/ggplot-1.png)
ggplot2 works on the idea that every plot has three essential elements:
| Element | Description |
| ---------- | -------------------------------------- |
| Data | The dataset being plotted. |
| Aesthetics | The scales onto which we map our data. |
| Geometries | The visual elements used for our data. |
In other words, we have the dataset, the space onto which we will plot our data (axes), and the visualization we will use to plot each datapoint (scatterplot, barplot, boxplot). These are the 3 elements we will discuss today.
# ggplot2 is expansive
ggplot2 is great because you can start with a small amount of code and it can expand rapidly to customize the plot.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Width)) +
  geom_violin(alpha=0.5, show.legend = FALSE, aes(fill = Species)) +
  stat_boxplot(geom ='errorbar', width=0.25) +
  geom_boxplot(width=0.5, show.legend = FALSE) +
  geom_dotplot(binaxis="y", stackdir="center", position="dodge", dotsize=0.75, aes(fill=Species)) +
  ggtitle("Sepal width for three species of iris") +
  labs(x="Iris Species", y="Sepal Width") +
  theme_classic() +
  theme(axis.title = element_text(size=15),
        plot.title = element_text(size = 24, hjust = 0.5),
        axis.text = element_text(size=13, color = "black")) +
  scale_fill_manual(values = c("#3A9276","#F14705","#4F68A1"))
```
![Advanced iris plot](plots/ggplot-7.png)
Here is another example, a more traditional publication-ready plot. [Reference](https://guangchuangyu.github.io/)
```{r}
Normal <- c(0.83, 0.79, 0.99, 0.69)
Cancer <- c(0.56, 0.56, 0.64, 0.52)
m <- c(mean(Normal), mean(Cancer))
s <- c(sd(Normal), sd(Cancer))
d <- data.frame(V=c("Normal", "Cancer"), mean=m, sd=s)
d$V <- factor(d$V, levels=c("Normal", "Cancer"))
ggplot(d, aes(V, mean, fill=V, width=.5)) +
  geom_errorbar(aes(ymin=mean, ymax=mean+sd, width=.2),
                position=position_dodge(width=.8)) +
  geom_bar(stat="identity", position=position_dodge(width=.8), colour="black") +
  scale_fill_manual(values=c("grey80", "white")) +
  theme_bw() + theme(legend.position="none") + xlab("") + ylab("") +
  theme(axis.text.x = element_text(face="bold", size=14),
        axis.text.y = element_text(face="bold", size=14)) +
  scale_y_continuous(expand=c(0,0), limits=c(0, 1.2), breaks=seq(0, 1.2, by=.2)) +
  geom_segment(aes(x=1, y=.98, xend=1, yend=1.1)) +
  geom_segment(aes(x=2, y=.65, xend=2, yend=1.1)) +
  geom_segment(aes(x=1, y=1.1, xend=2, yend=1.1)) +
  annotate("text", x=1.5, y=1.06, label="*")
```
![Publication Ready Plot](plots/ggplot-8.png)
In the interest of time, we aren't going to go through all of these today, but I just want to show you that ggplot2 is very versatile.
# Syntax of ggplot
The basic syntax of ggplot2 is to start the line with the function `ggplot()`. In the parentheses, you want at minimum to name your dataset.
```{r}
ggplot(iris)
```
![Empty plot](plots/ggplot-9.png)
Notice that it opens the Plot window, but nothing is there. That's because we haven't yet told it what to do with our dataset.
Next we have to give it the aesthetics, that is, how we want to represent our data.
We do this by adding a call to `aes()` as an argument. Note that the aesthetics have to be within its parentheses.
The most straightforward thing to add is the columns we want to plot on the axes.
```{r}
ggplot(iris, aes(x=Sepal.Length))
```
![Plot with coordinates only](plots/ggplot-10.png)
Notice it opens the Plot window, and there's even an axis, but no data has been plotted. This is because we haven't told it what kind of plot (geometry) we want.
ggplot has several plot types, or geometries, that each start with `geom_`. The ones you'll likely use the most are:
* `geom_point` - scatter plots
* `geom_histogram` - for histograms
* `geom_boxplot` - for boxplots
* `geom_bar` - for barplots
Let's try plotting the same thing but add the geom_point.
```{r}
ggplot(iris, aes(x=Sepal.Length)) + geom_point()
# Error: geom_point requires the following missing aesthetics: y
```
We get an error, because scatterplots need both an x and a y axis.
Once we supply a `y` aesthetic, the scatterplot works, and we can also add another layer. We'll add one called `geom_smooth`, which adds a trend line or smoothed curve.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point() + geom_smooth()
```
![Scatter plus smooth](plots/ggplot-11.png)
The default for `geom_smooth()` is a smoothed curve (a loess fit for a dataset of this size). We can get a straight trend line instead by adding the argument `method = "lm"`. Remember, you can use the `help()` function to find out all the options available for a function. ggplot2 also has extensive and easy-to-understand documentation (see [resources document](https://github.com/gaiusjaugustus/intro-r-20170825/blob/master/resources/CheatSheetsAndResources.Rmd#ggplot-cheatsheet-and-help)).
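A straight trend line can be sketched like this. The example assumes the built-in `iris` data frame as a stand-in for our file (`iris_df` and `p_lm` are our own names); `se = FALSE` is an optional extra that hides the confidence band.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Scatterplot with a linear trend line instead of the default smoother.
p_lm <- ggplot(iris_df, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p_lm
```

Leaving out `se = FALSE` draws a shaded band around the line, which is often useful when you want to show the uncertainty of the fit.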
**Note:** Each "layer" is drawn in order, meaning that ggplot2 renders the plot in the order that you type the layers. The last layer will lie on top of the ones before it.
Let's instead try a histogram.
```{r}
ggplot(iris, aes(x=Sepal.Length)) + geom_histogram()
```
![ggplot histogram](plots/ggplot-12.png)
This one works, and you should see the histogram, which shows how many datapoints lie in each bin.
**Note:** You also get a message stating that the bin width wasn't specified, so a default of 30 bins was used.
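To choose the bins yourself, you can pass `binwidth` to `geom_histogram()`. Here is a sketch, assuming the built-in `iris` data frame as a stand-in for our file; the width of 0.2 is an arbitrary choice for illustration, and `iris_df`, `p_hist`, and `bins` are our own names.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Each bar now covers 0.2 units of sepal length.
p_hist <- ggplot(iris_df, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2)
p_hist

# layer_data() returns the computed bins, including the count in each.
bins <- layer_data(p_hist)
sum(bins$count)
```

Try a few values of `binwidth` to see how the apparent shape of the distribution changes.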
# Extra options
There are some extra options for each plot that you can use to make your data stand out more.
## Histogram
Let's start with the histogram we just made and check out a few of the features we can tweak.
```{r}
ggplot(iris, aes(x=Sepal.Length)) + geom_histogram()
```
![Histogram again](plots/ggplot-12.png)
If we add a grouping feature, we can change the fill color of the bars based on species. We do this using the `fill` argument.
```{r}
ggplot(iris, aes(x=Sepal.Length, fill=Species)) + geom_histogram()
```
![fill argument](plots/ggplot-13.png)
These histograms are stacked on top of each other, but what if we want them independent of each other instead? We can use the `position` argument in the `geom_histogram` call. If we set it to `"identity"`, each species gets its own histogram, overlaid on the others. That is difficult to read, so I've also added the `alpha` argument, which controls how see-through each layer is.
```{r}
ggplot(iris, aes(x=Sepal.Length, fill=Species)) + geom_histogram(position="identity", alpha=0.5)
```
![position argument](plots/ggplot-14.png)
Let's compare this to the original plot with the layer 50% see through. Notice how it's all one stacked layer.
```{r}
ggplot(iris, aes(x=Sepal.Length, fill=Species)) + geom_histogram(alpha=0.5)
```
![alpha argument](plots/ggplot-15.png)
## Scatterplots
We can also change the appeal and readability of plots. Let's take a look at scatterplots and how we can change things to help explore our data. First, let's try changing the color. If we give it a continuous variable, it creates a gradient.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Sepal.Width)) + geom_point()
```
![scatterplot with color argument](plots/ggplot-16.png)
If instead we give it a categorical variable, such as `Species`, it assigns colors.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point()
```
![categorical color argument](plots/ggplot-17.png)
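We can also just assign a color that we like. A fixed color is not mapped from the data, so it goes in the geometry rather than in `aes()`.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + geom_point(color="coral")
```
![assigned color](plots/ggplot-18.png)
**Note:** Aesthetics mapped to a column, such as `color=Species`, go inside the `aes()` function. A fixed value such as `color="coral"` goes outside `aes()`, as an argument to the geom. If you put it inside `aes()`, ggplot2 treats `"coral"` as a data value, picks a default color for it, and adds a legend.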
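One way to see which color ggplot2 actually used is to inspect the built layer data with `layer_data()`. The sketch below compares a constant placed inside `aes()`, which is treated as data and mapped through the default color scale, with the same constant passed directly to the geom, which is used as is. It assumes the built-in `iris` data frame as a stand-in for our file; `iris_df`, `p_inside`, and `p_outside` are our own names.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Constant inside aes(): mapped through the default color scale.
p_inside <- ggplot(iris_df, aes(x = Sepal.Length, y = Sepal.Width, color = "coral")) +
  geom_point()

# Constant outside aes(): used directly as the point color.
p_outside <- ggplot(iris_df, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "coral")

# Compare the colors that end up in the built plots.
unique(layer_data(p_inside)$colour)
unique(layer_data(p_outside)$colour)
```

Only the second plot reports the color name we asked for; the first reports whatever the default scale assigned.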
For scatterplots, we can also assign shapes. Shapes only make sense if used with categorical data.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, shape=Species)) + geom_point()
```
![shapes](plots/ggplot-19.png)
Sometimes, points stack up on each other, and we need to spread them out a bit. This is called overplotting, and we can address it by making the points transparent, or by "jittering" the points just a little so they aren't all stacked on top of each other. We can do this by adding a `position` argument to the `geom_point()` layer.
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, shape=Species)) + geom_point(position="jitter")
```
![jitter](plots/ggplot-20.png)
Alternatively, we can use the `geom_jitter` geometry. It works like `geom_point`, except that it moves each point by a small random amount to prevent overplotting. Compare the same plot without and with jitter.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length)) + geom_point()
```
![without geom_jitter](plots/ggplot-21.png)
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length)) + geom_jitter()
```
![with geom_jitter](plots/ggplot-22.png)
We can control the amount of jitter with the `width` argument.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length)) + geom_jitter(width=0.25)
```
![jitter width](plots/ggplot-23.png)
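By default `geom_jitter()` also nudges points a little vertically, which slightly changes the plotted measurements. If you only want to spread points sideways within each species, one option is to set `height = 0`. This is a sketch, assuming the built-in `iris` data frame as a stand-in for our file (`iris_df` and `p_jit` are our own names).

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Jitter horizontally only, so the y values stay exactly as measured.
p_jit <- ggplot(iris_df, aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.25, height = 0)
p_jit
```

The horizontal positions are random, so the plot will look slightly different each time you run it.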
You can customize all of the colors and shapes instead of leaving it default. In the resources section of the GitHub repo, you can find information on these more advanced topics.
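As a small taste of that customization, here is a sketch using `scale_color_manual()` and `scale_shape_manual()`. It assumes the built-in `iris` data frame as a stand-in for our file; the colors are borrowed from the violin plot earlier in this document and the shape codes are arbitrary choices.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Pick one color and one point shape per species, by name.
p_custom <- ggplot(iris_df, aes(x = Sepal.Length, y = Sepal.Width,
                                color = Species, shape = Species)) +
  geom_point() +
  scale_color_manual(values = c(setosa = "#3A9276",
                                versicolor = "#F14705",
                                virginica = "#4F68A1")) +
  scale_shape_manual(values = c(setosa = 16, versicolor = 17, virginica = 15))
p_custom
```

Naming the values by species means the assignment does not depend on the order of the factor levels.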
## Boxplot
For a basic boxplot, you can use `geom_boxplot()`.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length)) + geom_boxplot()
```
![boxplot](plots/ggplot-24.png)
We can also change the color of a boxplot.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) + geom_boxplot()
```
![color boxplot](plots/ggplot-25.png)
It outlined the boxes. But say we want to fill in the boxes instead. For this, we need to use the `fill` option.
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) + geom_boxplot()
```
![fill argument](plots/ggplot-26.png)
## Barplot
There are a few additional features on barplots. We'll start with a basic barplot.
```{r}
ggplot(iris, aes(Petal.Width)) + geom_bar()
```
![bar plot](plots/ggplot-27.png)
We get a bar for each distinct value of petal width, with the height showing how many plants have that value. Note that this probably isn't the best way to visualize this data, but I just want to give you an example of ways to customize a bar plot.
Bar plots have some additional functionality. For example, we can add an aesthetic to consider Species. This creates a stacked barplot.
```{r}
ggplot(iris, aes(Sepal.Length, fill=Species)) + geom_bar()
```
![barplot with fill](plots/ggplot-28.png)
Just like with the histogram, we can change how these bars are arranged relative to each other with the `position` argument, which we have to add to the `geom_bar()` call.
```{r}
ggplot(iris, aes(Sepal.Length, fill=Species)) + geom_bar(position = "dodge")
```
![dodge](plots/ggplot-29.png)
```{r}
ggplot(iris, aes(Sepal.Length, fill=Species)) + geom_bar(position = "fill")
```
![fill](plots/ggplot-30.png)
```{r}
ggplot(iris, aes(Sepal.Length, fill=Species)) + geom_bar(position = "stack")
```
![stack](plots/ggplot-31.png)
Notice how the bars change as we change the position.
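To see what `position = "fill"` is showing, it can help to compute the same proportions by hand. This sketch uses base R's `table()` and `prop.table()` and assumes the built-in `iris` data frame as a stand-in for our file (`iris_df`, `counts`, and `props` are our own names).

```{r}
# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Counts of each species at each sepal length: the heights used by
# position = "stack" and position = "dodge".
counts <- table(iris_df$Species, iris_df$Sepal.Length)

# Share of each species within each sepal length: the heights used by
# position = "fill", where every bar is rescaled to a total of 1.
props <- prop.table(counts, margin = 2)
round(props[, 1:5], 2)
```

Stacked and dodged bars show counts, while filled bars show proportions, so the y-axis means something different in each case.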
You can continue to add elements to the graph (e.g. changing the axes and adding titles) by adding lines with `+`.
Here are some basic elements you can add:
* `xlab(label)` changes the x-axis label
* `ylab(label)` changes the y-axis label
* `ggtitle(label, subtitle = NULL)` adds a plot title and an optional subtitle
* `theme()` can be used to change the background, remove the grid, and change the border
* `facet_grid()` divides a single graph into multiple graphs in a grid based on categorical data
For example, with the iris data you could draw a separate graph for each species by adding `facet_grid(. ~ Species)`:
```{r}
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Sepal.Width)) +
  geom_point() +
  facet_grid(. ~ Species) +
  # The iris measurements are recorded in centimeters.
  xlab('Sepal length (cm)') +
  ylab('Sepal width (cm)') +
  theme_classic()
```
![more advanced graph](plots/ggplot-32.png)
# More advanced options
Beyond the three essential elements (data, aesthetics, and geometries), ggplot2 makes plots more comprehensive by adding the following:
| Element | Description |
| ----------- | ------------------------------------------------- |
| Facets | Plotting small multiples. |
| Statistics | Representations of our data to aid understanding. |
| Coordinates | The space on which the data will be plotted. |
| Themes | All non-data ink. |
# Examples of Plots with these elements
## Facets
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() + facet_grid(.~Species)
```
![Facets](plots/ggplot-2.png)
## Statistics
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() + facet_grid(.~Species) + geom_smooth(method = "lm")
```
![Statistics](plots/ggplot-3.png)
## Coordinates
```{r}
ggplot(iris, aes(Sepal.Length, Sepal.Width, color=Species)) +
  geom_jitter() +
  coord_cartesian(xlim = c(4,6), ylim=c(2.5,4))
```
![Coordinates](plots/ggplot-4.png)
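`coord_cartesian()` zooms in on the plot without changing the data behind it. Setting limits on the scales instead, for example with `xlim()` and `ylim()`, treats points outside the limits as missing. The sketch below puts the two side by side; it assumes the built-in `iris` data frame as a stand-in for our file and uses `geom_point()` rather than `geom_jitter()` so the result is the same on every run. All object names are our own.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

base_plot <- ggplot(iris_df, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Zoom: all 150 points stay in the data; only the view is cropped.
p_zoom <- base_plot + coord_cartesian(xlim = c(4, 6), ylim = c(2.5, 4))

# Scale limits: points outside the limits are treated as missing, and
# ggplot2 warns about the removed rows when the plot is drawn.
p_limits <- base_plot + xlim(4, 6) + ylim(2.5, 4)

p_zoom
p_limits
```

The difference matters most when a layer computes something from the data, such as `geom_smooth()`, because dropped points no longer contribute to the fit.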
## Themes
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) +
  geom_boxplot(alpha=0.6, width=0.5) +
  theme_dark()
```
![Dark theme](plots/ggplot-5.png)
```{r}
ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) +
  geom_boxplot(alpha=0.6, width=0.5) +
  theme_light()
```
![Light theme](plots/ggplot-6.png)
**Note:** Pretty much anything on a plot that you would like to change can be changed. You can find numerous examples by googling what you want to change (e.g. google ‘remove background grid ggplot’). Additionally, there are resources to help you in our [resources document](https://github.com/gaiusjaugustus/intro-r-20170825/blob/master/resources/CheatSheetsAndResources.Rmd).
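As one example of that kind of tweak, removing the background grid can be sketched with `theme()` and `element_blank()`. This assumes the built-in `iris` data frame as a stand-in for our file (`iris_df` and `p_clean` are our own names), and starting from `theme_bw()` is just one possible choice.

```{r}
library(ggplot2)

# Assumption: the built-in dataset stands in for datasets/iris.txt.
iris_df <- datasets::iris

# Start from a built-in theme, then blank out the grid lines.
p_clean <- ggplot(iris_df, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot(alpha = 0.6, width = 0.5) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
p_clean
```

Put the `theme()` call after the built-in theme, since a complete theme such as `theme_bw()` would otherwise overwrite your changes.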