forked from datacarpentry/R-ecology-lesson
---
title: "The `data.frame` class"
author: "Data Carpentry contributors"
minutes: 30
output: pdf_document
layout: topic
---
```{r, echo=FALSE, purl=FALSE, message = FALSE}
source("setup.R")
temperature <- read.csv("data/temperature.csv")
```
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
------------
> ## Learning Objectives
>
> * understand the concept of a `data.frame`
> * use sequences
> * know how to access any element of a `data.frame`
------------
## What are data frames?
Data frames are the _de facto_ data structure for most tabular data, and what we
use for statistics and plotting.
A data frame is a collection of vectors of identical lengths. Each vector
represents a column, and each vector can be of a different data type (e.g.,
characters, integers, factors). The `str()` function is useful to inspect the
data types of the columns.
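As a minimal sketch of this idea (the vector and column names below are made up for illustration), a data frame can be assembled from vectors of equal length, and `str()` then reports one type per column:

```{r, purl=FALSE}
## Three hypothetical vectors of the same length but different types
site <- c("A", "B", "C")          # character
count <- c(12L, 7L, 30L)          # integer
mean_temp <- c(14.2, 9.8, 21.5)   # numeric (double)

## Each vector becomes one column
sites <- data.frame(site, count, mean_temp, stringsAsFactors = FALSE)
str(sites)

## Vectors of incompatible lengths (here 3 and 2) are rejected with an error
result <- try(data.frame(site, count = c(12L, 7L)), silent = TRUE)
inherits(result, "try-error")
```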
Data frames can be created by hand, but most commonly they are generated by the
functions `read.csv()` or `read.table()`; in other words, when importing
spreadsheets from your hard drive (or the web).
In versions of R before 4.0.0, when building or importing a data frame, the
columns that contain characters (i.e., text) are by default coerced
(= converted) into the `factor` data type; since R 4.0.0 they are kept as
`character` by default. Depending on what you want to do with the data, you may
want to control this explicitly. To do so, `read.csv()` and `read.table()` have
an argument called `stringsAsFactors`, which can be set to `FALSE` (or `TRUE`):
```{r, eval=FALSE, purl=FALSE}
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
```
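The effect of this argument can be tried without a file on disk. The sketch below is only an illustration: it passes a small, made-up CSV through the `text` argument of `read.csv()` and sets `stringsAsFactors` explicitly in both calls, so the result does not depend on the default of your R version:

```{r, purl=FALSE}
## A tiny, hypothetical CSV held in a character string
csv_text <- "city,month,temp\nOslo,Jan,-4.3\nLima,Jan,23.1"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$city)   # "factor"
class(as_char$city)     # "character"
class(as_char$temp)     # "numeric": number columns are not affected
```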
You can also create a data frame manually with the function `data.frame()`. This
function also takes the argument `stringsAsFactors`. Compare the output of the
two examples below and note the difference between the text columns when they
are stored as `factor` and when they are stored as `character`.
```{r, results='show', purl=TRUE}
## Compare the output of these examples, and compare the difference between when
## the data are being read as `character`, and when they are being read as
## `factor`.
## stringsAsFactors is set explicitly so the result is the same in any R version
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8),
                           stringsAsFactors=TRUE)
str(example_data)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8),
                           stringsAsFactors=FALSE)
str(example_data)
```
### Challenge
1. There are a few mistakes in this hand-crafted `data.frame`. Can you spot and
fix them? Don't hesitate to experiment!
```{r, eval=FALSE, purl=FALSE}
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge
## There are a few mistakes in this hand crafted `data.frame`,
## can you spot and fix them? Don't hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
1. Can you predict the class for each of the columns in the following example?
Check your guesses using `str(country_climate)`:
* Are they what you expected? Why? Why not?
* What would have been different if we had added `stringsAsFactors = FALSE` to this call?
* What would you need to change to ensure that each column had the accurate data type?
```{r, eval=FALSE, purl=FALSE}
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
```{r, eval=FALSE, purl=TRUE, echo=FALSE}
## Challenge:
## Can you predict the class for each of the columns in the following
## example?
## Check your guesses using `str(country_climate)`:
## * Are they what you expected? Why? why not?
## * What would have been different if we had added `stringsAsFactors = FALSE`
## to this call?
## * What would you need to change to ensure that each column had the
## accurate data type?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
```{r, eval=FALSE, purl=FALSE}
## Answers
## * missing quotations around the last names of the authors
## * the year column is missing one value, 1859 (the year of publication of
## On the Origin of Species)
```
```{r, eval=FALSE, purl=FALSE}
## Answers
## * `country`, `climate`, `temperature`, and `northern_hemisphere` are
## factors (with `stringsAsFactors=TRUE`, the default before R 4.0.0);
## `has_kangaroo` is numeric.
## * using `stringsAsFactors=FALSE` would have made them character instead of
## factors
## * removing the quotes in temperature, northern_hemisphere, and replacing 1
## by TRUE in the `has_kangaroo` column would probably give what was
## originally intended.
```
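These answers follow from how `c()` coerces mixed values to a single type before `data.frame()` ever sees them. A short sketch of that rule, using the same values as the challenge:

```{r, purl=FALSE}
## One quoted value turns the whole vector into character
class(c(10, 30, 18, "15"))
class(c(TRUE, TRUE, FALSE, "FALSE"))

## Mixing logical and numeric values gives numeric: FALSE becomes 0
has_kangaroo <- c(FALSE, FALSE, FALSE, 1)
class(has_kangaroo)
has_kangaroo
```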
The automatic conversion of data types is sometimes a blessing, sometimes an
annoyance. Be aware that it exists, learn the rules, and double-check that the
data you import into R are of the correct type within your data frame. If they
are not, use this to your advantage to detect mistakes that might have been
introduced during data entry (for instance, a letter in a column that should
only contain numbers).
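One way to use this behaviour for error checking is sketched below. The data are invented for illustration: a weight column contains a typo, so it is read as `character`, and converting it with `as.numeric()` reveals which entry is at fault:

```{r, purl=FALSE}
## Hypothetical data with a data-entry mistake ("1O" uses the letter O)
csv_text <- "id,weight\n1,45\n2,1O\n3,8"
weights <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(weights$weight)   # character, not numeric: a sign that something is off

## Entries that cannot be converted become NA (with a warning we silence here)
converted <- suppressWarnings(as.numeric(weights$weight))
weights[is.na(converted), ]
```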
## Inspecting `data.frame` Objects
We already saw how the functions `head()` and `str()` can be useful to check the
content and the structure of a `data.frame`. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data.
* Size:
    * `dim()` - returns a vector with the number of rows in the first element,
      and the number of columns as the second element (the __dim__ensions of the
      object)
    * `nrow()` - returns the number of rows
    * `ncol()` - returns the number of columns
* Content:
    * `head()` - shows the first 6 rows
    * `tail()` - shows the last 6 rows
* Names:
    * `names()` - returns the column names (synonym of `colnames()` for `data.frame`
      objects)
    * `rownames()` - returns the row names
* Summary:
    * `str()` - structure of the object and information about the class, length and
      content of each column
    * `summary()` - summary statistics for each column
Note: most of these functions are "generic"; they can be used on other types of
objects besides `data.frame`.
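As a quick illustration, here are these functions applied to a small, made-up data frame (its name and columns are assumptions for this sketch, not part of the lesson data):

```{r, purl=FALSE}
readings <- data.frame(city = c("Oslo", "Lima", "Cairo", "Perth"),
                       temp = c(-4.3, 23.1, 14.0, 24.6),
                       stringsAsFactors = FALSE)

dim(readings)       # number of rows, then number of columns
nrow(readings)
ncol(readings)
names(readings)
rownames(readings)  # default row names are "1", "2", ...
head(readings, 2)   # the second argument changes how many rows are shown
summary(readings)
```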
## Indexing, Sequences, and Subsetting
```{r, echo=FALSE, purl=TRUE}
## Indexing, Sequences, and Subsetting
```
If we want to extract one or several values from a vector, we must provide one
or several indices in square brackets. For instance:
```{r, results='show', purl=FALSE}
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
animals[c(3, 2)]
animals[2:4]
more_animals <- animals[c(1:3, 2:4)]
more_animals
```
R indexes start at 1. Programming languages like Fortran, MATLAB, and R start
counting at 1, because that's what human beings typically do. Languages in the C
family (including C++, Java, Perl, and Python) count from 0 because that's
simpler for computers to do.
`:` is a special function that creates numeric vectors of integers in increasing
or decreasing order; try `1:10` and `10:1`, for instance. The function `seq()`
(for __seq__uence) can be used to create more complex patterns:
```{r, results='show', purl=FALSE}
seq(1, 10, by=2)
seq(5, 10, length.out=3)
seq(50, by=5, length.out=10)
seq(1, 8, by=3) # sequence stops to stay below upper limit
```
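A short sketch comparing `:` with a few related calls; `rev()` and `seq_len()` are base R functions that are not otherwise used in this lesson:

```{r, purl=FALSE}
10:1                           # decreasing sequence
identical(10:1, rev(1:10))     # same result as reversing 1:10
seq(10, 1, by = -3)            # a negative `by` counts down: 10, 7, 4, 1
seq_len(4)                     # 1, 2, 3, 4
```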
Our temperature data frame has rows and columns (it has 2 dimensions). If we
want to extract some specific data from it, we need to specify the "coordinates"
we want from it. Row numbers come first, followed by column numbers.
```{r, purl=FALSE}
temperature[1]       # first column in the data frame (returned as a data frame)
temperature[1, 1]    # first element in the first column of the data frame
temperature[1, 6]    # first element in the 6th column
temperature[1:3, 7]  # first three elements in the 7th column
temperature[3, ]     # the 3rd row, with all columns
temperature[, 8]     # the entire 8th column
head_temperature <- temperature[1:6, ] # temperature[1:6, ] is equivalent to head(temperature)
```
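The difference between the one-index and two-index forms is easy to miss. The sketch below uses a small made-up data frame (not the lesson's `temperature` data) to show what kind of object each form returns:

```{r, purl=FALSE}
readings <- data.frame(city = c("Oslo", "Lima", "Cairo"),
                       temp = c(-4.3, 23.1, 14.0),
                       stringsAsFactors = FALSE)

class(readings[1])       # a data frame with one column
class(readings[, 1])     # a plain vector (here, character)
class(readings[2, ])     # a data frame with one row
readings[2, 2]           # a single value
readings[c(1, 3), ]      # rows 1 and 3, all columns
```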
As well as using numeric values to subset a `data.frame` (or `matrix`), columns
can be called by name, using one of the three following notations:
```{r, eval = FALSE, purl=FALSE}
temperature[, "City"]
temperature[["City"]]
temperature$City
```
For our purposes, these three notations are equivalent. However, the last one,
with the `$`, does partial matching on the name, so you could also select the
column `"month"` by typing `temperature$m`. This is a shortcut and, like all
shortcuts, it can have dangerous consequences and is best avoided. Besides, with
auto-completion in RStudio, you rarely have to type more than a few characters
to get the full and correct column name.
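A small sketch of why partial matching is risky, using a made-up data frame whose column names are assumptions for illustration. `$` silently completes an unambiguous prefix, while `[[` requires the exact name:

```{r, purl=FALSE}
readings <- data.frame(month = c("Jan", "Feb"),
                       temp = c(-4.3, -3.1),
                       stringsAsFactors = FALSE)

readings$m          # partial match: returns the `month` column
readings[["m"]]     # exact match required: returns NULL

## If a second column starting with "m" is added, the prefix becomes ambiguous
readings$max_temp <- c(1.2, 2.5)
readings$m          # now NULL: code that relied on the shortcut breaks
```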
### Challenge
1. The function `nrow()` on a `data.frame` returns the number of rows. Use it,
in conjunction with `seq()`, to create a new `data.frame` called
`temperature_by_10` that includes every 10th row of the temperature data frame
starting at row 10 (10, 20, 30, ...)
```{r, purl=TRUE}
### The function `nrow()` on a `data.frame` returns the number of
### rows. Use it, in conjunction with `seq()`, to create a new
### `data.frame` called `temperature_by_10` that includes every 10th row
### of the temperature data frame starting at row 10 (10, 20, 30, ...)
```
```{r, purl=FALSE}
## Answer
temperature_by_10 <- temperature[seq(10, nrow(temperature), by=10), ]
```
### Conditional subsetting
Besides using the index position of an element in a vector to extract its value
as we saw earlier, we can also use logical vectors:
```{r, purl=FALSE}
animals <- c("mouse", "rat", "dog", "cat")
animals[c(TRUE, FALSE, TRUE, TRUE)]
```
Typically, however, those logical vectors are not typed by hand but are the
result of a logical test:
```{r, purl=FALSE}
animals != "rat"
animals[animals != "rat"]
animals[animals == "cat"]
```
You can combine multiple tests using `&` (both conditions are true, AND) or `|`
(at least one of the conditions is true, OR):
```{r, purl=FALSE}
animals[animals == "cat" & animals == "rat"] # returns nothing
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
```
If you are trying to combine many conditions, it can become tedious to type. The
operator `%in%` allows you to test whether a value is found in a vector:
```{r, purl=FALSE}
animals %in% c("rat", "cat", "dog", "duck")
animals[animals %in% c("rat", "cat", "dog", "duck")]
```
In addition to testing equalities, you can also test whether the elements of
your vector are less than or greater than a given value:
```{r, purl=FALSE}
dates <- c(1960, 1963, 1974, 2015, 2016)
dates >= 1974
dates[dates >= 1974]
dates[dates > 1970 & dates <= 2015]
dates[dates < 1975 | dates > 2016]
```
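The same logical tests work on data frames: a logical vector placed in the row position keeps the rows where it is `TRUE`. A minimal sketch with a made-up data frame (names and values are illustrative assumptions):

```{r, purl=FALSE}
readings <- data.frame(city = c("Oslo", "Lima", "Cairo", "Perth"),
                       temp = c(-4.3, 23.1, 14.0, 24.6),
                       stringsAsFactors = FALSE)

readings[readings$temp > 20, ]                        # rows warmer than 20
readings[readings$city %in% c("Oslo", "Cairo"), ]     # rows for selected cities
readings[readings$temp > 0 & readings$temp < 20, ]    # both conditions must hold
```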
> ### Challenge {.challenge}
>
> * Can you figure out why `"four" > "five"` returns `TRUE`?
```{r, purl=TRUE}
# * Can you figure out why `"four" > "five"` returns `TRUE`?
```
```{r, purl=FALSE}
## Answers
## * When using ">" or "<" on strings, R compares their alphabetical order. Here
## "four" comes after "five", and therefore is "greater than" it.
```
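To see the ordering R uses, you can sort the strings. This is only a sketch: the exact ordering of strings can depend on your locale, although these simple lowercase words should sort the same way in common locales:

```{r, purl=FALSE}
"four" > "five"                    # TRUE: "o" sorts after "i"
sort(c("four", "five", "one"))     # alphabetical order
"10" > "9"                         # FALSE: characters are compared, not numbers
```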