---
title: "cm005 Notes and Exercises"
date: "September 19, 2017"
output:
  html_document:
    toc: true
---
```{r}
library(tidyverse)
library(gapminder)
```
## 1. A quick look at data types in R
### 1.1 Overview
Data types are summarized nicely in [JB's small slideshow](https://speakerdeck.com/jennybc/simple-view-of-r-objects).
### 1.2 Investigation using R
We've seen that R understands numbers -- a _data type_. What other types of objects does R recognize? Let's identify them and test them with the `typeof` function. There are:
- `double`s (as we've seen -- they're numbers)
```{r}
typeof(5.6)
```
- `integer`s (a special type of number -- written with an `L` at the end)
```{r}
typeof(4L)
typeof(4)
```
- `logical`s are true/false values (known as `boolean`s in other languages). We highly recommend writing `TRUE`/`FALSE` rather than `T`/`F`: `T` and `F` are ordinary variables that can be reassigned, as we'll see later.
```{r}
typeof(TRUE)
typeof(T)
typeof(FALSE)
typeof(F)
```
- `character`s:
```{r}
typeof("Hi, my name is Vincenzo.")
```
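Returning to the `TRUE`/`FALSE` recommendation above, here is a minimal sketch of why `T` and `F` are risky. `T` is an ordinary variable that happens to hold `TRUE`, so it can be overwritten, whereas `TRUE` is a reserved word. The names `t_overwritten` and `t_restored` are only for illustration.

```{r}
T <- FALSE              # allowed: T is an ordinary variable, not a keyword
t_overwritten <- T      # keep a copy of the overwritten value
rm(T)                   # delete our copy so that T refers to base R's value again
t_restored <- T
c(t_overwritten, t_restored)
# TRUE <- FALSE         # would fail: TRUE is a reserved word
```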
__Exercise__: Find the data type of `"5"`. Explain the output.
```{r}
typeof("5")
```
__Exercise__: Use the `as.numeric` function to convert `"5"` to an object of type `"double"`.
```{r}
as.numeric("5")
```
__Exercise__: Describe this output:
```{r}
typeof(typeof(15.2))
# The inner typeof() returns the string "double", so the outer typeof() reports "character".
```
## 2. Data Frames in R
### 2.1 "Care and Feeding of Data": Exercises
Last class, we went through the [Care and Feeding of Data in R](http://stat545.com/block006_care-feeding-data.html) tutorial. Let's do some exercises, this time with the `iris` dataset (comes pre-loaded in the `datasets` package). Use any method you'd like to answer the following questions.
```{r}
library(datasets)
```
1. How many variables (columns) are in the `iris` dataset, and what are their names?
```{r}
dim(iris)    # number of rows, then number of columns
names(iris)  # the column (variable) names
```
A: There are 5 variables: `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.
2. How many rows are in the data set?
A: There are 150 rows.
3. What are the smallest values of each numeric variable?
```{r}
summary(iris)
```
4. Extract the `Petal.Width` column to get a vector of observations (we'll see vectors in more detail in a later class), and
(a) Make a histogram
(b) Make a table of frequencies
```{r}
x<-iris$Petal.Width
hist(x)
table(x)
```
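For question 3 above, `summary()` gives the answer but prints much more than the minimum. As one possible alternative (not part of the original exercise solutions), base R can pick out the numeric columns and take the minimum of each. `numeric_cols` and `col_mins` are illustrative names.

```{r}
numeric_cols <- sapply(iris, is.numeric)      # TRUE for numeric columns, FALSE for Species
col_mins <- sapply(iris[numeric_cols], min)   # named vector: one minimum per numeric column
col_mins
```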
### 2.2 `dplyr` fundamentals
Next, we'll learn about `dplyr`, a handy R package for manipulating data frames, through six main functions. The [R for Data Science](http://r4ds.had.co.nz/transform.html) book lists five of them (shown here in a different order):
> - Pick observations by their values (`filter()`).
> - Pick variables by their names (`select()`).
- (We'll sneak "piping" into here)
> - Reorder the rows (`arrange()`).
> - Create new variables with functions of existing variables (`mutate()`).
> - Collapse many values down to a single summary (`summarise()`).
Then there's `group_by()`, which can be used in conjunction with all of these.
We'll get through as much as we can in this class, and will continue in cm006, possibly with some more exercises and more features of `dplyr`.
Resources underpinning this section can be found in two parts: [part 1](http://stat545.com/block009_dplyr-intro.html) and [part 2](http://stat545.com/block010_dplyr-end-single-table.html).
#### `filter`
`filter` subsets data frames according to some logical expression.
Basic example:
```{r}
filter(gapminder, country=="Canada")
```
Logical expressions are built from __relational operators__, which output either `TRUE` or `FALSE` depending on whether the statement you give them holds. Here's a summary of the operators:
| Operation | Outputs `TRUE` or `FALSE` based on the validity of the statement... |
| ------ | ----- |
| `a == b` | `a` is equal to `b`. |
| `a != b` | `a` is not equal to `b`. |
| `a > b` | `a` is greater than `b`. |
| `a < b` | `a` is less than `b`. |
| `a >= b` | `a` is greater than or equal to `b`. |
| `a <= b` | `a` is less than or equal to `b`. |
| `a %in% b` | `a` is an element in `b`. |
Let's see some examples:
```{r}
5 == 3
c(1, 2, 3) < 3 # vectorized!
4 %in% c(1, 2, 3, 4, 5)
my_equality <- 5 == 3
print(my_equality)
```
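One pitfall is worth a quick sketch: `==` compares element by element and recycles the shorter vector, whereas `%in%` asks, for each element on the left, whether it appears anywhere on the right. The vector `countries` below is made up for illustration.

```{r}
countries <- c("Canada", "Algeria", "Canada", "Algeria")
# == pairs elements up position by position (recycling the right-hand side)
by_position <- countries == c("Algeria", "Canada")
# %in% tests membership, regardless of position
by_membership <- countries %in% c("Algeria", "Canada")
by_position
by_membership
```

This is why `%in%` is usually the safer choice when filtering on several values at once.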
There are __logical__ operators too, and they follow Boolean Algebra. They're listed below -- the first three are fundamental, but the others are useful too.
| Operation | Outputs `TRUE` or `FALSE` based on the validity of the statement... |
| ------ | ----- |
| `a & b`, `a && b` | Both `a` __and__ `b` are `TRUE` |
| `a | b`, `a || b` | `a` __or__ `b` (or both) is `TRUE`. |
| `!a` | `a` is __not__ `TRUE` (in other words, take the complement or "flip" `a`) |
| `xor(a, b)` | Either `a` or `b` is `TRUE`, but not both. |
| `all(a,b,c,...)` | `a`, `b`, `c`, ... are __all__ `TRUE`. |
| `any(a,b,c,...)` | __Any__ one of `a`, `b`, `c`, ... is `TRUE`. |
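
The single and double forms in the table are not interchangeable. As a rough illustration: `&` and `|` work element by element on vectors, while `&&` and `||` are meant for single `TRUE`/`FALSE` values (as in `if` conditions). `a` and `b` here are arbitrary example vectors.

```{r}
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)
a & b          # element-wise AND: one result per position
a | b          # element-wise OR
xor(a, b)      # TRUE where exactly one of the pair is TRUE
all(a)         # are all elements of a TRUE?
any(a)         # is at least one element of a TRUE?
a[1] && b[1]   # && expects length-one inputs
```

Inside `filter`, the conditions are whole columns, so the single forms `&` and `|` are the ones to use.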
We can `filter` by more than one condition:
```{r}
#filter(gapminder, country=="Canada" & year < 1970) # same as...
filter(gapminder, country=="Canada", year < 1970)
filter(gapminder, country=="Canada" | year == 1952)
```
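To check that separating conditions with a comma behaves like `&`, a small sketch can compare the two results directly. `with_and` and `with_comma` are illustrative names.

```{r}
library(dplyr)
library(gapminder)
with_and   <- filter(gapminder, country == "Canada" & year < 1970)
with_comma <- filter(gapminder, country == "Canada", year < 1970)
all.equal(with_and, with_comma)   # compares the two results
```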
From the Part 1 notes... never do the following!
> ```
> excerpt <- gapminder[241:252, ]
> ```
>
> Why is this a terrible idea?
>
> - It is not self-documenting. What is so special about rows 241 through 252?
> - It is fragile. This line of code will produce different results if someone changes the row order of `gapminder`, e.g. sorts the data earlier in the script.
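
A sketch of the self-documenting alternative: ask for rows by what they contain rather than by where they sit. This assumes, as in the unmodified `gapminder` data, that rows 241 through 252 are the Canada rows. `canada` and `resorted` are illustrative names.

```{r}
library(dplyr)
library(gapminder)
excerpt <- gapminder[241:252, ]                    # fragile: depends on row order
canada  <- filter(gapminder, country == "Canada")  # says what it means
# After re-sorting, the same row numbers point at different data
resorted <- arrange(gapminder, year)
unique(excerpt$country)
unique(resorted[241:252, ]$country)
```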
__Exercises__: Let's try some exercises using the `gapminder` data set.
1. Find all entries of Canada and Algeria.
```{r}
filter(gapminder,country== "Canada" | country=="Algeria")
```
2. Find all entries of Canada and Algeria occurring in the '60s.
```{r}
filter(gapminder, country=="Canada" | country=="Algeria",year %in% 1960:1969)
```
3. Find all entries of Canada, and entries of Algeria occurring in the '60s.
```{r}
filter(gapminder, country=="Canada" | (country=="Algeria" & year %in% 1960:1969))
```
4. Find all entries _not_ including European countries.
```{r}
filter(gapminder,continent != "Europe")
```
#### `select`
`select` subsets data by columns/variable names.
```{r}
select(gapminder, continent, country)
```
- Returns a tibble when given one, as here.
- Drop variables with `-`.
- Note that columns come back in the order you list them, so `select` also re-orders.
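
The last two points can be sketched as follows, using illustrative names `reordered` and `dropped`:

```{r}
library(dplyr)
library(gapminder)
reordered <- select(gapminder, year, country)      # columns come back in the order listed
dropped   <- select(gapminder, -pop, -gdpPercap)   # everything except the dropped columns
names(reordered)
names(dropped)
```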
#### piping
What if we wanted to do more than one operation? For example:
- take all entries of Canada and Algeria occurring in the '60s, and
- select the `country`, `year`, and `gdpPercap` columns.
We could do...
```{r}
select(filter(gapminder,
              country %in% c("Canada", "Algeria"),
              year <= 1969, year >= 1960),
       country, year, gdpPercap)
```
But the chain of functions can get quite long and hard to read.
The __pipe__ operator `%>%` feeds the output of a function into another function. Syntax:
```
gapminder %>%
    f1() %>%
    f2() %>%
    f3(options=something)
```
This says:
1. Start with the `gapminder` data, then
2. apply `f1` to it, then
3. apply `f2`, then
4. apply `f3` with the `options=something` argument.
The same operation above becomes:
```{r}
gapminder %>%
    filter(country %in% c("Canada", "Algeria"),
           year <= 1969, year >= 1960) %>%
    select(country, year, gdpPercap)
```
You can read this as:
1. start with the `gapminder` data, then
2. take all entries of Canada and Algeria occurring in the '60s (`filter`), then
3. select the `country`, `year`, and `gdpPercap` columns.
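
As a small self-contained check that the pipe is only a different way of writing nested calls, the sketch below computes the same quantity both ways on a made-up vector `x`.

```{r}
library(magrittr)   # provides %>% (dplyr and the tidyverse also make it available)
x <- c(1, 4, 9)
nested <- sum(sqrt(x))             # read inside-out
piped  <- x %>% sqrt() %>% sum()   # read left to right
nested == piped
```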
__Exercise__: Take all countries in Europe that have a GDP per capita greater than 10000, and select all variables except `gdpPercap`. (Hint: use `-`).
```{r}
gapminder %>%
    filter(continent == "Europe", gdpPercap > 10000) %>%
    select(-gdpPercap)
```
#### `arrange`
`arrange` sorts a data frame by reordering its rows according to the values of one or more variables. Wrap a variable in `desc` to sort it in descending order.
Order `gapminder` by population, then life expectancy:
```{r}
arrange(gapminder, pop, lifeExp)
```
__Exercises__:
1. Order the data frame by year, then descending by life expectancy.
```{r}
arrange(gapminder,year,desc(lifeExp))
```
2. In addition to the above exercise, rearrange the variables so that `year` comes first, followed by life expectancy. (Hint: check the documentation for the `select` function for a related handy function).
```{r}
gapminder %>%
    arrange(year, desc(lifeExp)) %>%
    select(year, lifeExp, everything())
```
#### `mutate`
`mutate` creates a new variable by calculating from other variables. Let's get GDP by multiplying GDP per capita by population:
```{r}
mutate(gapminder, gdp = gdpPercap * pop)
```
You can define multiple new variables in one call, and later ones can depend on earlier ones! Let's also create a column for GDP in billions, rounded to one decimal:
```{r}
mutate(gapminder,
       gdp = gdpPercap * pop,
       gdpBill = round(gdp/1000000000, 1))
```
`transmute` works the same way, but drops all other variables.
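A minimal sketch of `transmute`, reusing the GDP calculation from above (`gdp_only` is an illustrative name). Columns you want to keep unchanged have to be named in the call:

```{r}
library(dplyr)
library(gapminder)
gdp_only <- transmute(gapminder, country, year, gdp = gdpPercap * pop)
names(gdp_only)   # only the columns named in the call are kept
```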
__Exercise__: Make a new column called `cc` that pastes the country name followed by the continent, separated by a comma. (Hint: use the `paste` function with the `sep=", "` argument).
```{r}
mutate(gapminder,cc=paste(country,continent,sep=", "))
```
#### `summarize` and `group_by`
`summarize` collapses a tibble down to summary statistics.
```{r}
summarize(gapminder, mean_pop=mean(pop), sd_pop=sd(pop))
```
On its own, that gives just one row for the whole data set. Combined with the `group_by` function, `summarize` becomes much more useful:
```{r}
gapminder %>%
    group_by(country) %>%
    summarize(mean_pop=mean(pop), sd_pop=sd(pop))
```
The `group_by` function splits the tibble into parts -- in the above case, by country. Notice the "Groups" indicator in the following output:
```{r}
group_by(gapminder, continent, country, year < 1970)
```
Let's get a summary of this grouping:
```{r}
(out1 <- gapminder %>%
    group_by(continent, country, year < 1970) %>%
    summarize(mean_pop=mean(pop), sd_pop=sd(pop)))
```
Note that the output is still a tibble, but one "layer" of grouping has been peeled back: the `year < 1970` variable. `summarize` again and you'll get one row per continent and country, which shows that the tibble is no longer grouped by `year < 1970`.
```{r}
out1 %>%
    summarize(mean_pop=mean(mean_pop))
```
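The peeling can also be inspected directly. The sketch below uses `dplyr`'s `group_vars()` to list the grouping variables before and after a `summarize`, and `ungroup()` to remove whatever grouping is left. It uses a simpler grouping than the example above (no `year < 1970`), and `grouped` and `by_country` are illustrative names.

```{r}
library(dplyr)
library(gapminder)
grouped <- group_by(gapminder, continent, country)
group_vars(grouped)                # grouping variables before summarizing
by_country <- summarize(grouped, mean_pop = mean(pop))
group_vars(by_country)             # the last grouping variable has been dropped
group_vars(ungroup(by_country))    # ungroup() removes any remaining grouping
```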
__Exercise__: Find the minimum GDP per capita experienced by each country.
```{r}
gapminder %>%
    group_by(country) %>%
    summarize(min_gdpPercap = min(gdpPercap))  # name the summary column explicitly
```
__Exercise__: How many years of records does each country have?
```{r}
gapminder %>%
    group_by(country) %>%
    summarize(count=n())
```
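Counting rows per group is common enough that `dplyr` has a shorthand, `count()`. A sketch comparing the two approaches, where `via_summarize` and `via_count` are illustrative names and the summary column is called `n` to match what `count()` produces:

```{r}
library(dplyr)
library(gapminder)
via_summarize <- gapminder %>%
    group_by(country) %>%
    summarize(n = n())
via_count <- count(gapminder, country)   # shorthand for the group_by + summarize above
all.equal(via_summarize$n, via_count$n)
```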
__Exercise__: Within Asia, what are the min and max life expectancies experienced in each year?
```{r}
gapminder %>%
    filter(continent == "Asia") %>%
    group_by(year) %>%
    summarize(minexp = min(lifeExp), maxexp = max(lifeExp))
```