-
Notifications
You must be signed in to change notification settings - Fork 118
/
04_Subsetting.Rmd
executable file
·169 lines (117 loc) · 5.75 KB
/
04_Subsetting.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
```{r, include = FALSE}
source("common.R")
```
# Subsetting
<!-- 4 -->
\stepcounter{section}
## Selecting multiple elements
<!-- 4.2 -->
__[Q1]{.Q}__: Fix each of the following common data frame subsetting errors:
```{r, eval = FALSE}
mtcars[mtcars$cyl = 4, ]
# use `==` (instead of `=`)
mtcars[-1:4, ]
# use `-(1:4)` (instead of `-1:4`)
mtcars[mtcars$cyl <= 5]
# `,` is missing
mtcars[mtcars$cyl == 4 | 6, ]
# use `mtcars$cyl == 6` (instead of `6`)
# or `%in% c(4, 6)` (instead of `== 4 | 6`)
```
__[Q2]{.Q}__: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)
```{r}
x <- 1:5
x[NA]
```
__[A]{.solved}__: In contrast to `NA_real`, `NA` has logical type and logical vectors are recycled to the same length as the vector being subset, i.e. `x[NA]` is recycled to `x[c(NA, NA, NA, NA, NA)]`.
__[Q3]{.Q}__: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?
```{r echo = TRUE, results = 'hide'}
x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]
```
__[A]{.solved}__: `upper.tri(x)` returns a logical matrix, which contains `TRUE` values above the diagonal and `FALSE` values everywhere else. In `upper.tri()` the positions for `TRUE` and `FALSE` values are determined by comparing `x`'s row and column indices via `.row(dim(x)) < .col(dim(x))`.
```{r}
x
upper.tri(x)
```
When subsetting with logical matrices, all elements that correspond to `TRUE` will be selected. Matrices extend vectors with a dimension attribute, so the vector forms of subsetting can be used (including logical subsetting). We should take care, that the dimensions of the subsetting matrix match the object of interest — otherwise unintended selections due to vector recycling may occur. Please also note, that this form of subsetting returns a vector instead of a matrix, as the subsetting alters the dimensions of the object.
```{r}
x[upper.tri(x)]
```
__[Q4]{.Q}__: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?
__[A]{.solved}__: When subsetting a data frame with a single vector, it behaves the same way as subsetting a list of columns. So, `mtcars[1:20]` would return a data frame containing the first 20 columns of the dataset. However, as `mtcars` has only 11 columns, the index will be out of bounds and an error is thrown. `mtcars[1:20, ]` is subsetted with two vectors, so 2d subsetting kicks in, and the first index refers to rows.
__[Q5]{.Q}__: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).
__[A]{.solved}__: The elements in the diagonal of a matrix have the same row- and column indices. This characteristic can be used to create a suitable numeric matrix used for subsetting.
```{r}
diag2 <- function(x) {
n <- min(nrow(x), ncol(x))
idx <- cbind(seq_len(n), seq_len(n))
x[idx]
}
# Let's check if it works
(x <- matrix(1:30, 5))
diag(x)
diag2(x)
```
__[Q6]{.Q}__: What does `df[is.na(df)] <- 0` do? How does it work?
__[A]{.solved}__: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
## Selecting a single element
<!-- 4.3 -->
__[Q1]{.Q}__: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.
__[A]{.solved}__: Base R already provides an abundance of possibilities:
```{r}
# Select column first
mtcars$cyl[[3]]
mtcars[ , "cyl"][[3]]
mtcars[["cyl"]][[3]]
with(mtcars, cyl[[3]])
# Select row first
mtcars[3, ]$cyl
mtcars[3, "cyl"]
mtcars[3, ][ , "cyl"]
mtcars[3, ][["cyl"]]
# Select simultaneously
mtcars[3, 2]
mtcars[[c(2, 3)]]
```
__[Q2]{.Q}__: Given a linear model, e.g. `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).
__[A]{.solved}__: `mod` is of type list, which opens up several possibilities. We use `$` or `[[` to extract a single element:
```{r}
mod <- lm(mpg ~ wt, data = mtcars)
mod$df.residual
mod[["df.residual"]]
```
The same also applies to `summary(mod)`, so we could use, e.g.:
```{r}
summary(mod)$r.squared
```
(Tip: The [`{broom}` package](https://github.com/tidymodels/broom) [@broom] provides a very useful approach to work with models in a tidy way.)
\stepcounter{section}
## Applications
<!-- 4.5 -->
__[Q1]{.Q}__: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
__[A]{.solved}__: This can be achieved by combining `[` and `sample()`:
```{r,eval = FALSE}
# Permute columns
mtcars[sample(ncol(mtcars))]
# Permute columns and rows in one step
mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
```
__[Q2]{.Q}__: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e. with an initial row, a final row, and every row in between)?
__[A]{.solved}__: Selecting `m` random rows from a data frame can be achieved through subsetting.
```{r, eval = FALSE}
m <- 10
mtcars[sample(nrow(mtcars), m), ]
```
Holding successive lines together as a blocked sample requires only a certain amount of caution in order to obtain the correct start and end index.
```{r, eval = FALSE}
start <- sample(nrow(mtcars) - m + 1, 1)
end <- start + m - 1
mtcars[start:end, , drop = FALSE]
```
__[Q3]{.Q}__: How could you put the columns in a data frame in alphabetical order?
__[A]{.solved}__: We combine `[` with `order()` or `sort()`:
```{r, eval = FALSE}
mtcars[order(names(mtcars))]
mtcars[sort(names(mtcars))]
```