---
title: "Introduction to Data Frames"
subtitle: "Stat 133"
author: "Gaston Sanchez"
output: github_document
fontsize: 11pt
urlcolor: blue
---
> ### Learning Objectives
>
> - Understand Data Frames
> - Basic Manipulation with brackets `[ , ]`
------
## Manipulating Data Frames
The most common format for a data set is tabular: rows and columns
(like a spreadsheet). When your data is in this shape, most of the time you
will work with R __data frames__ (or similar rectangular structures like a
`"matrix"`, `"table"`, etc.).

Learning how to manipulate data tables is among the most important basic
_data computing_ skills. The traditional way of manipulating data frames
in R is based on bracket notation, e.g. `dat[ , ]`, to select specific
rows, columns, or cells. The dollar operator `$` is also fundamental for
handling columns. In this part of the lab, you will practice a wide
array of data wrangling tasks with the so-called bracket notation and the
dollar operator.

I should say that there are alternative ways of manipulating tables in R.
Among the most recent paradigms, there is the __plying__ framework devised
by Hadley Wickham. From his doctoral research, the first _plyr_ tools were
available in the packages `"plyr"` and `"reshape"`. Nowadays we have the
`"reshape2"` package, and the extremely popular package `"dplyr"`
(among other packages). You will have time to learn more about `"dplyr"` in the
coming weeks. In the meantime, take some time to understand more about the
bracket notation.
## R Data Frames
A data frame is a special type of R list, in which each column is an R vector
(or a factor).
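
To make the connection with lists concrete, here is a minimal sketch that checks how R stores a data frame. The object `toy_df` and its columns are made up for this illustration only.

```{r}
# a small made-up data frame (names are illustrative only)
toy_df <- data.frame(
  name = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5)
)

# a data frame is stored as a list ...
is.list(toy_df)
typeof(toy_df)

# ... with one element per column
length(toy_df)
class(toy_df$value)
```

Because each column is an element of the list, tools that work on lists (such as `length()` or `$`) also work on data frames.
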
When working with data frames, you should always spend some time inspecting
the contents, and checking how R is handling the data types. It is in these
early stages of data exploration that you can catch potential issues and
avoid disastrous consequences or bugs in subsequent stages.

A convenient function for this inspection is `str()`. What `str()` returns is
a display of the dimensions of the data frame, followed by a list with the
names of all the variables and their data types (e.g. `chr` character, `num`
real, etc.). The argument `vec.len = 1` asks `str()` to display only about the
first element of each column (the exact number shown depends on the column's
type).
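
As a minimal sketch of this kind of inspection, the chunk below builds a small made-up data frame (`str_demo` is a hypothetical name, not one of the lab's data sets) and looks at it with `str()` and two related helpers.

```{r}
# a small made-up data frame (illustrative only)
str_demo <- data.frame(
  id = 1:4,
  score = c(9.5, 7.2, 8.8, 6.1),
  label = c("w", "x", "y", "z"),
  stringsAsFactors = FALSE
)

# compact display of dimensions, column names, and data types
str(str_demo, vec.len = 1)

# related inspection helpers
dim(str_demo)
names(str_demo)
```

Try changing `vec.len` to see how much of each column `str()` prints.
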
## Creating data frames
Most of the (raw) data tables you will be working with will already be in
some data file. However, from time to time you will need to create
some sort of data table in R. In these situations, you will likely
build the table as a data frame. So let's look at various ways to
"manually" create a data frame.
__Option 1__: The primary option to build a data frame is with `data.frame()`.
You pass a series of vectors (or factors), of the same length, separated by commas.
Each vector (or factor) will become a column in the generated data frame.
Preferably, give names to each column, like `col1`, `col2`, and `col3`, in the
example below:
```{r create_data_frame1}
# creating a basic data frame
my_table1 <- data.frame(
  col1 = LETTERS[1:5],
  col2 = seq(from = 10, to = 50, by = 10),
  col3 = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)
my_table1
```
__Option 2__: Another way to create data frames is with a `list` containing
vectors or factors (of the same length), which you then convert to a data frame
with `data.frame()`:
```{r create_data_frame2}
# another way to create a basic data frame
my_list <- list(
  col1 = LETTERS[1:5],
  col2 = seq(from = 10, to = 50, by = 10),
  col3 = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)
my_table2 <- data.frame(my_list)
my_table2
```
Remember that a `data.frame` is nothing more than a `list`. So as long as the
elements in the list (vectors or factors) are of the same length, we can simply
convert the list into a data frame.
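
The length requirement can be checked with a short sketch. The lists `ok_list` and `bad_list` are made up for this example; the second one mixes elements of lengths 3 and 2, which cannot be lined up into rows.

```{r}
# a list whose elements have the same length converts cleanly
ok_list <- list(x = 1:3, y = c("a", "b", "c"))
ok_df <- as.data.frame(ok_list)
dim(ok_df)

# elements of lengths 3 and 2 cannot be lined up into rows,
# so data.frame() signals an error, which we capture as a message
bad_list <- list(x = 1:3, y = c("a", "b"))
bad_result <- tryCatch(
  data.frame(bad_list),
  error = function(e) conditionMessage(e)
)
bad_result
```

Wrapping the call in `tryCatch()` is only a convenience here, so that the document still knits when the conversion fails.
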
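In versions of R before 4.0.0, `data.frame()` converted character vectors into
factors by default; since R 4.0.0 the default is `stringsAsFactors = FALSE`, so
strings stay as character vectors. You can check how your version behaves by
examining the structure of the data frame with `str()`: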
```{r}
str(my_table2)
```
To make sure `data.frame()` does not convert strings into factors, regardless
of the R version, set the argument `stringsAsFactors = FALSE` explicitly:
```{r}
# strings as strings (not as factors)
my_table3 <- data.frame(
  col1 = LETTERS[1:5],
  col2 = seq(from = 10, to = 50, by = 10),
  col3 = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
str(my_table3)
```
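
As a quick sketch of the difference, the chunk below sets `stringsAsFactors` explicitly both ways and compares the resulting column classes. The names `letters_vec`, `chr_df`, and `fct_df` are invented for this illustration.

```{r}
letters_vec <- c("A", "B", "C")

# explicit choice: keep strings as character
chr_df <- data.frame(col1 = letters_vec, stringsAsFactors = FALSE)
# explicit choice: convert strings to factor
fct_df <- data.frame(col1 = letters_vec, stringsAsFactors = TRUE)

class(chr_df$col1)
class(fct_df$col1)

# a factor column can be turned back into character
fct_df$col1 <- as.character(fct_df$col1)
class(fct_df$col1)
```

Setting the argument explicitly makes the code behave the same way on any R version.
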
## Basic Operations with Data Frames
Now that you have seen some ways to create data frames, let's discuss a number
of basic manipulations of data frames. I will show you some examples and then
you'll have the chance to put into practice the following operations:

- Selecting table elements:
    + select a given cell
    + select a set of cells
    + select a given row
    + select a set of rows
    + select a given column
    + select a set of columns
- Adding a new column
- Deleting a column
- Renaming a column
- Moving a column
- Transforming a column

```{r echo = FALSE}
tbl <- data.frame(
  player = c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia'),
  position = c('SG', 'PG', 'PF', 'SF', 'C'),
  salary = c(16663575, 12112359, 15330435, 26540100, 2898000),
  points = c(1742, 1999, 776, 1555, 426),
  ppg = c(22.3, 25.3, 10.2, 25.1, 6.1),
  rookie = rep(FALSE, 5),
  stringsAsFactors = FALSE
)
```
Let's say you have a data frame `tbl` with the lineup of the Golden State Warriors:
```{r, echo = FALSE, comment = ""}
tbl
```
### Selecting elements
The data frame `tbl` is a 2-dimensional object: the 1st dimension corresponds
to the rows, while the 2nd dimension corresponds to the columns.
Because `tbl` has two dimensions, the bracket notation involves
working with the data frame in this form: `tbl[ , ]`.
In other words, you have to specify values inside the
brackets for the 1st index, and the 2nd index: `tbl[index1, index2]`.
```{r}
# select value in row 1 and column 1
tbl[1,1]
# select value in row 2 and column 5
tbl[2,5]
# select values in these cells
tbl[1:3,3:5]
```
If no value is specified for `index1` then all rows are included. Likewise,
if no value is specified for `index2` then all columns are included.
```{r}
# selecting first row
tbl[1, ]
# selecting third row
tbl[3, ]
# selecting second column
tbl[ ,2]
# selecting columns 3 to 5
tbl[ ,3:5]
```
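
The indices inside the brackets do not have to be positions. The sketch below, which uses a made-up data frame `sel_df` so that `tbl` is left alone, selects columns by name and rows by a logical condition. It also shows that a single column comes back as a plain vector unless you ask otherwise with `drop = FALSE`.

```{r}
sel_df <- data.frame(
  player = c("A", "B", "C"),
  points = c(10, 25, 7),
  games = c(5, 6, 4),
  stringsAsFactors = FALSE
)

# columns by name instead of position
sel_df[ , c("player", "points")]

# rows that satisfy a condition
sel_df[sel_df$points > 8, ]

# a single column comes back as a vector ...
is.data.frame(sel_df[ , "points"])

# ... unless drop = FALSE keeps it as a one-column data frame
is.data.frame(sel_df[ , "points", drop = FALSE])
```

Selecting by name tends to be more robust than selecting by position, since it keeps working when columns are added or reordered.
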
### Adding a column
Perhaps the simplest way to add a column is with the dollar operator `$`.
You just need to give a name for the new column, and assign a vector (or factor):
```{r}
# adding a column
tbl$new_column <- c('a', 'e', 'i', 'o', 'u')
tbl
```
Another way to add a column is with the _column binding_ function `cbind()`:
```{r}
# vector of weights
weight <- c(215, 190, 230, 240, 270)
# adding weights to tbl
tbl <- cbind(tbl, weight)
tbl
```
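
Two small variations are sketched below on a made-up data frame `add_df` (not part of the lab data): naming the new column directly in the `cbind()` call, and adding a column with bracket notation.

```{r}
add_df <- data.frame(id = 1:3, score = c(80, 95, 70))

# cbind() lets you name the new column in the call
add_df <- cbind(add_df, height = c(170, 182, 165))

# bracket assignment with a new column name also works
add_df[ , "passed"] <- add_df$score >= 75

names(add_df)
```

In both cases the new vector should have one value per row of the data frame.
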
### Deleting a column
The inverse operation of adding a column consists of __deleting__ a column.
This is possible with the `$` dollar operator. For instance, say you want to
remove the column `new_column`. Use the `$` operator to select this column,
and assign it the value `NULL` (think of this as _NULLifying_ a column):
```{r}
# deleting a column
tbl$new_column <- NULL
tbl
```
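
Bracket notation offers another route: instead of removing a column, keep all the others. This is only a sketch on a made-up data frame `del_df`; unlike the `NULL` assignment, it returns a new data frame rather than changing the existing one.

```{r}
del_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# drop column "b" by keeping every other column name
keep_cols <- setdiff(names(del_df), "b")
del_df2 <- del_df[ , keep_cols]
names(del_df2)

# drop by position with a negative index (here, the first column)
del_df3 <- del_df[ , -1]
names(del_df3)
```
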
### Renaming a column
What if you want to rename a column? There are various options to do this.
One way is by changing the column `names` attribute:
```{r}
# attributes
attributes(tbl)
```
which is more commonly accessed with the `names()` function:
```{r}
# column names
names(tbl)
```
Notice that `tbl` has a list of attributes. The element `names` is the vector
of column names.
You can directly modify the vector of `names`; for example let's change
`rookie` to `rooky`:
```{r}
# changing rookie to rooky
attributes(tbl)$names[6] <- "rooky"
# display column names
names(tbl)
```
By the way, this way of changing a column name is very low level, and probably
unfamiliar to most useRs.
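
A more familiar idiom is to assign through `names()`. The sketch below uses a made-up data frame `ren_df`, and the new column names are arbitrary choices for illustration.

```{r}
ren_df <- data.frame(player = c("A", "B"), rookie = c(FALSE, TRUE))

# rename by position
names(ren_df)[2] <- "is_rookie"

# rename by matching the old name, which survives column reordering
names(ren_df)[names(ren_df) == "player"] <- "name"

names(ren_df)
```

Matching on the old name avoids having to remember the position of the column.
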
### Moving a column
A more challenging operation is when you want to move a column to a different
position. What if you want to move `salary` to the last position (last column)?
One option is to create a vector of column names in the desired order, and then
use this vector (for the index of columns) to reassign the data frame like this:
```{r}
reordered_names <- c("player", "position", "points", "ppg", "rooky", "weight", "salary")
# moving salary to the end
tbl <- tbl[ ,reordered_names]
tbl
```
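
When a table has many columns, typing every name gets tedious. One possible shortcut, sketched here on a made-up data frame `mov_df`, is to build the new order with `setdiff()`.

```{r}
mov_df <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# move column "b" to the last position without typing every name
new_order <- c(setdiff(names(mov_df), "b"), "b")
mov_df <- mov_df[ , new_order]
names(mov_df)
```

`setdiff()` returns the remaining names in their original order, so only the moved column changes position.
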
### Transforming a column
Transforming the values in a column is a more common operation than deleting
or moving a column. This can be easily accomplished with the `$` operator.
For instance, let's say that we want to transform `salary` from dollars to
millions of dollars:
```{r}
# converting salary to millions of dollars
tbl$salary <- tbl$salary / 1000000
tbl
```
Likewise, instead of using the `$` operator, you can refer to the column using
bracket notation. Here's how to transform weight from pounds to kilograms
(1 pound = 0.453592 kilograms):
```{r}
# weight in kilograms
tbl[ ,"weight"] <- tbl[ ,"weight"] * 0.453592
tbl
```
There is also the `transform()` function, which returns a modified copy of the
data frame and leaves the original untouched, so the change is only temporary:
```{r}
# transform weight back to pounds (result is not saved)
transform(tbl, weight = weight / 0.453592)
```
`transform()` does its job of modifying the values of `weight` but only
temporarily; if you inspect `tbl` you'll see what this means:
```{r}
# did weight really change?
tbl
```
To make the changes permanent with `transform()`, you need to reassign them
to the data frame:
```{r}
# transform weight back to pounds (permanently)
tbl <- transform(tbl, weight = weight / 0.453592)
tbl
```
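
The copy behavior of `transform()` can be verified directly. In this sketch the data frame `tr_df` and its columns are made up; the call creates two new columns at once and stores the result under a different name.

```{r}
tr_df <- data.frame(kg = c(70, 85), cm = c(175, 190))

# transform() returns a modified copy; tr_df itself is untouched
tr_copy <- transform(tr_df, lb = kg / 0.453592, m = cm / 100)

names(tr_df)
names(tr_copy)
```

Only `tr_copy` contains the new columns, which is why the result has to be assigned if you want to keep it.
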