forked from cambiotraining/r-intermediate
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2.dplyr-intro.Rmd
461 lines (297 loc) · 13.1 KB
/
2.dplyr-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
---
title: "Introduction to dplyr"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document
---
# Overview of this section


> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)
http://vimeo.com/33727555
http://vita.had.co.nz/papers/tidy-data.pdf
- tidy data
+ what it means
+ how to create it
- **`select`** and **`mutate`** verbs in dplyr
- advice for cleaning strings and dates

As we saw, our data, and especially data we might find out "in the wild", may need "cleaning"" before we can use it. We will discuss how to clean our data for analysis. However, there are steps you can take to make sure your data are tidy and organised before importing into R.
- [Karl Broman's tutorial on Data Organisation](http://kbroman.org/dataorg/)
- ["Data Carpentry" workshop](http://lgatto.github.io/2015-09-10-CAM/)
We have already seen a ***tidy*** dataset in the form of `iris`. As saw, it was possible to produce visualisations of the data using the `~` syntax.
```{r}
head(iris)
```
## Example (from tidyr paper)
Consider the following....
```{r echo=FALSE,results='asis'}
library(knitr)
untidy <- data.frame(Name = c("John Smith","Jane Doe","Mary Johnson"),treatmenta = c("-",16,3),treatmentb=c(2,11,1))
kable(untidy)
```
Data in this format are quite familiar to us, but not easily-interpretable by the computer. We need to think of the dataset in terms of *variables* and *values*. In this example dataset we have *18* values and *3* variables
1. Person (John, Jane or Mary)
2. Treatment (A or B)
3. "Result"
The guiding principles:
- Each column is a variable
- Each row is an observation
- Each type of observational unit forms a table
The tidy form of this data thus becomes;
```{r echo=FALSE,warning=FALSE}
library(tidyr)
suppressPackageStartupMessages(library(dplyr))
gather(untidy, Treatment, Result, treatmenta:treatmentb) %>% mutate(Treatment = gsub("treatment","",Treatment))
```
`gather` is a function in the `tidyr` package that can collapse multiple columns into key-value pairs.
- similar to `stack` in base R
- or `melt` in the `reshape` / `reshape2` pacakges
```{r}
library(tidyr)
untidy <- read.delim("tidyr-example.txt")
untidy
```
The arguments are;
- the data frame you want to manipulate
- the name of the *key* column you want to create
- the name of the *value* column you want to create
- the *names* of the columns in the data frame that are to be collapsed
+ note you don't need "" to define the column names
```{r warning=FALSE}
gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
```
When specifying columns, an alternative is to ignore some column names
```{r warning=FALSE}
gather(untidy, Treatment, Result, -Name)
```
Can also specify column index
```{r warning=FALSE}
gather(untidy, Treatment, Result, 2:3)
```
Note that after all these operations, the original data frame (`untidy`) is unaltered
- we need to create a new variable to save the result
- you should never work directly on the raw data
```{r}
untidy
tidy <- gather(untidy, Treatment, Result, c(treatmenta,treatmentb))
```
******
## Example: Simulated clinical data
Using data from the Mango Solutions training dataset
```{r}
library(mangoTraining)
data("messyData")
```
```{r results='asis',echo=FALSE}
library(mangoTraining)
data("messyData")
kable(head(messyData))
```
- What variables and observations do we have?
- What might a 'tidy' version of the dataset look like?
- Use the `tidyr` package to create a tidy version of the data
******
## Tidy form
```{r echo=FALSE, results='asis',message=FALSE}
library(tidyr)
tidy <- gather(messyData,Treatment, Count, -Subject)
```
`separate` from the same package will also sort out the treatment names.
```{r eval=FALSE}
separate(tidy,Treatment,into=c("Treatment","Rep"))
```
```{r echo=FALSE}
head(separate(tidy,Treatment,into=c("Treatment","Rep")))
```
N.B. `spread` has the opposite effect to gather and will translate tidy data back into human-readable form
```{r eval=FALSE}
spread(tidy, Treatment,Count)
```
```{r echo=FALSE}
head(spread(tidy, Treatment,Count))
```
## dpylr introduction, and more data-cleaning
We are going to explore some of the basic features of `dplyr` using some patient data; the kind of data that we might encounter in the wild. However, rather than using real-life data we are going to make some up. There is a package called `wakefield` that is particularly convenient for this task.
Various patient characteristics have been generated and saved in the file `patient-data.txt`.
```{r}
patients <- read.delim("patient-data.txt")
```
In Rstudio , you can view the contents of this data frame in a tab.
```{r eval=FALSE}
View(patients)
```
Using `dplyr`, `patients` can be converted to a special kind of data frame; the `tbl_df`. The main difference is that the whole data frame does not get printed to screen. All operations that we could do on a data frame can be applied to a `tbl_df`.
```{r}
library(dplyr)
patients <- tbl_df(patients)
patients
```
The `tbl_df` function in dplyr creates a new object with more-convenient default display
- The object behaves in exactly the same way as a data frame
- `glimpse` can be used to get a snapshot of your data
+ similar to `str` function in base R
```{r}
glimpse(patients)
```
## Selecting variables; the `select` verb
As we mentioned, `dplyr` introduces a series of ***verbs*** that can be used to operate on our data. The first we will introduce is *`select`*

The default dispaly of a `tbl_df` object is to print the data frame in an intelligent manner, so not to flood the R console. If we want to control what variables are printed we can use the `select` function, which allows us to print certain columns to the screen.
In the R introduction, we saw how to extract columns from a data frame with a `$` symbol.
- Remember that we're not changing the data frame, just exploring it
```{r}
patients$Name
```
The `$` operator works fine for single columns, but for multiple columns it quickly gets a bit messy as we have to resort to using `[]`
```{r}
patients[,c("Name","Race")]
```
Non R users probably find these commands a bit obtuse
+ why the `[ ]`?
+ what is `c`?
+ need to remember the row and column index
+ `[,...]` means display all rows
However, `select` is a lot more intuitive; easier to convert to the sentance "Select the Name column from patients"
- Notice the lack of `""`
```{r}
select(patients, Name)
```
*"Select the Name and Race columns"*
```{r}
select(patients, Name, Race)
```
Excluding certain columns in base R would require us to know the *index* we wanted to exclude, or to do something complicated with the column names
```{r eval=FALSE}
patients[,-2]
patients[,setdiff(colnames(patients), "Name")]
```
```{r eval=FALSE}
head(patients[,-2])
head(patients[,setdiff(colnames(patients), "Name")])
```
Compare to dplyr version....*"Select all columns apart from Name"*
```{r}
select(patients, -Name)
```
Similarly easier to get columns within a particular range. *"Select all columns between Name and Sex"*
```{r}
select(patients, Name:Sex)
```
Say we cannot remember the name of the column
```{r}
select(patients, starts_with("Grade"))
```
see also
- contains, matches, ends_with
## Transforming / cleaning the data
We can now begin to explore the data. Imagine that we want to do a simple barplot to the number of males and females. Using base graphics, we can do this with;
```{r}
barplot(table(patients$Sex))
```
******
- What do you think the problem might be?
- Try printing out all values for this column, or using the `nchar` function to determine the problem
******
So we notice that the Sex column has "whitespace"; not an uncommon problem if data have been entered manually.
```{r}
table(select(patients, Sex))
as.character(patients$Sex)
```
The base distribution of R has many functions for dealing with strings. However, the `stringr` package (again from Hadley Wickham) aims to provide a clean modern interface to many common string operations.
Of particular interest here is the `str_trim` (`?str_trim`) function that will trim whitespace from a vector of strings. Arguments can be used to specify whether to remove whitespace from either side of the string, or just the left or right.
```{r}
library(stringr)
str_trim(patients$Sex)
```
Once we have worked-out how to created a cleaned version of the variable, we need to update the data frame. This can be done using the next of the **dplyr verbs**; `mutate`.
## Creating new variables: The `mutate` verb
`mutate` allows us to compute and append one or more new columns to a data frame. We can create new, cleaned, data frame and overwrite the Sex column
```{r}
mutate(patients, Sex = str_trim(Sex))
patients_clean <- mutate(patients, Sex = factor(str_trim(Sex)))
```
Rather than removing whitespace, sometimes we might want to adjust a set of strings so that they have a fixed width. We can do this be adding a *padding* character to each string if necessary. This operations is often done with a set of identifiers.
```{r}
str_pad(patients$ID,pad = "0",width=3)
patients_clean <- mutate(patients_clean, ID=str_pad(patients_clean$ID,pad = "0",width=3))
```
Similarly, we might want to trim all the strings between a particular start and end position
```{r}
str_sub(patients$State, 1,3)
str_sub(patients$Name, 2,5)
```
Finally, we can convert entire strings to upper- or lower-case
```{r}
str_to_lower(patients$Pet)
str_to_upper(patients$Name)
```
## String Replacement
We might want to look at the distribution of heights amongst our patients. However, if we try this we are immediately hit with an error message
```{r eval=FALSE}
hist(patients$Height)
```
```{r echo=FALSE}
err <- try(hist(patients$Height),silent = TRUE)
cat(err)
```
The problem is that R thinks that values are not numbers. Why might that be?
```{r}
select(patients,Height)
```
To convert the values back to numbers we first need to remove the `"cm"` at the end of each value. We can do this with a *string replacement* operation. In this case, the particular string we want to put in the place of `"cm"` is the *empty string* `""`. The replacement can be done using the `str_replace_all` function.
```{r}
mutate(patients_clean, Height= str_replace_all(patients_clean$Height,pattern = "cm",""))
```
Note that the input to this function is the `patients_clean` object that we created previously.
However, what do you notice about the class reported for the Height column after this operation.
+ [compare to the `Count` column, for example]
We need an extra conversion step to make sure these are represented as *numeric* values.
```{r}
patients_clean <- mutate(patients_clean, Height= as.numeric(str_replace_all(patients_clean$Height,pattern = "cm","")))
hist(patients_clean$Height)
```
Another potential issue in the data is the way in which the *grade* (i.e. tumour grade) of each patient has been recorded. Some data-entry schemes might use improbable values (such as 99) to encode missing values. Obviously this can introduce errors with some statistical analyses. What we probably want is to use the special `NA` value in R
```{r}
patients$Grade_Level
```
```{r}
grds <- patients$Grade_Level
ifelse(grds == 99, NA,grds)
patients_clean <- mutate(patients_clean, Grade = ifelse(grds == 99, NA,grds))
```
## Dealing with dates
Dates are often problematic as they may be encoded in lots of different schemes
If we wanted the year in which wach person was born it would be tempting to *trim* the first four characters of the string. However, getting the month or day might be a bit more problematic.
```{r}
str_sub(patients$Birth,1,4)
```
The `lubridate` package has lots of useful functions for dealing with dates (also by Hadley Wickham!). For example it allows the year, month and day to be extracted.
```{r}
library(lubridate)
year(patients$Birth)
```
It also supporting some clever manipulation, such as being able to subtract and add times.
```{r}
dob <- ymd(patients$Birth)
today <- ymd("20160509")
age <- interval(dob, today)
as.period(age)
patients_clean <- mutate(patients_clean, Age = year(as.period(age)))
patients_clean
```
## Exercise
- We want to calculate the Body Mass Index (BMI) for each of our patients
- $BMI = (Weight) / (Height^2)$
+ where Weight is measured in Kilograms, and Height in Metres
- A BMI of 25 is considered overweight, calculate a new variable to indicate which individuals are overweight
- For a follow-on study, we are interested in overweight smokers
+ clean the `Smokes` column to contain just `TRUE` or `FALSE` values
- (EXTRA) What other problems can you find in the data?
```{r echo=FALSE}
patients_clean <- mutate(patients_clean, Weight = as.numeric(str_replace_all(patients_clean$Weight,"kg","")))
patients_clean <- mutate(patients_clean, BMI = (Weight/(Height/100)^2), Overweight = BMI > 25)
patients_clean <- mutate(patients_clean, Smokes = str_replace_all(Smokes, "Yes", TRUE))
patients_clean <- mutate(patients_clean, Smokes = as.logical(str_replace_all(Smokes, "No", FALSE)))
patients_clean
```