-
Notifications
You must be signed in to change notification settings - Fork 23
/
02-import.Rmd
481 lines (289 loc) · 17.4 KB
/
02-import.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
# Import
***
This chapter includes the following recipes:
```{r echo = FALSE, results='asis'}
build_toc("02-import.Rmd")
```
***
## What you should know before you begin {-}
```{block2, type='rmdcaution'}
Before you can manipulate data with R, you need to import the data into R's memory, or build a connection to the data that R can use to access the data remotely. For example, you can build a connection to data that lives in a database.
How you import your data will depend on the format of the data. The most common way to store small data sets is as a plain text file. Data may also be stored in a proprietary format associated with a specific piece of software, such as SAS, SPSS, or Microsoft Excel. Data used on the internet is often stored as a JSON or XML file. Large data sets may be stored in a database or a distributed storage system.
When you import data into R, R stores the data in your computer's RAM while you manipulate it. This creates a size limitation: truly big data sets should be stored outside of R in a database or a distributed storage system. You can then create a connection to the system that R can use to access the data without bringing the data into your computer's RAM.
The readr package contains the most common functions in the tidyverse for importing data. The readr package is loaded when you run `library(tidyverse)`. The tidyverse also includes the following packages for importing specific types of data. These are not loaded with `library(tidyverse)`. You must load them individually when you need them.
* DBI - connect to databases
* haven - read SPSS, Stata, or SAS data
* httr - access data over web APIs
* jsonlite - read JSON
* readxl - read Excel spreadsheets
* rvest - scrape data from the web
* xml2 - read XML
### The working directory
Reading and writing files often involves the use of file paths. If you pass R a partial file path, R will append it to the end of the file path that leads to your _working directory_. In other words, partial file paths are interpretted in relation to your working directory. The working directory can change from session to session, as a general rule, the working directory is the directory where:
* Your .Rmd file lives (if you are running code by knitting the document)
* Your .Rproj file lives (if you are using the RStudio Project system)
* You opened the R interpreter from (if you are using a unix terminal/shell window)
Run `getwd()` to see the file path that leads to your current working directory.
```
## Import data quickly with a GUI
You want to import data quickly, and you do not mind using a semi-reproducible graphical user interface (GUI) to do so. Your data is not so big that it needs to stay in a database or external storage system.
#### Solution {-}
```{r echo = FALSE, fig.align='center', out.width = "80%"}
knitr::include_graphics("images/import-dataset.png")
```
#### Discussion {-}
The RStudio IDE provides an Import Dataset button in the Environment pane, which appears in the top right corner of the IDE by default. You can use this button to import data that is stored in plain text files as well as in Excel, SAS, SPSS, and Stata files.
Click the button to launch a window that includes a file browser (below). Use the browser to select the file to import.
```{r echo = FALSE, fig.align='center', out.width = "80%"}
knitr::include_graphics("images/import-wizard.png")
```
After you've selected a file, RStudio will display a preview of how the file will be imported as a data frame. Below the preview, RStudio provides a GUI interface to the common options for importing the type of file you have selected. As you customize the options, RStudio updates the data preview to display the results.
The bottom right-hand corner of the window displays R code that, if run, will reproduce your importation process programatically. You should copy and save this code if you wish to document your work in a reproducible workflow.
## Read a comma-separated values (csv) file
You want to read a .csv file.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-csv.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")
my_data <- read_csv("file.csv")
my_data
```
#### Discussion {-}
Comma separated value (.csv) files are plain text files arranged so that each line contains a row of data and each cell within a line is separated by a comma. Most data analysis software can export their data as .csv files. So if you are in a pinch you can usually export data from a program as a .csv and then read it into R.
You can also use `read_csv()` to import csv files that are hosted at their own unique URL. This only works if you are connected to the internet, e.g.
```{r eval = FALSE}
my_data <- read_csv("https://raw.githubusercontent.com/rstudio-education/tidyverse-cookbook/master/data/mtcars.csv")
```
## Read a semi-colon delimited file
You want to read a file that resembles a [csv][Read a comma-separated values (csv) file], but uses a _semi-colon_ to separate cells.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-csv2.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")
my_data <- read_csv2("file2.csv")
my_data
```
#### Discussion {-}
Semi-colon delimited files are popular in parts of the world that use commas for decimal places, like Europe. Semi-colon delimited files are often still given the .csv file extension.
## Read a tab delimited file
You want to read a file that resembles a [csv][Read a comma-separated values (csv) file], but uses a _tab_ to separate cells.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-tab.png")
```
#### Solution {-}
```{r}
# To make a .tsv file to read
write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")
my_data <- read_tsv("file.tsv")
my_data
```
#### Discussion {-}
Tab delimited files are text files that place each row of a table in its own line, and separate each cell within a line with a tab. Tab delimited files commonly have the extensions .tab, .tsv, and .txt.
## Read a text file with an unusual delimiter
You want to read a file that resembles a [csv][Read a comma-separated values (csv) file], but uses _something other than a comma_ to separate cells.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-delim.png")
```
#### Solution {-}
```{r}
# To make a file to read
write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")
my_data <- read_delim("file.txt", delim = "|")
my_data
```
#### Discussion {-}
Use `read_delim()` as you would [`read_csv()`][Read a comma-separated values (csv) file]. Pass the delimiter that your file uses as a character string to the `delim` argument of `read_delim()`.
## Read a fixed-width file
You want to read a .fwf file, which uses the fixed width format to represent a table (each column begins $n$ spaces from the left for every line).
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-fwf.png")
```
#### Solution {-}
```{r}
# To make a .fwf file to read
write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")
my_data <- read_table("file.fwf")
my_data
```
#### Discussion {-}
Fixed width files arrange data so that each row of a table is in its own line, and each cell begins at a _fixed_ number of character spaces from the beginning of the line. For example, in the example below, the cells in the second column each begin
20 spaces from the beginning of the line, no matter many spaces the contents of the first cell required. At first glance, it is easy to mistake fixed width files for tab delimited files. Fixed with files use individual spaces to separate cells, and not tabs.
```
John Smith WA 418-Y11-4111
Mary Hartford CA 319-Z19-4341
Evan Nolan IL 219-532-c301
```
`read_table()` will read fixed-width files where each column is separated by at least one whitespace on every line. Use `read_fwf()` to read fixed-width files that have non-standard formatting.
## Read a file no header
You want to read a file that does not include a line of column names at the start of the file.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-header.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file("1,2,3\n4,5,NA","file.csv")
my_data <- read_csv("file.csv", col_names = c("a", "b", "c"))
my_data
```
#### Discussion {-}
If you do not wish to supply column names, set `col_names = FALSE`. The `col_names` argument works for all readr functions that read tabular data.
## Skip lines at the start of a file when reading a file
You want to read a portion of a file, skipping one or more lines at the _start_ of the file. For example, you want to avoid reading the introductory text at the start of a file.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-skip.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file("a,b,c\n1,2,3\n4,5,NA","file.csv")
my_data <- read_csv("file.csv", skip = 1)
my_data
```
#### Discussion {-}
Set `skip` equal to the number of lines you wish to skip before you begin reading the file. The `skip` argument works for all readr functions that read tabular data, and can be combined with the [`n_max` argument][Skip lines at the end of a file when reading a file] to skip more lines at the end of a file.
When in doubt, first try reading the file with `read_lines("file.csv")` to determine how many lines you need to skip.
## Skip lines at the end of a file when reading a file
You want to read a portion of a file, skipping one or more lines at the _end_ of the file. For example, you want to avoid reading in a text comment that appears at the end of a file.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-nmax.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file("a,b,c\n1,2,3\n4,5,NA","file.csv")
my_data <- read_csv("file.csv", n_max = 1)
my_data
```
#### Discussion {-}
Set `n_max` equal to the number of _non-header_ lines you wish to read before you stop reading lines. The `n_max` argument works for all readr functions that read tabular data.
When in doubt, first try reading the file with `read_lines("file.csv")` to determine how many lines you need to skip.
## Replace missing values as you read a file
You want to convert missing values to NA as you read a file, because they have been recorded with a different symbol.
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("images/readr-missing.png")
```
#### Solution {-}
```{r}
# To make a .csv file to read
write_file("a,b,c\n1,2,3\n4,5,.","file.csv")
my_data <- read_csv("file.csv", na = ".")
my_data
```
#### Discussion {-}
R's uses the string `NA` to represent missing data. If your file was exported from another language or application, it might use a different convention. R is likely to read in the convention as a character string, failing to understand that it represents missing data. To prevent this, explicitly tell `read_csv()` which strings are used to represent missing data.
The `na` argument works for all readr functions that read tabular data.
## Read a compressed RDS file
You want to read an .RDS file.
#### Solution {-}
```{r}
# To make a .RDS file to read
saveRDS(pressure, file = "file.RDS")
my_data <- readRDS("file.RDS")
my_data
```
#### Discussion {-}
`readRDS` is a base R function—you do not need to run `library(tidyverse)` to use it. RDS is a file format native to R for saving compressed content. RDS files are _not_ text files and are not human readable in their raw form. Each RDS file contains a single object, which makes it easy to assign its output directly to a single R object. This is not necessarily the case for .RData files, which makes .RDS files safer to use.
## Read an Excel spreadsheet
You want to read an Excel spreadsheet.
#### Solution {-}
```{r}
library(readxl)
# File path to an example excel spreadsheet to import
file_path <- readxl_example("datasets.xlsx")
my_data <- read_excel(file_path)
my_data
```
#### Discussion {-}
By default, `read_excel()` reads the _first_ sheet in an Excel spreadsheet. See the [next recipe][Read a specific sheet from an Excel spreadsheet] to read in sheets other than the first.
`read_excel()` is loaded with the readxl package, which must be loaded with `library(readxl)`. Pass `read_excel()` a file path or URL that leads to the Excel file you wish to read. `readxl_example("datasets.xlsx")` returns a file path that leads to an example Excel spreadsheet that is conveniently installed with the readxl package.
## Read a specific sheet from an Excel spreadsheet
You want to read a specific sheet from an Excel spreadsheet.
#### Solution {-}
```{r}
library(readxl)
# File path to an example excel spreadsheet to import
file_path <- readxl_example("datasets.xlsx")
my_data <- read_excel(file_path, sheet = "chickwts")
my_data
```
#### Discussion {-}
To read a specific sheet from a larger Excel spreadsheet, use the `sheet` argument to pass [`read_excel()`][Read an Excel spreadsheet] the name of the sheet as a character string. You can also pass `sheet` the number of the sheet to read. For example, `read_excel(file_path, sheet = 2)` would read the second spreadsheet in the file.
If you are unsure of the name of a sheet, use the `excel_sheets()` function to list the names of the spreadsheets within a file, without importing the file, e.g. `excel_sheets(file_path)`.
## Read a field of cells from an excel spreadsheet
You want to read a subset of cells from an Excel spreadsheet.
#### Solution {-}
```{r}
library(readxl)
# File path to an example excel spreadsheet to import
file_path <- readxl_example("datasets.xlsx")
my_data <- read_excel(file_path, range = "C1:E4", skip = 3, n_max = 10)
my_data
```
#### Discussion {-}
`read_excel()` adopts the [`skip`][Skip lines at the start of a file when reading a file] and [`n_max][Skip lines at the end of a file when reading a file] arguments of reader functions to skip rows at the top of the spreadsheet and to control how far down the spreadsheet to read. Use the `range` argument to specify a subset of columns to read. Two column names separated by a `:` specifies those two columns and every column between them.
## Write to a comma-separate values (csv) file
You want to save a tibble or data frame as a .csv file.
#### Solution {-}
```{r eval = FALSE}
write_csv(iris, path = "my_file.csv")
```
#### Discussion {-}
`write_csv()` is about twice as fast as base R's `write.csv()` and never adds rownames to a table. Tidyverse users tend to place important information in its own column, where it can be easily accessed, instead of in the rownames, where it cannot be as easily accessed.
R will save your file at the location described by appending the `path` argument to your [working directory][What you should know before you begin]. If your path contains directories, these must exist _before_ you run `write_csv()`.
## Write to a semi-colon delimited file
You want to save a tibble or data frame as a .csv file _that uses semi-colons to delimit cells_.
#### Solution {-}
```{r eval = FALSE}
write_csv2(iris, path = "my_file.csv")
```
#### Discussion {-}
R will save your file at the location described by appending the `path` argument to your [working directory][What you should know before you begin]. If your path contains directories, these must exist _before_ you run `write_csv2()`.
## Write to a tab-delimited file
You want to save a tibble or data frame as a tab delimited file.
#### Solution {-}
```{r eval = FALSE}
write_tsv(iris, path = "my_file.tsv")
```
#### Discussion {-}
R will save your file at the location described by appending the `path` argument to your [working directory][What you should know before you begin]. If your path contains directories, these must exist _before_ you run `write_tsv()`.
## Write to a text file with arbitrary delimiters
You want to save a tibble or data frame as a plain text file that uses an unusual delimiter.
#### Solution {-}
```{r eval = FALSE}
write_delim(iris, path = "my_file.tsv", delim = "|")
```
#### Discussion {-}
`write_delim()` behaves like [`write_csv()`][Write to a comma-separate values (csv) file], but requires a `delim` argument. Pass `delim` the delimiter you wish to use to separate cells, as a character string.
R will save your file at the location described by appending the `path` argument to your [working directory][What you should know before you begin]. If your path contains directories, these must exist _before_ you run `write_delim()`.
## Write to a compressed RDS file
You want to write a tibble or data frame to a .RDS file.
#### Solution {-}
```{r}
saveRDS(iris, file = "my_file.RDS")
```
#### Discussion {-}
`saveRDS()`, along with `readRDS()`, is a base R function, which explains the difference in the argument names as well as the read/write naming pattern compared to readr functions—take note!
R will save your file at the location described by appending the `path` argument to your [working directory][What you should know before you begin]. If your path contains directories, these must exist _before_ you run `saveRDS()`. Use the .RDS file extension in the file name to help make the uncommon file format obvious.
```{r include = FALSE}
files <- c(
"file.RDS",
"file.csv",
"file.fwf",
"file.tsv",
"file.txt",
"file2.csv"
)
map(files, ~if(file.exists(.x)) file.remove(.x))
```
<!--
You want to
#### Solution {-}
#### Discussion {-}
-->