forked from hadley/r-pkgs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
r.rmd
314 lines (213 loc) · 18.3 KB
/
r.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
---
layout: default
title: R code
output: bookdown::html_chapter
---
# R code
The most important part of most packages is the `R/` directory because it contains all of your R code. We'll start with `R/`, because even if you do nothing else, putting your R files in this directory gives you some useful tools.
In this chapter you'll learn:
* How to organise the R code in your package.
* Your first package development workflow.
* What happens when you install a package.
* The difference between a library and package.
## Getting started {#getting-started}
The easiest way to get started with a package is to run `devtools::create("path/to/package/pkgname")`. This makes the package directory, `path/to/package/pkgname/`, then adds four things to make the smallest usable package:
1. An RStudio project file, `pkgname.Rproj`.
1. An `R/` directory.
1. A basic `DESCRIPTION` file.
1. A basic `NAMESPACE` file.
In this chapter, you'll learn about the `R/` directory and the RStudio project file. Ignore the other files for now. You'll learn about the `DESCRIPTION` in [package metadata](#description) and the `NAMESPACE` in [namespaces](#namespace).
__Never__ use `package.skeleton()` to create a package. It's designed for an older era of package development, and mostly serves to make your life harder, not easier. Currently I don't recommend using RStudio's "create a new package" tool because it uses `package.skeleton()`. That will be fixed by the time the book is published.
The first principle of using a package is that all R code goes in `R/`. If you have existing code for your new package, now's a good time to copy it into `R/`.
## RStudio projects {#projects}
To get started with your new package in RStudio, double-click the `package.Rproj` file that `create()` just made. This will open a new RStudio project for your package. Projects are a great way to develop packages because:
* Each project is isolated; they keep unrelated things unrelated.
* You get handy code navigation tools like `F2` to jump to a function
definition and `Ctrl + .` to look up functions by name.
* You get useful keyboard shortcuts for common package development tasks.
You'll learn these throughout the book, but to see them all press
Alt + Shift + K or use the Help | Keyboard shortcuts menu.
`create()` makes an `.Rproj` file for you. If you want to add one to an existing package, use `devtools::use_rstudio("path/to/package")`. If you don't use RStudio, you can get many of the benefits by starting a new R session and ensuring the working directory is set to the project directory.
`.Rproj` files are just text files. The project file created by devtools looks like this:
```
Version: 1.0
RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
Encoding: UTF-8
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace
```
Never modify this file by hand. Instead, use the friendly project options dialog, accessible from the projects menu in the top-right corner of RStudio.
```{r, echo = FALSE}
bookdown::embed_png("screenshots/project-options-1.png", dpi = 220)
bookdown::embed_png("screenshots/project-options-2.png", dpi = 220)
```
## Organising and running code {#r-code}
The first advantage of using a package is that it's easy to load all of the code in your package. There are two main options:
* `devtools::load_all()`, __Cmd + Shift + L__, reloads all code in the package.
In RStudio, this also saves all open files, saving you a key press.
* Build & reload, __Cmd + Shift + B__. This is only available in RStudio, because
it installs the package, then restarts R, then loads the package with
`library()` (doing this by hand is painful).
These commands support a fluid development workflow:
1. Edit R files in the editor.
1. Press Cmd + Shift + L (or Cmd + Shift + B).
1. Explore the code in the console.
1. Rinse and repeat.
You're free to arrange functions into files however you wish. It's clear that the two extremes are bad: don't put all functions in one file, or every function in its own file. My rule of thumb is that if I can't remember which file a function lives in, I probably need to split them up into more files, or give them better names. It's ok if some files only contain one function, particularly if the function is large or has a lot of documentation. Unfortunately you can't use subdirectories inside in `R/`. The next best thing is to use a common prefix, e.g., `abc-*.R`.
The exact placement of functions within files is less important if you master two important keyboard shortcuts that let you jump to the definition of a function:
* Click a function name in code and press __F2__.
* Press __Ctrl + .__ then start typing the name.
Congratulations, you now understand the basics of using a package! In the rest of this chapter, you'll learn more about the various forms of a package, and exactly what happens when you run `install.packages()` or `install_github()`.
### Avoid side effects {#side-effects}
One big difference between a script and R code in a package is that the code in a package should not side effects. In other words, your code should only create objects (mostly functions), and outside of functions, you should not call functions that affect global state:
* Don't use `library()` or `require()` - instead use the
[DESCRIPTION](description.html) to say what your package depends on.
* Don't modify global `options()` or graphics `par()`. Instead, wrap changes
in functions that the user can call when needed.
* Don't save files to disk with `write()`, `write.csv()`, or `saveRDS()`.
Instead, use [data/](data.html) to cache important data files.
There are two reasons to avoid side-effects. The first is pragmatic: these funtions will work while you're developing a package locally with `load_all()`, but they won't work once you distribute your package. That's because your R code is run once when the package is built, not every time `library()` is called. The second is principled: you shouldn't change global state behind your users' backs.
Occassionally, packages do need side-effects. This is most common if your package talks to an external system - you might need to do some initial setup when the package loads. To do that, you can use the special functions `.onLoad()` and `.onAttach()`. These are called when the package is loaded or attached. You'll learn about the distinction between the two in [Namespaces](#namespace), but for now alway use `.onLoad()` unless I explicitly say otherwise. If you use `.onLoad()`, consider using `.onUnload()` to clean up any side effects. By convention, `.onLoad()` and friends are usually saved in a file called `zzz.R`.
Some common use of `.onLoad()` and `.onAttach()` are:
* To dynamically load a compiled DLL. You no longer need to use `.onLoad()`
for this and can instead use a special namespace construct; see
[namespaces](#namespace) for details.
* Display an informative message when the package loads. This might make
usage conditions clear, or display useful tips. Startup messages is one
place where you should use `.onAttach()` instead of `.onLoad()`. To display
startup messages, always use `packageStartupMessage()`, and not `message()`.
(This allows `suppressPackageStartupMessages()` to selectively suppress
package startup messages).
```{r, eval = FALSE}
.onAttach <- function(libname, pkgname) {
packageStartupMessage("Welcome to my package")
}
```
* Connect R to another programming language. For example, if you use RJava
to connect a `.jar` file, you need to call `rJava::.jpackage()`. To
make C++ classes available as reference classes in R with RCpp modules,
you call `Rcpp::loadRcppModules()`.
* To register vignette engines with `tools::vignetteEngine()`.
* Set custom options for your package with `options()`. To avoid conflicts
with other packages, ensure that your prefix option names with the name
of your package. Also be careful not to override options that the user
has already set.
I use the following code in devtools to set up useful options:
```{r, eval = FALSE}
.onLoad <- function(libname, pkgname) {
op <- options()
op.devtools <- list(
devtools.path = "~/R-dev",
devtools.install.args = "",
devtools.name = "Your name goes here",
devtools.desc.author = '"First Last <[email protected]> [aut, cre]"',
devtools.desc.license = "What license is it under?",
devtools.desc.suggests = NULL,
devtools.desc = list()
)
toset <- !(names(op.devtools) %in% names(op))
if(any(toset)) options(op.devtools[toset])
invisible()
}
```
You should __never__ use `.onLoad()` or `.onAttach()` to load packages with `require()` or `library()`.
Note that `.First.lib()` and `.Last.lib()` are old versions of `.onLoad()` and `.onUnload()` and should be replaced with the newer system.
Another type of side-effect is defining S4 classes, methods and generics. R packages capture these side-effect so they can be replayed when the package is loaded, but they need to be called in the right order. For example, before you can define a method, you must have defined both the generic and the classes. This requires that the R files be sourced in a specific order, which controlled by the `Collate` field in the `DESCRIPTION`. This is described in more detail in [collation](#collate).
### CRAN notes {#r-cran}
If you're planning on submitting your package to CRAN, you must use only ASCII characters in your `.R` files. You can still include unicode characters in strings, but you need to use the special unicode escape `"\u1234"` format. The easiest way to do that is to use `stringi::stri_escape_unicode()`
```{r}
x <- "This is a bullet •"
y <- "This is a bullet \u2022"
identical(x, y)
cat(stringi::stri_escape_unicode(x))
```
Your R directory should not include any files other than R code. Subdirectories will be silently ignored.
## What is a package? {#package}
To develop a packge all you need is the workflow above. But to master package development, particularly distributing your package to others, you need to understand more about the different forms a package can take. This will help you understand exactly what happens when you do `install.packages()`.
So far we've just worked with a __source__ package: the development version of a package that lives on your computer. A source package is just a directory with components like `R/`, `DESCRIPTION`, and so on. There are three other types of package: bundled, binary and installed.
A package __bundle__ is a compressed version of a package in a single file. By convention, package bundles in R use the extension `.tar.gz`. This is Linux convention indicating multiple files have been collapsed into a single file (`.tar`) and then compressed using gzip (`.gz`). A bundle is not that useful in its own right, but is often used an intermediary in other steps. It's rare that you'll need it, but you can call `devtools::build()` to make a package bundle. If you decompress a bundle, you'll see it looks almost the same as your source package. The main differences between a decompressed bundle and a source package are:
* Vignettes, package level documentation, are built turning raw files like
markdown or latex into output like html and pdf.
* Your source package might contain temporary files used to save time during
development, like compilation artefacts in `src/`.
* Any files listed in `.Rbuildignore` are not included in the bundle. You'll
learn more about `.Rbuildignore` in XXX.
If you want to distribute your package to another R user (i.e. someone who doesn't necessarily have the development tools installed) you need to make a __binary__ package. Like a package bundle, a binary package is a single file, but if you uncompress it, you'll see that the internal structure is a little different than a source package:
* There are no `.R` files in the `R/` directory - instead there are three
files that store the parsed functions in an efficient format. This is
basically the result of loading all the R code and then saving the
functions with `save()`, but with a little extra metadata to make things as
fast as possible.
* A `Meta/` directory contains a number of `Rds` files. These contain cached
metadata about the package, like what topics the help files cover and
parsed versions of the `DESCRIPTION` files. (You can use `readRDS()` to see
exactly what's in those files). These files make package loading faster
by caching costly computations.
* A `html/` directory contains some files needed for HTML help.
* If you had any code in the `src/` directory there will now be a `libs/`
directory that contains the results of compiling that code for 32 bit
(`i386/`) and 64 bit (`x64/`).
* The contents of `inst/` have been moved into the top-level directory.
Binary packages are platform specific: you can't install a Windows binary package on a Mac or vice versa. Mac binary packages end in `.tgz` and Windows binary packages end in `.zip`. You can use `devtools::build(binary = TRUE)` to make a binary package.
The following diagram summarises the files present in the root directory for the source, bundled and binary versions of devtools.
```{r, echo = FALSE}
bookdown::embed_png("diagrams/package-files.png")
```
(Needs diagram instead of table)
* `cran-comments.md` and `devtools.Rproj` are included in `src` but nothing
else. That's because they're only needed for development, and are listed in
`.Rbuildignore`.
* `inst/` goes away and `templates/` appears - that's because building a
binary copies the contents of `inst/` into the root directory. You'll learn
more about that in XXX.
### Exercises
1. Go to CRAN and download the source and binary for XXX. Unzip and compare.
How do they differ?
1. Download the __source__ packages for XXX, YYY, ZZZ. What directories do they
contain?
## Package installation {#install}
An __installed__ package is just a binary package that's been uncompressed into a package library, described next. The following diagram describes the many ways a package can be installed. This diagram is complicated! In an ideal world installing a package would involve stringing together a set of simple steps: source -> bundle, bundle -> binary, binary -> installed, installed -> in memory. It's not this simple in the real world because doing each step in sequence is slow, and there are often faster short cuts available.
```{r, echo = FALSE}
bookdown::embed_png("diagrams/installation.png")
```
The power house is the command line tool `R CMD install` - it can install a source, bundle or a binary package. When you install a package from CRAN, R basically downloads the file and then runs `R CMD install` on it.
Devtools functions wrap the base R functions so that you can access them from R, rather than the command line. `install()` is effectively just a wrapper for `R CMD install`. `build()` is a wrapper for `R CMD build` that turns source packages into bundles. `install_github()` downloads a source package from github, runs `build()` to make vignettes, then uses `R CMD install` to install. `install_url()`, `install_gitorious()`, `install_bitbucket()` work similarly for packages found elsewhere on the internet.
RStudio's "Build and reload" performs one additional step - it also loads the package into memory. `load_all()` is similar, except that it skips all intermediate steps - it goes directly from a source package to a loaded in-memory package. `library()` is the equivalent for installed packages.
## What is a library? {#library}
A collection of packages is called a library. This is a bit confusing because you use the `library()` function to load a package, but the distinction between libraries and packages is important and useful. A library is just a directory containing installed packages. You can have multiple libraries on your computer and almost everyone has at least two: one for the recommended packages that come with a base R install (like `base`, `stats` etc), and one library where the packages you've installed live. Normally, that second directory varies based on the version of R that you're using. That's why it seems like you "lose" all your packages when you reinstall R - they're actually still on your hard drive, but R can't find them.
You can use `.libPaths()` to see which libraries are currently active. Here are mine:
```{r, eval = FALSE}
.libPaths()
#> [1] "/Users/hadley/R"
#> [2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"
lapply(.libPaths(), dir)
#> [[1]]
#> [1] "AnnotationDbi" "ash" "assertthat"
#> ...
#> [163] "xtable" "yaml" "zoo"
#>
#> [[2]]
#> [1] "base" "boot" "class" "cluster"
#> [5] "codetools" "compiler" "datasets" "foreign"
#> [9] "graphics" "grDevices" "grid" "KernSmooth"
#> [13] "lattice" "MASS" "Matrix" "methods"
#> [17] "mgcv" "nlme" "nnet" "parallel"
#> [21] "rpart" "spatial" "splines" "stats"
#> [25] "stats4" "survival" "tcltk" "tools"
#> [29] "translations" "utils"
```
My first lib path is where packages that I've installed live (I've installed at lot!), and the second is where packages that come with R live. These are the so called "recommended" packages available with every install of R.
When you use `library(pkg)` to load a package, R looks through each path in `.libPaths()` to see if a directory called `pkg` exists.
Packrat, which we'll learn about in Chapter XXX, automates the process of managing project specific libraries. This means that when you upgrade a package in one project, it only affects that project, not every project on your computer. This is useful because it means you can play around with cutting-edge packages in one place, but all your other projects continue to use the old reliable packages. This is also useful when you're both developing and using a package.
### Exercises
1. Where is your default library? What happens when to that library when
you install a new package from CRAN?
1. Can you have multiple version of the same package installed at the same
time?