# The package within {#sec-package-within}
```{r, echo = FALSE}
source("common.R")
```
```{r, include = FALSE}
# user-facing code is cleaner if we write output files to working directory
# but periodically I clean them up
cleanup <- function() {
  files <- fs::dir_ls(regexp = "^[[:digit:]]{4}-.+_clean.csv$")
  if (length(files) && rlang::is_interactive()) {
    rlang::inform(glue::glue("Deleting {length(files)} '.*_clean.csv' files"))
  }
  fs::file_delete(files)
  invisible()
}
fs::dir_create("Package-within-files")
```
## Introduction
This part of the book ends the same way it started, with the development of a small toy package.
@sec-whole-game established the basic mechanics, workflow, and tooling of package development, but said practically nothing about the R code inside the package.
Here we have a totally different emphasis.
In this chapter, we focus primarily on the package's R code and how it differs from R code in a script.
We start with a data analysis script and show how to find the package that lurks within.
We isolate and then extract reusable data and logic from the script, put this into an R package, and then use that package in a much simplified script.
We make a few rookie mistakes along the way, in order to highlight special considerations for the R code inside a package.
*The section headers incorporate the NATO phonetic alphabet and have no specific meaning. They are just a convenient way to mark our progress towards a working package.*
## Alfa: a script that works
Here is a fictional data analysis script `data-cleaning.R` for a group that collects reports from people who went for a swim:
> Where did you swim and how hot was it outside?
Their data usually comes as a CSV file, which they read into a data frame.
```{cat, engine.opts = list(file = "swim.csv")}
name,where,temp
Adam,beach,95
Bess,coast,91
Cora,seashore,28
Dale,beach,85
Evan,seaside,31
```
```{r}
infile <- "swim.csv"
(dat <- read.csv(infile))
```
They then classify each observation as using American ("US") or British ("UK") English, based on the word chosen to describe the sandy place where the ocean and land meet.
The `where` column is used to build the new `english` column.
```{r}
dat$english[dat$where == "beach"] <- "US"
dat$english[dat$where == "coast"] <- "US"
dat$english[dat$where == "seashore"] <- "UK"
dat$english[dat$where == "seaside"] <- "UK"
```
Sadly, the temperatures are often reported in a mix of Fahrenheit and Celsius.
In the absence of better information, they guess that Americans report temperatures in Fahrenheit and therefore those observations are converted to Celsius.
```{r}
dat$temp[dat$english == "US"] <- (dat$temp[dat$english == "US"] - 32) * 5/9
dat
```
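As a quick sanity check on the arithmetic, the conversion is $C = (F - 32) \times 5/9$. The sketch below applies the same expression to a few reference temperatures; the vector `check` is a name introduced here purely for illustration.

```r
# Freezing point, boiling point, and Adam's reported 95 degrees Fahrenheit
check <- (c(32, 212, 95) - 32) * 5/9
check
#> [1]   0 100  35
```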
Finally, this cleaned (cleaner?) data is written back out to a CSV file.
They like to capture a timestamp in the filename when they do this[^package-within-1].
[^package-within-1]: `Sys.time()` returns an object of class `POSIXct`, therefore when we call `format()` on it, we are actually using `format.POSIXct()`.
Read the help for [`?format.POSIXct`](https://rdrr.io/r/base/strptime.html) if you're not familiar with such format strings.
```{r}
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
(outfile <- paste0(timestamp, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile)))
write.csv(dat, file = outfile, quote = FALSE, row.names = FALSE)
```
```{r include = FALSE}
cleanup()
```
Even if your typical analytical tasks are quite different, hopefully you see a few familiar patterns here.
It's easy to imagine that this group does very similar pre-processing of many similar data files over time.
Their analyses can be more efficient and consistent if they make these standard data maneuvers available to themselves as functions in a package, instead of inlining the same data and logic into dozens or hundreds of data ingest scripts.
## Bravo: a better script that works
The package that lurks within the original script is actually pretty hard to see!
It's obscured by a few suboptimal coding practices, such as the use of repetitive copy/paste-style code and the mixing of code and data.
Therefore a good first step is to refactor this code, isolating as much data and logic as possible in proper objects and functions, respectively.
At the same time, we introduce the use of some add-on packages, for several reasons.
First, we would actually use the tidyverse for this sort of data wrangling.
Second, many people use add-on packages in their scripts, so it is good to see how add-on packages are handled as we create this package.
Here's the next version of the script.
```{cat, engine.opts = list(file = "Package-within-files/bravo-data-cleaning.R")}
library(tidyverse)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
lookup_table <- tribble(
  ~where, ~english,
  "beach", "US",
  "coast", "US",
  "seashore", "UK",
  "seaside", "UK"
)

dat <- dat %>%
  left_join(lookup_table)

f_to_c <- function(x) (x - 32) * 5/9

dat <- dat %>%
  mutate(temp = if_else(english == "US", f_to_c(temp), temp))
dat

now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
  paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
write_csv(dat, outfile_path(infile))
```
```{r, code = readLines("Package-within-files/bravo-data-cleaning.R"), R.options = list(tidyverse.quiet = TRUE)}
```
```{r include = FALSE}
cleanup()
```
The key features to note are:
- We are using functions from tidyverse packages (specifically from readr and dplyr).
- The map between different "beach" words and whether they are considered to be US or UK English is now isolated in a lookup table, which lets us create the `english` column in one go with a `left_join()`. This also makes it easier to add new words in the future.
- The `f_to_c()`, `timestamp()`, and `outfile_path()` functions now hold the logic for converting temperatures and forming the timestamped output file name.
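To see why isolating the mapping pays off, here is a minimal base R sketch of the same lookup idea, using a named character vector instead of a tibble and a join. The toy `dat` below copies the `name` and `where` values from `swim.csv`; the `lookup` vector is an assumption made for illustration, not part of the script.

```r
# Toy data mirroring the first two columns of swim.csv
dat <- data.frame(
  name  = c("Adam", "Bess", "Cora", "Dale", "Evan"),
  where = c("beach", "coast", "seashore", "beach", "seaside")
)

# The lookup table as a named character vector: names are the keys
lookup <- c(beach = "US", coast = "US", seashore = "UK", seaside = "UK")

# Index by key to classify every row at once; unmatched words become NA
dat$english <- unname(lookup[dat$where])
dat$english
#> [1] "US" "US" "UK" "US" "UK"
```

Adding a new "beach" word then means adding one element to `lookup`, rather than another copy/pasted assignment.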
It's getting easier to recognize the reusable bits of this script, i.e. the bits that have nothing to do with a specific input file, like `swim.csv`.
This sort of refactoring often happens naturally on the way to creating your own package, but if it does not, it's a good idea to do this intentionally.
## Charlie: external helpers
A typical next step is to move reusable data and logic out of the analysis script and into one or more separate files.
This is a conventional opening move, if you want to use these same helper files in multiple analyses.
Here is the content of `beach-lookup-table.csv`:
```{cat, engine.opts = list(file = "beach-lookup-table.csv")}
where,english
beach,US
coast,US
seashore,UK
seaside,UK
```
```{r comment = "", echo = FALSE}
writeLines(readLines("beach-lookup-table.csv"))
```
Here is the content of `cleaning-helpers.R`:
```{cat, engine.opts = list(file = "Package-within-files/charlie-cleaning-helpers.R"), class.source = "R"}
library(tidyverse)
localize_beach <- function(dat) {
  lookup_table <- read_csv(
    "beach-lookup-table.csv",
    col_types = cols(where = "c", english = "c")
  )
  left_join(dat, lookup_table)
}

f_to_c <- function(x) (x - 32) * 5/9

celsify_temp <- function(dat) {
  mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}

now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
  paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
We've added some high-level helper functions, `localize_beach()` and `celsify_temp()`, to the pre-existing helpers (`f_to_c()`, `timestamp()`, and `outfile_path()`).
Here is the next version of the data cleaning script, now that we've pulled out the helper functions (and lookup table).
```{r, include = FALSE}
fs::file_copy(
"Package-within-files/charlie-cleaning-helpers.R",
"cleaning-helpers.R"
)
```
```{cat, engine.opts = list(file = "Package-within-files/charlie-data-cleaning.R")}
library(tidyverse)
source("cleaning-helpers.R")
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
(dat <- dat %>%
  localize_beach() %>%
  celsify_temp())
write_csv(dat, outfile_path(infile))
```
```{r, code = readLines("Package-within-files/charlie-data-cleaning.R"), R.options = list(tidyverse.quiet = TRUE)}
```
```{r include = FALSE}
cleanup()
fs::file_delete("cleaning-helpers.R")
```
You'll notice that the script is getting shorter and, hopefully, easier to read and modify, because repetitive and fussy clutter has been moved out of sight.
Whether the code is actually easier to work with is subjective and depends on how natural the "interface" feels for the people who actually preprocess swimming data.
These sorts of design decisions are the subject of a separate project: [design.tidyverse.org](https://design.tidyverse.org/).
Let's assume the group agrees that our design decisions are promising, i.e. we seem to be making things better, not worse.
Sure, the existing code is not perfect, but this is a typical developmental stage when you're trying to figure out what the helper functions should be and how they should work.
## Delta: an attempt at a package
Let's make a package!
Here's the simplest thing you might hope will "just work": make `cleaning-helpers.R` into an R package.
Somehow.
Concretely, we do this:
- Use `usethis::create_package()` to scaffold a new R package.
  - This is a good first step!
- Copy `cleaning-helpers.R` into the new package, specifically, to `R/cleaning-helpers.R`.
  - This is morally correct, but mechanically wrong in several ways, as we will soon see.
- Copy `beach-lookup-table.csv` into the new package. But where? Let's try the top-level of the source package.
  - This is not going to end well. Shipping data in a package is a special topic, which is covered in @sec-data.
- Install this package.
  - Despite all of the problems identified above, this actually works! Which is interesting, because we can (try to) use it and see what happens.
Here's the version of the script that you hope will run after successfully installing this package.
```{cat, engine.opts = list(file = "Package-within-files/delta-main.R"), class.source = "R"}
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
  localize_beach() %>%
  celsify_temp()
write_csv(dat, outfile_path(infile))
```
The only change from our previous script is that
```{r eval = FALSE}
source("cleaning-helpers.R")
```
has been replaced by
```{r eval = FALSE}
library(delta)
```
Here's what actually happens when we try to run this:
```{r eval = FALSE}
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
  localize_beach() %>%
  celsify_temp()
#> Error in localize_beach(.) : could not find function "localize_beach"
write_csv(dat, outfile_path(infile))
#> Error in outfile_path(infile) : could not find function "outfile_path"
```
None of our helper functions are actually available for use, even though we call `library(delta)`!
In contrast to `source()`ing a file of helper functions, attaching a package does not dump its functions into the global workspace.
By default, functions in a package are only for internal use.
We need to export `localize_beach()`, `celsify_temp()`, and `outfile_path()` so our users can call them.
In this book, we achieve this by putting `@export` in the special roxygen comment above each function (namespace management is covered in @sec-namespace).
```{r eval = FALSE}
#' @export
celsify_temp <- function(dat) {
  mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
```
Let's say we do that, run `devtools::document()` to (re)generate a `NAMESPACE` file, and re-install the package.
Now when we execute our script, it works!
Correction: it works *sometimes*.
Specifically, it works if and only if the working directory is set to the top-level of the source package.
From any other working directory, we still get an error:
```{r eval = FALSE}
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
  localize_beach() %>%
  celsify_temp()
#> Error: 'beach-lookup-table.csv' does not exist in current working directory ('/Users/jenny/tmp').
write_csv(dat, outfile_path(infile))
```
The lookup table consulted inside `localize_beach()` cannot be found.
One does not simply dump CSV files into the source of an R package and expect things to "just work".
We will fix this in our next iteration of the package (@sec-data has full coverage of how to include data in a package).
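For orientation, base R's `system.file()` is one way to locate a file inside an *installed* package, independent of the working directory. The sketch below demonstrates it with the stats package, since delta is not installed here. The commented-out call at the end is hypothetical: it assumes the CSV had been placed under `inst/extdata/` in the source package, which is not what we did above.

```r
# system.file() resolves a path inside an installed package,
# regardless of the current working directory
path <- system.file("DESCRIPTION", package = "stats")
file.exists(path)
#> [1] TRUE

# A file that is not part of the installed package yields an empty string
system.file("beach-lookup-table.csv", package = "stats")
#> [1] ""

# Hypothetical usage, assuming the CSV lived in inst/extdata/ of delta:
# system.file("extdata", "beach-lookup-table.csv", package = "delta")
```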
Before we abandon this initial experiment, let's also marvel at the fact that we were able to install, attach, and, to a certain extent, use a fundamentally broken package.
`load_all()` works fine, too!
This is a sobering reminder that you should be running `R CMD check`, probably via `check()`, very often during development.
This will quickly alert you to many problems that simple installation and usage do not reveal.
Indeed, `R CMD check` fails for this package and we see this:

    * installing *source* package ‘delta’ ...
    ** using staged installation
    ** R
    ** byte-compile and prepare package for lazy loading
    Error in library(tidyverse) : there is no package called ‘tidyverse’
    Error: unable to load R code in package ‘delta’
    Execution halted
    ERROR: lazy loading failed for package ‘delta’
    * removing ‘/Users/jenny/rrr/delta.Rcheck/delta’

What do you mean "there is no package called 'tidyverse'"?!?
We're using it, with no problems, in our main script!
Also, we've already installed and used this package, why can't `R CMD check` install it?
This error is what happens when the strictness of `R CMD check` meets the very first line of `R/cleaning-helpers.R`:
```{r, eval = FALSE}
library(tidyverse)
```
This is **not** how you declare that your package depends on another package (the tidyverse, in this case).
This is **not** how you make functions in another package available for use in yours.
Dependencies must be declared in `DESCRIPTION` (and that's not all).
Since we declared no dependencies, `R CMD check` takes us at our word and tries to install our package with only the base packages available, which means this `library(tidyverse)` call fails.
A "regular" installation succeeds, simply because the tidyverse is available in your regular library, which hides this particular mistake.
To review, copying `cleaning-helpers.R` to `R/cleaning-helpers.R`, without further modification, was problematic in (at least) these ways:
- Does not account for exported vs. non-exported functions.
- The CSV file holding our lookup table cannot be found in the installed package.
- Does not properly declare our dependency on other add-on packages.
## Echo: a working package
We're ready to make the most minimal version of this package that actually works.
Here is the new version of `R/cleaning-helpers.R`[^package-within-2]:
[^package-within-2]: Putting everything in one file, with this name, is not ideal, but it is technically allowed.
We discuss organising and naming the files below `R/` in @sec-code-organising.
```{cat, engine.opts = list(file = "Package-within-files/echo-cleaning-helpers.R"), class.source = "R"}
lookup_table <- dplyr::tribble(
  ~where, ~english,
  "beach", "US",
  "coast", "US",
  "seashore", "UK",
  "seaside", "UK"
)

#' @export
localize_beach <- function(dat) {
  dplyr::left_join(dat, lookup_table)
}

f_to_c <- function(x) (x - 32) * 5/9

#' @export
celsify_temp <- function(dat) {
  dplyr::mutate(dat, temp = dplyr::if_else(english == "US", f_to_c(temp), temp))
}

now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")

#' @export
outfile_path <- function(infile) {
  paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
We've gone back to defining the `lookup_table` with R code, since our initial attempt to read it from CSV created some sort of filepath snafu.
This is OK for small, internal, static data, but remember to see @sec-data for more general techniques for storing data in a package.
All of our calls to tidyverse functions have now been qualified with the name of the specific package that actually provides the function, e.g. `dplyr::mutate()`.
There are other ways to access functions in another package, explained in @sec-namespace, but this is our recommended default.
It is also our strong recommendation that no one depend on the tidyverse meta-package in a package[^package-within-3].
Instead, it is better to identify the specific package(s) you actually use.
In this case, our package only uses dplyr.
[^package-within-3]: The blog post [The tidyverse is for EDA, not packages](https://www.tidyverse.org/blog/2018/06/tidyverse-not-for-packages/) elaborates on this.
The `library(tidyverse)` call is gone and instead we declare our use of dplyr in the Imports field of `DESCRIPTION`:

    Package: echo
    (... other lines omitted ...)
    Imports:
        dplyr

This, together with our use of namespace-qualified calls, like `dplyr::left_join()`, constitutes a valid way to use another package within ours.
The metadata conveyed via DESCRIPTION is covered in @sec-description.
All of the user-facing functions have an `@export` tag in their roxygen comment, which means that `devtools::document()` adds them correctly to the `NAMESPACE` file.
Note that `f_to_c()` is currently only used internally, inside `celsify_temp()`, so we have not exported it (likewise for `timestamp()`).
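The split between exported and internal objects is not specific to our package. As a rough illustration, the sketch below inspects the stats package, which ships with R. The variable names (`all_objects`, `exported`, `internal`) are introduced here only for the example.

```r
# Every object defined in the stats namespace, exported or not
all_objects <- ls(asNamespace("stats"), all.names = TRUE)

# Only the names that stats makes available to its users
exported <- getNamespaceExports("stats")

# median() is exported, so users can call it after attaching stats
"median" %in% exported
#> [1] TRUE

# Objects that exist in the namespace but are internal to the package
internal <- setdiff(all_objects, exported)
length(internal) > 0
#> [1] TRUE
```

Our `f_to_c()` and `timestamp()` play the same role as those internal objects: available to the package's own functions, but not to its users.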
This version of the package can be installed, used, AND it technically passes `R CMD check`, though with 1 note and 1 warning.

    * checking R code for possible problems ... NOTE
    celsify_temp: no visible binding for global variable ‘english’
    celsify_temp: no visible binding for global variable ‘temp’
    Undefined global functions or variables:
      english temp

    * checking for missing documentation entries ... WARNING
    Undocumented code objects:
      ‘celsify_temp’ ‘localize_beach’ ‘outfile_path’
    All user-level objects in a package should have documentation entries.
    See chapter ‘Writing R documentation files’ in the ‘Writing R
    Extensions’ manual.

The "no visible binding" note is a peculiarity of using dplyr and unquoted variable names inside a package, where the use of bare variable names (`english` and `temp`) looks suspicious.
We could add either of these lines to any file below `R/` to eliminate this note[^package-within-4]:
[^package-within-4]: For more details, see the [Programming with dplyr vignette](https://dplyr.tidyverse.org/articles/programming.html#eliminating-r-cmd-check-notes).
```{r, eval = FALSE}
# option 1 (then you should also put utils in Imports)
utils::globalVariables(c("english", "temp"))
# option 2
english <- temp <- NULL
```
The warning about missing documentation is because we haven't properly documented our exported functions.
This is a valid concern and something you should absolutely address in a real package.
You've already seen how to create help files with roxygen comments in [the whole game chapter](whole-game-document) and we cover this topic thoroughly in @sec-man.
Therefore, we won't discuss this further here.
## Foxtrot: build time vs. run time {#sec-package-within-build-time-run-time}
The package works, which is great, but group members notice something odd about the timestamps:
```{r, eval = FALSE}
Sys.time()
#> [1] "2022-02-24 20:49:59 PST"
outfile_path("INFILE.csv")
#> [1] "2020-September-03_11-06-33_INFILE_clean.csv"
```
The datetime in the timestamped filename doesn't reflect the time reported by the system.
In fact, the users claim that the timestamp never seems to change at all!
Why is this?
Recall how we form the filepath for output files:
```{r, eval = FALSE}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
  paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
The fact that we capture `now <- Sys.time()` outside of the definition of `outfile_path()` has probably been vexing some readers for a while.
`now` reflects the instant in time when we execute `now <- Sys.time()`.
In the initial approach, that happened when we `source()`d `cleaning-helpers.R`.
That's not ideal, but it was probably a pretty harmless mistake, because the helper file would be `source()`d shortly before we wrote the outfile.
But this approach is quite devastating in the context of a package.
`now <- Sys.time()` is executed **when the package is built**.
And never again.
It is very easy to subconsciously assume your package code is re-evaluated when the package is installed, attached, or used.
But it is not.
Yes, the code *inside your functions* is absolutely run whenever they are called.
But your functions -- and any other objects created in top-level code below `R/` -- are defined exactly once, at build time.
By defining `now` with top-level code below `R/`, we've doomed our package to timestamp all of its output files with the same (wrong) time.
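The effect can be imitated outside of a package. In the sketch below, the top-level assignment stands in for code that runs once at build time; `frozen_stamp()` and `fresh_stamp()` are hypothetical names used only for this illustration.

```r
# Stand-in for top-level package code: evaluated once, result is stored
now <- Sys.time()

# Reuses the stored value on every call
frozen_stamp <- function() now

# Asks the system for the time on every call
fresh_stamp <- function() Sys.time()

first <- frozen_stamp()
Sys.sleep(1)
second <- frozen_stamp()

# The stored value never moves, no matter how much time passes
identical(first, second)
#> [1] TRUE

# The value computed at call time does
fresh_stamp() > first
#> [1] TRUE
```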
The fix is to make sure the `Sys.time()` call happens at runtime.
Let's look again at parts of `R/cleaning-helpers.R`:
```{r}
lookup_table <- dplyr::tribble(
  ~where, ~english,
  "beach", "US",
  "coast", "US",
  "seashore", "UK",
  "seaside", "UK"
)

now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
  paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
There are four top-level `<-` assignments in this excerpt.
The top-level definitions of the data frame `lookup_table` and the functions `timestamp()` and `outfile_path()` are correct.
It is appropriate that these be defined exactly once, at build time.
The top-level definition of `now`, which is then used inside `outfile_path()`, is regrettable.
Here are better versions of `outfile_path()`:
```{r, eval = FALSE}
# always timestamp as "now"
outfile_path <- function(infile) {
  ts <- timestamp(Sys.time())
  paste0(ts, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}

# allow user to provide a time, but default to "now"
outfile_path <- function(infile, time = Sys.time()) {
  ts <- timestamp(time)
  paste0(ts, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
This illustrates that you need to have a different mindset when defining objects inside a package.
The vast majority of those objects should be functions and these functions should generally only use data they create or that is passed via an argument.
There are some types of sloppiness that are fairly harmless when a function is defined immediately before its use, but that can be more costly for functions distributed as a package.
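A side benefit of passing the time as an argument is that the function becomes easy to exercise with a known input. The sketch below repeats the second version of `outfile_path()` so that it runs on its own; the `fixed` instant is an arbitrary value chosen for illustration. Because a default argument is evaluated when the function is called, `time = Sys.time()` still means "now" at run time.

```r
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")

outfile_path <- function(infile, time = Sys.time()) {
  ts <- timestamp(time)
  paste0(ts, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}

# An arbitrary fixed instant with an explicit time zone
fixed <- as.POSIXct("2020-09-04 22:30:00", tz = "UTC")

# Supplying `time` makes the result reproducible
a <- outfile_path("swim.csv", time = fixed)
b <- outfile_path("swim.csv", time = fixed)
identical(a, b)
#> [1] TRUE

# The input file name is rewritten to end in "_clean.csv"
endsWith(a, "_swim_clean.csv")
#> [1] TRUE
```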
## Golf: side effects {#sec-package-within-side-effects}
The timestamps now reflect the current time, but the group raises a new concern.
As it stands, the timestamps reflect who has done the data cleaning and which part of the world they're in.
The heart of the timestamp strategy is this format string[^package-within-5]:
[^package-within-5]: `Sys.time()` returns an object of class `POSIXct`, therefore when we call `format()` on it, we are actually using `format.POSIXct()`.
Read the help for [`?format.POSIXct`](https://rdrr.io/r/base/strptime.html) if you're not familiar with such format strings.
```{r}
format(Sys.time(), "%Y-%B-%d_%H-%M-%S")
```
This formats `Sys.time()` in such a way that it includes the month *name* (not number) and the local time[^package-within-6].
[^package-within-6]: It would clearly be better to format according to ISO 8601, which encodes the month by number, but please humor me for the sake of making this example more obvious.
Here's such a timestamp produced by a few hypothetical colleagues cleaning some data at exactly the same instant in time.
```{r, include = FALSE}
as_if <- function(time = Sys.time(), LC_TIME = NULL, tz = NULL, ...) {
  if (!is.null(LC_TIME)) {
    withr::local_locale(c("LC_TIME" = LC_TIME))
  }
  if (!is.null(tz)) {
    withr::local_timezone(tz)
  }
  format(time, "%Y-%B-%d_%H-%M-%S")
}
colleagues <- tibble::tribble(
  ~location, ~LC_TIME, ~tz,
  "Rome, Italy", "it_IT.UTF-8", "Europe/Rome",
  "Warsaw, Poland", "pl_PL.UTF-8", "Europe/Warsaw",
  "Sao Paulo, Brazil", "pt_BR.UTF-8", "America/Sao_Paulo",
  "Greenwich, England", "en_GB.UTF-8", "Europe/London",
  "\"Computer World!\"", "C", "UTC"
)
# I don't want the instant to change every time we render the book, so NO:
# now <- Sys.time()
# I don't want the fixed instant to have an explicit tzone attribute, so NO:
# now <- as.POSIXct("2020-09-04 22:30:00", tz = "UTC")
# I want a specific, fixed instant, with no explicit tzone
instant <- as.POSIXct(1599258600, origin = "1970-01-01")
colleagues <- colleagues %>%
  mutate(timestamp = pmap_chr(., as_if, time = instant), .after = location)
```
```{r, echo = FALSE, results = "asis"}
knitr::kable(colleagues)
```
We see that the month names vary, as does the time, and even the date!
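The time zone part of this effect can be reproduced with base R alone. The sketch below formats one fixed instant (1599258600 seconds after the Unix epoch, i.e. 2020-09-04 22:30:00 UTC) for three time zones. It uses a numeric month (`%m`) so that the result does not depend on the locale, and it assumes the time zone database is available on your system.

```r
# One fixed instant: 2020-09-04 22:30:00 UTC
instant <- as.POSIXct(1599258600, origin = "1970-01-01", tz = "UTC")

# Numeric month (%m) sidesteps locale-specific month names
fmt <- "%Y-%m-%d %H:%M"

format(instant, fmt, tz = "UTC")
#> [1] "2020-09-04 22:30"
format(instant, fmt, tz = "America/Sao_Paulo")
#> [1] "2020-09-04 19:30"
format(instant, fmt, tz = "Europe/Rome")
#> [1] "2020-09-05 00:30"
```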
The safest choice is to form timestamps with respect to a fixed locale and time zone (presumably the non-geographic choices represented by "Computer World!" above).
You do some research and learn that you can force a certain locale via `Sys.setlocale()` and force a certain time zone by setting the TZ environment variable.
Specifically, we set the LC_TIME component of the locale to "C" and the time zone to "UTC" (Coordinated Universal Time).
Here's your first attempt to improve `timestamp()`:
```{r include = FALSE}
lc_time <- Sys.getlocale("LC_TIME")
tz <- Sys.getenv("TZ")
```
```{r}
timestamp <- function(time = Sys.time()) {
  Sys.setlocale("LC_TIME", "C")
  Sys.setenv(TZ = "UTC")
  format(time, "%Y-%B-%d_%H-%M-%S")
}
```
But your Brazilian colleague notices that datetimes print differently, before and after she uses `outfile_path()` from your package:
Before:
```{r eval = FALSE}
format(Sys.time(), "%Y-%B-%d_%H-%M-%S")
```
```{r echo = FALSE}
as_if(LC_TIME = "pt_BR.UTF-8", tz = "America/Sao_Paulo")
```
After:
```{r}
outfile_path("INFILE.csv")
format(Sys.time(), "%Y-%B-%d_%H-%M-%S")
```
```{r include = FALSE}
Sys.setlocale("LC_TIME", lc_time)
if (tz == "") {
  Sys.unsetenv("TZ")
} else {
  Sys.setenv(TZ = tz)
}
```
Notice that her month name switched from Portuguese to English and the time is clearly being reported in a different time zone.
Our calls to `Sys.setlocale()` and `Sys.setenv()` inside `timestamp()` have made persistent (and very surprising) changes to her R session.
This sort of side effect is very undesirable and is extremely difficult to track down and debug, especially in more complicated settings.
Here are better versions of `timestamp()`:
```{r, eval = FALSE}
# use withr::local_*() functions to keep the changes local to timestamp()
timestamp <- function(time = Sys.time()) {
  withr::local_locale(c("LC_TIME" = "C"))
  withr::local_timezone("UTC")
  format(time, "%Y-%B-%d_%H-%M-%S")
}

# use the tz argument to format.POSIXct()
timestamp <- function(time = Sys.time()) {
  withr::local_locale(c("LC_TIME" = "C"))
  format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
}

# put the format() call inside withr::with_*()
timestamp <- function(time = Sys.time()) {
  withr::with_locale(
    c("LC_TIME" = "C"),
    format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
  )
}
```
We show various methods to limit the scope of our changes to LC_TIME and the timezone.
A good rule of thumb is to make the scope of such changes as narrow as is possible and practical.
The `tz` argument of `format()` is the most surgical way to deal with the timezone, but nothing similar exists for LC_TIME.
We make the temporary locale modification using the withr package, which provides a very flexible toolkit for temporary state changes.
Both withr and `base::on.exit()` are discussed further in @sec-code-r-landscape.
This underscores a point from the previous section: you need to adopt a different mindset when defining functions inside a package.
Try to avoid making any changes to the user's overall state.
If such changes are unavoidable, make sure to reverse them (if possible) or to document them explicitly (if related to the function's primary purpose).
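For readers who prefer to avoid an extra dependency, the same "change, then restore" pattern can be sketched with `base::on.exit()`. This is only one possible variant under the assumptions of this chapter, not a replacement for the withr versions above. It restores LC_TIME when the function exits and uses the `tz` argument of `format()` so that the TZ environment variable is never touched.

```r
timestamp <- function(time = Sys.time()) {
  # Remember the caller's LC_TIME and restore it when the function exits,
  # even if format() throws an error
  old_lc_time <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old_lc_time), add = TRUE)

  Sys.setlocale("LC_TIME", "C")
  format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
}

before <- Sys.getlocale("LC_TIME")
timestamp(as.POSIXct(1599258600, origin = "1970-01-01", tz = "UTC"))
#> [1] "2020-September-04_22-30-00"

# The session's LC_TIME is the same as before the call
identical(before, Sys.getlocale("LC_TIME"))
#> [1] TRUE
```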
```{r include = FALSE}
fs::file_delete(c("beach-lookup-table.csv", "swim.csv"))
```
```{=html}
<!--
Loose ends and Unimplemented ideas
Inconsistent w.r.t. voice: is it "they" or "you" or "we" who's writing this package?
Another data cleaning idea was to deal with dysfunctional missing value codes, e.g. where -99 means temp is missing.
Could also show using the degree symbol properly with unicode escape sequence.
-->
```
```{=html}
<!--
Text to possibly reuse?
As your use of R becomes more sophisticated, it's common to start to write your own R functions.
If a function is only used in one place, you probably define it right there.
But if you've bothered to write a function, it's likely you want to reuse it in multiple places: within one script, across multiple scripts, or even across multiple projects.
This is *exactly* what an R package is for!
Without package technology, you probably collect these function definitions in one or more dedicated `.R` files and then `source()` them as needed.
Typically these functions co-evolve with the code where you use them, i.e. your analysis code, your Shiny apps, or your R Markdown reports.
If you use these functions across multiple projects, you also face the uncomfortable dilemma of where to define them and whether to have multiple, slightly different copies of this code lying around.
-->
```