10_Function_factories.Rmd

```{r, include = FALSE}
source("common.R")
```

# Function factories
<!-- 10 -->

## Prerequisites {-}
<!-- 10.0 -->

For most of this chapter base R [@RLanguage] is sufficient. Just a few exercises require the `{rlang}` [@rlang], `{dplyr}` [@dplyr], `{purrr}`  [@purrr] and `{ggplot2}` [@ggplot2] packages.

```{r, message = FALSE}
library(rlang)
library(dplyr)
library(purrr)
library(ggplot2)
```

\stepcounter{section}
## Factory fundamentals
<!-- 10.2 -->

__[Q1]{.Q}__: The definition of `force()` is simple:

```{r}
force
```

Why is it better to `force(x)` instead of just `x`?

__[A]{.solved}__: As you can see `force(x)` is similar to `x`. As mentioned in *Advanced R*, we prefer this explicit form, because

> using this function clearly indicates that you’re forcing evaluation, not that you’ve accidentally typed `x`."

__[Q2]{.Q}__: Base R contains two function factories, `approxfun()` and `ecdf()`. Read their documentation and experiment to figure out what the functions do and what they return.

__[A]{.solved}__: Let's begin with `approxfun()` as it is used within `ecdf()` as well:

`approxfun()` takes a combination of data points (x and y values) as input and returns a stepwise linear (or constant) interpolation function. To find out what this means exactly, we first create a few random data points.

```{r}
x <- runif(10)
y <- runif(10)
plot(x, y, lwd = 10)
```

Next, we use `approxfun()` to construct the linear and constant interpolation functions for our `x` and `y` values.

```{r}
f_lin <- approxfun(x, y)
f_con <- approxfun(x, y, method = "constant")

# Both functions exactly reproduce their input y values
identical(f_lin(x), y)
identical(f_con(x), y)
```

When we apply these functions to new x values, these are mapped to the lines connecting the initial y values (linear case) or to the same y value as for the next smallest initial x value (constant case).

```{r}
x_new <- runif(1000)

plot(x, y, lwd = 10)
points(x_new, f_lin(x_new), col = "cornflowerblue", pch = 16)
points(x_new, f_con(x_new), col = "firebrick", pch = 16)
```

However, both functions are only defined within `range(x)`.

```{r}
f_lin(range(x))
f_con(range(x))

(eps <- .Machine$double.neg.eps)

f_lin(c(min(x) - eps, max(x) + eps))
f_con(c(min(x) - eps, max(x) + eps))
```

To change this behaviour, one can set `rule = 2`. This leads to the result that for values outside of `range(x)` the boundary values of the function are returned.

```{r}
f_lin <- approxfun(x, y, rule = 2)
f_con <- approxfun(x, y, method = "constant", rule = 2)

f_lin(c(-Inf, Inf))
f_con(c(-Inf, Inf))
```

Another option is to customise the return values as individual constants for each side via `yleft` and/or `yright`.

```{r}
f_lin <- approxfun(x, y, yleft = 5)
f_con <- approxfun(x, y, method = "constant", yleft = 5, yright = -5)

f_lin(c(-Inf, Inf))
f_con(c(-Inf, Inf))
```

Further, `approxfun()` provides the option to shift the y values for `method = "constant"` between their left and right values. According to the documentation this indicates a compromise between left- and right-continuous steps.

```{r}
f_con <- approxfun(x, y, method = "constant", f = .5)

plot(x, y, lwd = 10)
points(x_new, f_con(x_new), pch = 16)
```

Finally, the `ties` argument allows to aggregate y values if multiple ones were provided for the same x value. For example, in the following line we use `mean()` to aggregate these y values before they are used for the interpolation `approxfun(x = c(1,1,2), y = 1:3, ties = mean)`.

Next, we focus on `ecdf()`. "ecdf" is an acronym for empirical cumulative distribution function. For a numeric vector of density values, `ecdf()` initially creates the (x, y) pairs for the nodes of the density function and then passes these pairs to `approxfun()`, which gets called with specifically adapted settings (`approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")`). 

```{r}
x <- runif(10)
f_ecdf <- ecdf(x)
class(f_ecdf)

plot(x, f_ecdf(x), lwd = 10, ylim = 0:1)
```

New values are then mapped on the y value of the next smallest x value from within the initial input.

```{r}
x_new <- runif(1000)

plot(x, f_ecdf(x), lwd = 10, ylim = 0:1)
points(x_new, f_ecdf(x_new), ylim = 0:1)
```

__[Q3]{.Q}__: Create a function `pick()` that takes an index, `i`, as an argument and returns a function with an argument `x` that subsets `x` with `i`.

```{r, eval = FALSE}
pick(1)(x)
# should be equivalent to
x[[1]]

lapply(mtcars, pick(5))
# should be equivalent to
lapply(mtcars, function(x) x[[5]])
```

__[A]{.solved}__: In this exercise `pick(i)` acts as a function factory, which returns the required subsetting function.

```{r}
pick <- function(i) {
  force(i)
  
  function(x) x[[i]]
}

x <- 1:3
identical(x[[1]], pick(1)(x))
identical(
  lapply(mtcars, function(x) x[[5]]),
  lapply(mtcars, pick(5))
)
```

__[Q4]{.Q}__: Create a function that creates functions that compute the i^th^ [central moment](http://en.wikipedia.org/wiki/Central_moment) of a numeric vector. You can test it by running the following code:

```{r, eval = FALSE}
m1 <- moment(1)
m2 <- moment(2)

x <- runif(100)
stopifnot(all.equal(m1(x), 0))
stopifnot(all.equal(m2(x), var(x) * 99 / 100))
```

__[A]{.solved}__: The first moment is closely related to the mean and describes the average deviation from the mean, which is 0 (within numerical margin of error). The second moment describes the variance of the input data. If we want to compare it to `var()`, we need to undo [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) by multiplying with $\frac{N-1}{N}$.

```{r}
moment <- function(i) {
  force(i)
  
  function(x) sum((x - mean(x)) ^ i) / length(x)
}

m1 <- moment(1)
m2 <- moment(2)

x <- runif(100)
all.equal(m1(x), 0)  # removed stopifnot() for clarity
all.equal(m2(x), var(x) * 99 / 100)
```

__[Q5]{.Q}__: What happens if you don't use a closure? Make predictions, then verify with the code below.

```{r}
i <- 0
new_counter2 <- function() {
  i <<- i + 1
  i
}
```

__[A]{.solved}__: Without the captured and encapsulated environment of a closure the counts will be stored in the global environment. Here they can be overwritten or deleted as well as interfere with other counters.

```{r, error = TRUE}
new_counter2()
i
new_counter2()
i

i <- 0
new_counter2()
i
```

__[Q6]{.Q}__: What happens if you use `<-` instead of `<<-`? Make predictions, then verify with the code below.

```{r}
new_counter3 <- function() {
  i <- 0
  function() {
    i <- i + 1
    i
  }
}
```

__[A]{.solved}__: Without the super assignment `<<-`, the counter will always return 1. The counter always starts in a new execution environment within the same enclosing environment, which contains an unchanged value for `i` (in this case it remains 0).

```{r}
new_counter_3 <- new_counter3()

new_counter_3()
new_counter_3()
```

## Graphical factories
<!-- 10.3 -->

__[Q1]{.Q}__: Compare and contrast `ggplot2::label_bquote()` with `scales::number_format()`.

__[A]{.solved}__: Both functions will help you in styling your output, e.g. in your plots and they do this by returning the desired formatting function to you.

`ggplot2::label_bquote()` takes relatively straightforward [plotmath](https://stat.ethz.ch/R-manual/R-patched/library/grDevices/html/plotmath.html) expressions and uses them for faceting labels in `{ggplot2}`. Because this function is used in `{ggplot2}` it needs to return a function of `class = "labeller"`.

`scales::number_format()` initially `force()`s the computation of all parameters. It's essentially a parametrised wrapper around `scales::number()` and will help you format numbers appropriately. It will return a simple function.

## Statistical factories
<!-- 10.4 -->

__[Q1]{.Q}__: In `boot_model()`, why don't I need to force the evaluation of `df` or `model`?

__[A]{.solved}__: `boot_model()` ultimately returns a function, and whenever you return a function you need to make sure all the inputs are explicitly evaluated. Here that happens automatically because we use `df` and `formula` in `lm()` before returning the function.

```{r}
boot_model <- function(df, formula) {
  mod <- lm(formula, data = df)
  fitted <- unname(fitted(mod))
  resid <- unname(resid(mod))
  rm(mod)
  
  function() {
    fitted + sample(resid)
  }
} 
```

__[Q2]{.Q}__: Why might you formulate the Box-Cox transformation like this?

```{r}
boxcox3 <- function(x) {
  function(lambda) {
    if (lambda == 0) {
      log(x)
    } else {
      (x ^ lambda - 1) / lambda
    }
  }  
}
```

__[A]{.solved}__: `boxcox3()` returns a function where `x` is fixed (though it is not forced, so it may be manipulated later). This allows us to apply and test different transformations for different inputs and give them a descriptive name.

```{r, out.width = "49%", fig.show = "hold"}
boxcox_airpassengers <- boxcox3(AirPassengers)

plot(boxcox_airpassengers(0))
plot(boxcox_airpassengers(1))
plot(boxcox_airpassengers(2))
plot(boxcox_airpassengers(3))
```

__[Q3]{.Q}__: Why don't you need to worry that `boot_permute()` stores a copy of the data inside the function that it generates?

__[A]{.solved}__: `boot_permute()` is defined in *Advanced R* as:

```{r}
boot_permute <- function(df, var) {
  n <- nrow(df)
  force(var)
  
  function() {
    col <- df[[var]]
    col[sample(n, replace = TRUE)]
  }
}
```

We don't need to worry that it stores a copy of the data, because it actually doesn't store one; it's just a name that points to the same underlying object in memory.

```{r}
boot_mtcars1 <- boot_permute(mtcars, "mpg")

lobstr::obj_size(mtcars)
lobstr::obj_size(boot_mtcars1)
lobstr::obj_sizes(mtcars, boot_mtcars1)
```

__[Q4]{.Q}__: How much time does `ll_poisson2()` save compared to `ll_poisson1()`? Use `bench::mark()` to see how much faster the optimisation occurs. How does changing the length of `x` change the results?

__[A]{.solved}__: Let us recall the definitions of `ll_poisson1()`, `ll_poisson2()` and the test data `x1`:

```{r}
ll_poisson1 <- function(x) {
  n <- length(x)
  
  function(lambda) {
    log(lambda) * sum(x) - n * lambda - sum(lfactorial(x))
  }
}

ll_poisson2 <- function(x) {
  n <- length(x)
  sum_x <- sum(x)
  c <- sum(lfactorial(x))
  
  function(lambda) {
    log(lambda) * sum_x - n * lambda - c
  }
}

x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)
```

A benchmark on `x1` reveals a performance improvement of factor 2 for `ll_poisson2()` over `ll_poisson1()`:

```{r}
bench::mark(
  llp1 = optimise(ll_poisson1(x1), c(0, 100), maximum = TRUE),
  llp2 = optimise(ll_poisson2(x1), c(0, 100), maximum = TRUE)
)
```

As the redundant calculations within `ll_poisson1()` become more expensive with growing length of `x1`, we expect even further relative performance improvements for `ll_poisson2()`. The following benchmark reveals a relative performance improvement of factor 20 for `ll_poisson2()` when `x1` is of length 100,000:

```{r, message = FALSE, warning = FALSE}
bench_poisson <- function(x_length) {
  x <- rpois(x_length, 100L)
  
  bench::mark(
    llp1 = optimise(ll_poisson1(x), c(0, 100), maximum = TRUE),
    llp2 = optimise(ll_poisson2(x), c(0, 100), maximum = TRUE),
    time_unit = "ms"
  )
}

performances <- map_dfr(10^(1:5), bench_poisson)

df_perf <- tibble(
  x_length = rep(10^(1:5), each = 2),
  method   = rep(attr(performances$expression, "description"), 5),
  median   = performances$median
)

ggplot(df_perf, aes(x_length, median, col = method)) +
  geom_point(size = 2) +
  geom_line(linetype = 2) +
  scale_x_log10() +
  labs(
    x = "Length of x",
    y = "Execution Time (ms)",
    color = "Method"
  ) +
  theme(legend.position = "top")
```

## Function factories + functionals
<!-- 10.5 -->

__[Q1]{.Q}__: Which of the following commands is equivalent to `with(x, f(z))`?

(a) `x$f(x$z)`.
(b) `f(x$z)`.
(c) `x$f(z)`.
(d) `f(z)`.
(e) It depends.

__[A]{.solved}__: (e) "It depends" is the correct answer. Usually `with()` is used with a data frame, so you'd usually expect (b), but if `x` is a list, it could be any of the options.

```{r}
f <- mean
z <- 1
x <- list(f = mean, z = 1)

identical(with(x, f(z)), x$f(x$z))
identical(with(x, f(z)), f(x$z))
identical(with(x, f(z)), x$f(z))
identical(with(x, f(z)), f(z))
```

__[Q2]{.Q}__: Compare and contrast the effects of `env_bind()` vs. `attach()` for the following code.

```{r}
funs <- list(
  mean = function(x) mean(x, na.rm = TRUE),
  sum = function(x) sum(x, na.rm = TRUE)
)

attach(funs)
mean <- function(x) stop("Hi!")
detach(funs)

env_bind(globalenv(), !!!funs)
mean <- function(x) stop("Hi!") 
env_unbind(globalenv(), names(funs))
```

__[A]{.solved}__: `attach()` adds `funs` to the search path. Therefore, the provided functions are found before their respective versions from the `{base}` package. Further, they cannot get accidentally overwritten by similar named functions in the global environment. One annoying downside of using `attach()` is the possibility to attach the same object multiple times, making it necessary to call `detach()` equally often.

```{r}
attach(funs)
attach(funs)

head(search())
detach(funs)
detach(funs)
```

In contrast `rlang::env_bind()` just adds the functions in `fun` to the global environment. No further side effects are introduced, and the functions are overwritten when similarly named functions are defined.

```{r}
env_bind(globalenv(), !!!funs)
head(search())
```