diff --git a/tutorial.R b/tutorial.R index 90f6a1c..077ffd9 100644 --- a/tutorial.R +++ b/tutorial.R @@ -122,11 +122,11 @@ x <- 1.9999999999999; x; x-2 x <- 1.99999999999999999; x; x-2 a <- 1; b <- 3; -c <- a < b -d <- (a > b) -c; d +c <- a < b; c +d <- (a > b); d -x <- 1:5; b <- (x<=3); b +x <- 1:5 +b <- (x<=3); b a=1:3 b=2:4 @@ -137,12 +137,12 @@ a==b a <- c(1,2,3,4) b <- c(1,1,5,5) (a3) -(a3) +## a3) course.url <- "https://kingaa.github.io/R_Tutorial/" X <- read.csv(paste0(course.url,"data/ChlorellaGrowth.csv"),comment.char='#') Light <- X[,1] -rmax <- X[,2]; +rmax <- X[,2] lowLight <- Light[Light<50] lowLightrmax <- rmax[Light<50] lowLight @@ -190,7 +190,8 @@ y <- 1:5; y z <- array(1:5,dim=5); z y==z identical(y,z) -dim(y); dim(z) +dim(y) +dim(z) ## x <- seq(1,27) ## dim(x) <- c(3,9) diff --git a/tutorial.Rmd b/tutorial.Rmd index b591d25..68aae14 100644 --- a/tutorial.Rmd +++ b/tutorial.Rmd @@ -795,9 +795,8 @@ For example, try: ```{r } a <- 1; b <- 3; -c <- a < b -d <- (a > b) -c; d +c <- a < b; c +d <- (a > b); d ``` The parentheses around `a > b` above are optional but do make the code easier to read. @@ -825,7 +824,8 @@ Table: Some comparison operators in **R**. Use `?Comparison` to learn more. When we compare two vectors or matrices, comparisons are done element-by-element (and the recycling rule applies). For example, ```{r } -x <- 1:5; b <- (x<=3); b +x <- 1:5 +b <- (x<=3); b ``` So if `x` and `y` are vectors, then `(x==y)` will return a vector of values giving the element-by-element comparisons. If you want to know whether `x` and `y` are identical vectors, use `identical(x,y)` or `all.equal(x,y)`. @@ -862,14 +862,15 @@ Table: Logical operators. ----------------------------------- -For example, try -```{r warning=TRUE,error=TRUE} +For example, execute the following and make sure you understand what happens. +```{r warning=TRUE} a <- c(1,2,3,4) b <- c(1,1,5,5) (a3) -(a3) ``` -and make sure you understand what happened. +```{r eval=FALSE} +a3) +``` The two forms of logical OR (`|`and `||`) are *inclusive*, meaning that `x|y` is true if either `x` or `y` or both are true. Use `xor` when exclusive OR is needed. @@ -887,7 +888,7 @@ As a simple example, we might want to focus on just the low-light values of $r_{ course.url <- "https://kingaa.github.io/R_Tutorial/" X <- read.csv(paste0(course.url,"data/ChlorellaGrowth.csv"),comment.char='#') Light <- X[,1] -rmax <- X[,2]; +rmax <- X[,2] ``` ```{r } lowLight <- Light[Light<50] @@ -1110,7 +1111,8 @@ y <- 1:5; y z <- array(1:5,dim=5); z y==z identical(y,z) -dim(y); dim(z) +dim(y) +dim(z) ``` ----------------------------------- diff --git a/tutorial.html b/tutorial.html index e0db8b7..5d6e61e 100644 --- a/tutorial.html +++ b/tutorial.html @@ -2867,7 +2867,7 @@

Aaron A. King, Stu Field, Ben Bolker, Steve Ellner

How to use this document

These notes contain many sample calculations. It is important to do these yourself—type them in at your keyboard and see what happens on your screen—to get the feel of working in R.

Exercises in the middle of a section should be done immediately when you get to them, and make sure you have them right before moving on.

-

For your convenience, the R codes for this document are provided in a script which you can download, edit, and run.

+

For your convenience, the R codes for this document are provided in a script which you can download, edit, and run.

Many other similar introductions are scattered around the web; a partial list is in the “contributed documentation” section on the R web site (https://cran.r-project.org/other-docs.html). This particular version is limited (it has similar coverage to the standard Introduction to R manual and targets biologists who are neither programmers nor statisticians (yet).

Throughout this document, Windows-specific items are marked with a Ⓦ and *nix (linux/unix/MacOSX)-specific items with a Ⓧ.

@@ -3621,13 +3621,12 @@
Exercise

Logical operations

Some operations return a logical value (i.e., TRUE or FALSE). For example, try:

a <- 1; b <- 3; 
-c <- a < b
-d <- (a > b)
-c; d
+c <- a < b; c

output

[1] TRUE
+
d <- (a > b); d

output

[1] FALSE
@@ -3679,21 +3678,22 @@

Logical operations


When we compare two vectors or matrices, comparisons are done element-by-element (and the recycling rule applies). For example,

-
x <- 1:5; b <- (x<=3); b
+
x <- 1:5
+b <- (x<=3); b

output

[1]  TRUE  TRUE  TRUE FALSE FALSE

So if x and y are vectors, then (x==y) will return a vector of values giving the element-by-element comparisons. If you want to know whether x and y are identical vectors, use identical(x,y) or all.equal(x,y). You can use ?Logical to read more about logical operations. Note the difference between = and ==. Can you figure out what happened in the following cautionary tale?

-
a=1:3
-b=2:4
-a==b
+
a=1:3
+b=2:4
+a==b

output

[1] FALSE FALSE FALSE
-
a=b
-a==b
+
a=b
+a==b

output

[1] TRUE TRUE TRUE
@@ -3737,17 +3737,15 @@

Logical operations


-

For example, try

-
a <- c(1,2,3,4)
-b <- c(1,1,5,5)
-(a<b) | (a>3)
+

For example, execute the following and make sure you understand what happens.

+
a <- c(1,2,3,4)
+b <- c(1,1,5,5)
+(a<b) | (a>3)

output

[1] FALSE FALSE  TRUE  TRUE
-
(a<b) || (a>3)
-
Error in (a < b) || (a > 3): 'length = 4' in coercion to 'logical(1)'
-

and make sure you understand what happened.

+
a<b) || (a>3)

The two forms of logical OR (|and ||) are inclusive, meaning that x|y is true if either x or y or both are true. Use xor when exclusive OR is needed. The two forms of AND and OR differ in how they handle vectors. The shorter one (|, &) does element-by-element comparisons; the longer one (||, &&) looks only at the first element in each vector.

@@ -4061,11 +4059,12 @@

Arrays

output

[1] FALSE
-
dim(y); dim(z)
+
dim(y)

output

NULL
+
dim(z)

output

[1] 5
@@ -4074,10 +4073,10 @@

Arrays

Exercise

What happens when we set the dimension attribute on a vector? For example:

-
x <- seq(1,27)
-dim(x) <- c(3,9)
-is.array(x)
-is.matrix(x)
+
x <- seq(1,27)
+dim(x) <- c(3,9)
+is.array(x)
+is.matrix(x)

Make sure you understand the above.


@@ -4085,21 +4084,21 @@
Exercise

Factors

For dealing with measurements on the nominal and ordinal scales (Stevens 1946), R provides vectors of type factor. A factor is a variable that can take one of a finite number of distinct levels. To construct a factor, we can apply the factor function to a vector of any class:

-
x <- rep(c(1,2),each=3); factor(x)
+
x <- rep(c(1,2),each=3); factor(x)

output

[1] 1 1 1 2 2 2
 Levels: 1 2
-
trochee <- c("jetpack","ferret","pizza","lawyer")
-trochee <- factor(trochee); trochee
+
trochee <- c("jetpack","ferret","pizza","lawyer")
+trochee <- factor(trochee); trochee

output

[1] jetpack ferret  pizza   lawyer 
 Levels: ferret jetpack lawyer pizza

By default, factor sets the levels to the unique set of values taken by the vector. To modify that behavior, there is the levels argument:

-
factor(trochee,levels=c("ferret","pizza","cowboy","scrapple"))
+
factor(trochee,levels=c("ferret","pizza","cowboy","scrapple"))

output

[1] <NA>   ferret pizza  <NA>  
@@ -4107,15 +4106,15 @@ 

Factors

Note that the order of the levels is arbitrary, in keeping with the fact that the only operation permissible on the nominal scale is the test for equality. In particular, the factors created with the factor command are un-ordered: there is no sense in which we can ask whether, e.g., ferret < cowboy.

To represent variables measured on the ordinal scale, R provides ordered factors, constructed via the ordered function. An ordered factor is just like an un-ordered factor except that the order of the levels matters:

-
x <- ordered(sample(x=letters,size=22,replace=TRUE)); x
+
x <- ordered(sample(x=letters,size=22,replace=TRUE)); x

output

 [1] z m u p a g c d b n x d g n w l c t z p b x
 14 Levels: a < b < c < d < g < l < m < n < p < t < ... < z

Here, we’ve relied on ordered’s default behavior, which is to put the levels in alphabetical order. It’s typically safer to specify explicitly what order we want:

-
x <- ordered(x,levels=rev(letters))
-x[1:5] < x[18:22]
+
x <- ordered(x,levels=rev(letters))
+x[1:5] < x[18:22]

output

[1]  TRUE FALSE  TRUE  TRUE FALSE
@@ -4140,9 +4139,9 @@
Exercise

Lists

While vectors and matrices may seem familiar, lists may be new to you. Vectors and matrices have to contain elements that are all the same type: lists in R can contain anything—vectors, matrices, other lists, arbitrary objects. Indexing is a little different too: use [[ ]] to extract an element of a list by number or name or $ to extract an element by name (only). Given a list like this:

-
L <- list(A=x,B=trochee,C=c("a","b","c"))
+
L <- list(A=x,B=trochee,C=c("a","b","c"))

Then L$A, L[["A"]], and L[[1]] will each return the first element of the list. To extract a sublist, use the ordinary single square brackets: []:

-
L[c("B","C")]
+
L[c("B","C")]

output

$B
@@ -4159,9 +4158,9 @@ 

Data frames

Data frames are a hybrid of lists and vectors. Internally, they are a list of vectors which can be of different types but must all be the same length. However, they behave somewhat like matrices, in that you can do most things to them that you can do with matrices. You can index them either the way you would index a list, using [[ ]] or $—where each variable is a different item in the list—or the way you would index a matrix.
You can turn a data frame into a matrix (using as.matrix(), but only if all variables are of the same class) and a matrix into a data frame (using as.data.frame()).

When data are read into R from an external file using one of the read.xxx commands (read.csv, read.table, read.xls, etc.), the object that is created is a data frame.

-
data.url <- "https://kingaa.github.io/R_Tutorial/data/ChlorellaGrowth.csv"
-dat <- read.csv(data.url,comment.char='#')
-dat
+
data.url <- "https://kingaa.github.io/R_Tutorial/data/ChlorellaGrowth.csv"
+dat <- read.csv(data.url,comment.char='#')
+dat

output

 light rmax
@@ -4414,19 +4413,19 @@ 

Scripts and data files

  • If you send script files by e-mail, even if you are careful to send them as plain text, lines will occasionally get broken in different places, which can lead to confusion. Beware.

  • As a first example, the file Intro1.R has the commands from the interactive regression analysis. Download Intro1.R and save it to your computer:

    -
    course.url <- "https://kingaa.github.io/R_Tutorial/"
    -download.file(paste0(course.url,"Intro1.R"),destfile="Intro1.R",mode="w")
    +
    course.url <- "https://kingaa.github.io/R_Tutorial/"
    +download.file(paste0(course.url,"Intro1.R"),destfile="Intro1.R",mode="w")

    Open your copy of Intro1.R. In your editor, select and Copy the entire text of the file, and then Paste the text into the R console window. This has the same effect as entering the commands by hand into the console: they will be executed and so a graph is displayed with the results. Cut-and-Paste allows you to execute script files one piece at a time (which is useful for finding and fixing errors). The source function allows you to run an entire script file, e.g.,

    -
    source("Intro1.R")
    +
    source("Intro1.R")

    Another important time-saver is loading data from a text file. Grab copies of Intro2.R and ChlorellaGrowth.csv from the dropsite to see how this is done.

    -
    download.file(paste0(course.url,"Intro2.R"),destfile="Intro2.R",mode="w")
    -download.file(paste0(course.url,"data/ChlorellaGrowth.csv"),destfile="ChlorellaGrowth.csv",mode="w")
    +
    download.file(paste0(course.url,"Intro2.R"),destfile="Intro2.R",mode="w")
    +download.file(paste0(course.url,"data/ChlorellaGrowth.csv"),destfile="ChlorellaGrowth.csv",mode="w")

    In ChlorellaGrowth.csv the two variables are entered as columns of a data matrix. Then instead of typing these in by hand, the command

    -
    X <- read.csv("ChlorellaGrowth.csv",comment.char='#')
    +
    X <- read.csv("ChlorellaGrowth.csv",comment.char='#')

    reads the file (from the current directory) and puts the data values into the variable X. Note that as specified above you need to make sure that R is looking for the data file in the right place: make sure to download the data file into your current working directory. Note also the comment.char option in the call to read.csv; this tells R to ignore lines that begin with a # and allows us to use self-documenting data files (a very good practice).

    Extract the variables from X with the commands

    -
    Light <- X[,1]
    -rmax <- X[,2]
    +
    Light <- X[,1]
    +rmax <- X[,2]

    Think of these as shorthand for “Light = everything in column 1 of X”, and “rmax = everything in column 2 of X” (we’ll learn about working with data frames later). From there on out it’s the same as before, with some additions that set the axis labels and add a title.


    @@ -4474,27 +4473,27 @@

    Looping in R

    for loops

    Execute the following code.

    -
    phi <- 1
    -for (k in 1:100) {
    -  phi <- 1+1/phi
    -  print(c(k,phi))
    -}
    +
    phi <- 1
    +for (k in 1:100) {
    +  phi <- 1+1/phi
    +  print(c(k,phi))
    +}

    What does it do? Sequentially, for each value of k between 1 and 100, phi is modified. More specifically, at the beginning of the for loop, a vector containing all the integers from 1 to 100 in order is created. Then, k is set to the first element in that vector, i.e., 1. Then the R expression from the { to the } is evaluated. When that expression has been evaluated, k is set to the next value in the vector. The process is repeated until, at the last evaluation, k has value 100.

    As an aside, note that the final value of phi is the Golden Ratio, 1.618034.

    As an example of a situation where a loop of some sort is really needed, suppose we wish to iterate the Beverton-Holt map (one of the simplest discrete-time models of population growth), \[\begin{equation*} N_{t+1} = \frac{a\,N_t}{1+b\,N_t}. \end{equation*}\] We simply have no option but to do the calculation one step at a time. Here’s an R code that does this

    -
    a <- 1.1
    -b <- 0.001
    -T <- seq(from=1,to=200,by=1)
    -N <- numeric(length(T))
    -n <- 2
    -for (t in T) {
    -  n <- a*n/(1+b*n)
    -  N[t] <- n
    -}
    +
    a <- 1.1
    +b <- 0.001
    +T <- seq(from=1,to=200,by=1)
    +N <- numeric(length(T))
    +n <- 2
    +for (t in T) {
    +  n <- a*n/(1+b*n)
    +  N[t] <- n
    +}

    Spend some time to make sure you understand what happens at each line of the above. We can plot the population sizes \(N_t\) through time via

    -
    plot(T,N)
    +
    plot(T,N)

    plot

    @@ -4502,11 +4501,11 @@

    for loops

    Gotcha:

    An alternative way to do the above might be something like

    -
    N <- numeric(length(T))
    -for (t in 1:length(T)) {
    -  n <- a*n/(1+b*n)
    -  N[t] <- n
    -}
    +
    N <- numeric(length(T))
    +for (t in 1:length(T)) {
    +  n <- a*n/(1+b*n)
    +  N[t] <- n
    +}

    @@ -4518,41 +4517,41 @@
    Exercise

    while loops

    A second looping construct is the while loop. Using while, we can compute the Golden Ratio as before:

    -
    phi <- 20
    -k <- 1
    -while (k <= 100) {
    -  phi <- 1+1/phi
    -  print(c(k,phi))
    -  k <- k+1
    -}
    -

    What’s going on here? First, phi and k are initialized. Then the while loop is started. At each iteration of the loop, phi is modified, and intermediate results printed, as before. In addition, k is incremented. The while loop continues to iterate until the condition k <= 100 is no longer TRUE, at which point, the while loop terminates.

    -

    Note that here we’ve chosen a large number (100) of iterations. Perhaps we could get by with fewer. If we wanted to terminate the iterations as soon as the value of phi stopped changing, we could do:

    phi <- 20
    -conv <- FALSE
    -while (!conv) {
    -  phi.new <- 1+1/phi
    -  conv <- phi==phi.new
    -  phi <- phi.new
    +k <- 1
    +while (k <= 100) {
    +  phi <- 1+1/phi
    +  print(c(k,phi))
    +  k <- k+1
     }
    +

    What’s going on here? First, phi and k are initialized. Then the while loop is started. At each iteration of the loop, phi is modified, and intermediate results printed, as before. In addition, k is incremented. The while loop continues to iterate until the condition k <= 100 is no longer TRUE, at which point, the while loop terminates.

    +

    Note that here we’ve chosen a large number (100) of iterations. Perhaps we could get by with fewer. If we wanted to terminate the iterations as soon as the value of phi stopped changing, we could do:

    +
    phi <- 20
    +conv <- FALSE
    +while (!conv) {
    +  phi.new <- 1+1/phi
    +  conv <- phi==phi.new
    +  phi <- phi.new
    +}

    Exercise

    Verify that the above works as intended. How many iterations are needed?


    Another way to accomplish this would be to use break to stop the iteration when a condition is met. For example

    -
    phi <- 20
    -while (TRUE) {
    -  phi.new <- 1+1/phi
    -  if (phi==phi.new) break
    -  phi <- phi.new
    -}
    -

    While this while loop is equivalent to the one before, it does have the drawback that, if the break condition is never met, the loop will go on indefinitely. An alternative that avoids this is to use a for loop with a large (but finite) number of iterations, together with break:

    -
    phi <- 3
    -for (k in seq_len(1000)) {
    +
    phi <- 20
    +while (TRUE) {
       phi.new <- 1+1/phi
       if (phi==phi.new) break
       phi <- phi.new
     }
    +

    While this while loop is equivalent to the one before, it does have the drawback that, if the break condition is never met, the loop will go on indefinitely. An alternative that avoids this is to use a for loop with a large (but finite) number of iterations, together with break:

    +
    phi <- 3
    +for (k in seq_len(1000)) {
    +  phi.new <- 1+1/phi
    +  if (phi==phi.new) break
    +  phi <- phi.new
    +}

    @@ -4564,12 +4563,12 @@
    Exercise

    repeat loops

    A third looping construct in R involves the repeat keyword. For example,

    -
    phi <- 12
    -repeat {
    -  phi.new <- 1/(1+phi)
    -  if (phi==phi.new) break
    -  phi <- phi.new
    -}
    +
    phi <- 12
    +repeat {
    +  phi.new <- 1/(1+phi)
    +  if (phi==phi.new) break
    +  phi <- phi.new
    +}

    In addition, R provides the next keyword, which, like break, is used in the body of a looping construct. Rather than terminating the iteration, however, it aborts the current iteration and leads to the immediate execution of the next iteration.

    For more information on looping and other control-flow constructs, execute ?Control.

    @@ -4580,9 +4579,9 @@

    Functions and environments

    Definitions and examples

    An extremely useful feature in R is the ability to write arbitrary functions. A function, in this context, is an algorithm that performs a specific computation that depends on inputs (the function’s arguments) and produces some output (the function’s value) and/or has some side effects. Let’s see how this is done.

    Here is a function that squares a number.

    -
    sq <- function (x) x^2
    +
    sq <- function (x) x^2

    The syntax is function (arglist) expr. The one argument in this case is x. When a particular value of x is supplied, R performs the squaring operation. The function then returns the value x^2:

    -
    sq(3); sq(9); sq(-2);
    +
    sq(3); sq(9); sq(-2);

    output

    [1] 9
    @@ -4596,12 +4595,12 @@

    Definitions and examples

    [1] 4

    Here is a function with two arguments and a more complex body, as we call the expression that is evaluated when the function is called.

    -
    f <- function (x, y = 3) {
    -  a <- sq(x)
    -  a+y
    -}
    +
    f <- function (x, y = 3) {
    +  a <- sq(x)
    +  a+y
    +}

    Here, the body is the R expression from { to }. Unless the return codeword is used elsewhere in the function body, the value returned is always the last expression evaluated. Thus:

    -
    f(3,0); f(2,2); f(3);
    +
    f(3,0); f(2,2); f(3);

    output

    [1] 9
    @@ -4616,23 +4615,23 @@

    Definitions and examples

    Note that in the last case, only one argument was supplied. In this case, y assumed its default value, 3.

    Note that functions need not be assigned to symbols; they can be anonymous:

    -
    function (x) x^5
    +
    function (x) x^5

    output

    function (x) x^5
    -
    (function (x) x^5)(2)
    +
    (function (x) x^5)(2)

    output

    [1] 32

    A function can also have side effects, e.g.,

    -
    hat <- "hat"
    -hattrick <- function (y) {
    -  hat <<- "rabbit"
    -  2*y
    -}
    -hat; hattrick(5); hat
    +
    hat <- "hat"
    +hattrick <- function (y) {
    +  hat <<- "rabbit"
    +  2*y
    +}
    +hat; hattrick(5); hat

    output

    [1] "hat"
    @@ -4649,12 +4648,12 @@

    Definitions and examples

    An aside

    If we want the function not to automatically print, we can wrap the return value in invisible():

    -
    hattrick <- function (y) {
    -  hat <<- "rabbit"
    -  invisible(2*y)
    -}
    -hattrick(5)
    -print(hattrick(5))
    +
    hattrick <- function (y) {
    +  hat <<- "rabbit"
    +  invisible(2*y)
    +}
    +hattrick(5)
    +print(hattrick(5))

    output

    [1] 10
    @@ -4666,12 +4665,12 @@
    An aside
  • its environment, i.e., the context in which the function was defined.
  • R provides simple functions to interrogate these function components:

    -
    formals(hattrick)
    +
    formals(hattrick)

    output

    $y
    -
    body(hattrick)
    +
    body(hattrick)

    output

    {
    @@ -4679,7 +4678,7 @@ 
    An aside
    invisible(2 * y) }
    -
    environment(hattrick)
    +
    environment(hattrick)

    output

    <environment: R_GlobalEnv>
    @@ -4691,17 +4690,17 @@

    Function scope

    Note: This section draws heavily, and sometimes verbatim, on §10.7 of the Introduction to R manual (R Core Team 2014a).

    As noted above, a paramount consideration in the implementation of functions in any programming language is that unintentional side effects should never occur. In particular, I should be free to write a function that creates temporary variables as an aid to its computations, and be able to rest assured that no variables I create temporarily will interfere with any other variables I’ve defined anywhere else. To accomplish this, R has a specific set of scoping rules.

    Consider the function

    -
    f <- function (x) {
    -  y <- 2*x
    -  print(x)
    -  print(y)
    -  print(z)
    -}
    +
    f <- function (x) {
    +  y <- 2*x
    +  print(x)
    +  print(y)
    +  print(z)
    +}

    In this function’s body, x is a formal parameter, y is a local variable, and z is a free, or unbound variable. When f is evaluated, each of these variables must be bound to some value. In R , the free variable bindings are resolved—each time the function is evaluated—by first looking in the environment where the function was created. This is called lexical scope. Thus if we execute

    -
    f(3)
    +
    f(3)

    we get an error, because no object named z can be found. If, however, we do

    -
    z <- 10
    -f(3)
    +
    z <- 10
    +f(3)

    output

    [1] 3
    @@ -4709,15 +4708,15 @@ 

    Function scope

    [1] 10

    we don’t get an error, because z is defined in the environment, <environment: R_GlobalEnv>, of f. Similarly, when we do

    -
    z <- 13
    -g <- function (x) {
    -  2*x+z
    -}
    -f <- function (x) {
    -  z <- -100
    -  g(x)
    -}
    -f(5)  
    +
    z <- 13
    +g <- function (x) {
    +  2*x+z
    +}
    +f <- function (x) {
    +  z <- -100
    +  g(x)
    +}
    +f(5)  

    output

    [1] 23
    @@ -4728,12 +4727,12 @@

    Function scope

    Nested functions and environments

    In each of the following examples, make sure you understand exactly what has happened.

    Consider this:

    -
    y <- 11
    -f <- function (x) {
    -  y <- 2*x
    -  y+x
    -}
    -f(1); y
    +
    y <- 11
    +f <- function (x) {
    +  y <- 2*x
    +  y+x
    +}
    +f(1); y

    output

    [1] 3
    @@ -4748,11 +4747,11 @@
    Exercise

    Why hasn’t y changed its value?


    What about this?

    -
    y <- 0
    -f <- function (x) {
    -  2*x+y
    -}
    -f(1); y
    +
    y <- 0
    +f <- function (x) {
    +  2*x+y
    +}
    +f(1); y

    output

    [1] 2
    @@ -4768,14 +4767,14 @@
    Exercise

    Why is the return value of f different?


    How about the following?

    -
    g <- function (y) y
    -f <- function (x) {
    -  g <- function (y) {
    -    2*y
    -  }
    -  11+g(x)
    -}
    -f(2); g(2)
    +
    g <- function (y) y
    +f <- function (x) {
    +  g <- function (y) {
    +    2*y
    +  }
    +  11+g(x)
    +}
    +f(2); g(2)

    output

    [1] 15
    @@ -4791,15 +4790,15 @@
    Exercise

    Be sure you understand what’s happening at each line and why you get the results you see.


    Another example:

    -
    y <- 11
    -f <- function (x) {
    -  y <- 22
    -  g <- function (x) {
    -    x+y
    -  }
    -  g(x)
    -}
    -f(1); y
    +
    y <- 11
    +f <- function (x) {
    +  y <- 22
    +  g <- function (x) {
    +    x+y
    +  }
    +  g(x)
    +}
    +f(1); y

    output

    [1] 23
    @@ -4815,16 +4814,16 @@
    Exercise

    Which value of y was used?


    Compare that with this:

    -
    y <- 11
    -f <- function (x) {
    -  y <- 22
    -  g <- function (x) {
    -    y <<- 7
    -    x+y
    -  }
    -  c(g(x),y)
    -}
    -f(1); y
    +
    y <- 11
    +f <- function (x) {
    +  y <- 22
    +  g <- function (x) {
    +    y <<- 7
    +    x+y
    +  }
    +  c(g(x),y)
    +}
    +f(1); y

    output

    [1] 8 7
    @@ -4834,14 +4833,14 @@
    Exercise
    [1] 11

    and this:

    -
    y <- 11
    -f <- function (x) {
    -  g <- function (x) {
    -    x+y
    -  }
    -  g(x)
    -}
    -f(1); y
    +
    y <- 11
    +f <- function (x) {
    +  g <- function (x) {
    +    x+y
    +  }
    +  g(x)
    +}
    +f(1); y

    output

    [1] 12
    @@ -4851,17 +4850,17 @@
    Exercise
    [1] 11

    and this:

    -
    f <- function (x) {
    -  y <- 37
    -  g <- function (x) {
    -    h <- function (x) {
    -      x+y
    -    }
    -    h(x)
    -  }
    -  g(x)
    -}
    -f(1); y
    +
    f <- function (x) {
    +  y <- 37
    +  g <- function (x) {
    +    h <- function (x) {
    +      x+y
    +    }
    +    h(x)
    +  }
    +  g(x)
    +}
    +f(1); y

    output

    [1] 38
    @@ -4871,15 +4870,15 @@
    Exercise
    [1] 11

    Finally, consider the following:

    -
    rm(y)
    -f <- function (x) {
    -  g <- function (x) {
    -    y <<- 2*x
    -    y+1
    -  }
    -  g(x)+1
    -}
    -f(11); y
    +
    rm(y)
    +f <- function (x) {
    +  g <- function (x) {
    +    y <<- 2*x
    +    y+1
    +  }
    +  g(x)+1
    +}
    +f(11); y

    output

    [1] 24
    @@ -4888,7 +4887,7 @@
    Exercise

    output

    [1] 22
    -
    f(-2); y
    +
    f(-2); y

    output

    [1] -2
    @@ -4904,60 +4903,60 @@
    Exercise

    In the above five code snippets, make sure you understand the differences.


    As mentioned above, each function is associated with an environment: the environment within which it was defined. When a function is evaluated, a new temporary environment is created, within which the function’s calculations are performed. Every new environment has a parent, the environment wherein it was created. The parent of this new environment is the function’s environment. To see this, try

    -
    f <- function () {
    -  g <- function () {
    -    h <- function () {
    -      cat("inside function h:\n")
    -      cat("current env: ")
    -      print(environment())
    -      cat("parent env: ")
    -      print(parent.frame(1))
    -      cat("grandparent env: ")
    -      print(parent.frame(2))
    -      cat("great-grandparent env: ")
    -      print(parent.frame(3))
    -      invisible(NULL)
    -    }
    -    cat("inside function g:\n")
    -    cat("environment of h: ")
    -    print(environment(h))
    -    cat("current env: ")
    -    print(environment())
    -    cat("parent env: ")
    -    print(parent.frame(1))
    -    cat("grandparent env: ")
    -    print(parent.frame(2))
    -    h()
    -  }
    -  cat("inside function f:\n")
    -  cat("environment of g: ")
    -  print(environment(g))
    -  cat("current env: ")
    -  print(environment())
    -  cat("parent env: ")
    -  print(parent.frame(1))
    -  g()
    -}
    -cat("environment of f: "); print(environment(f))
    -cat("global env: "); print(environment())
    -f()
    +
    f <- function () {
    +  g <- function () {
    +    h <- function () {
    +      cat("inside function h:\n")
    +      cat("current env: ")
    +      print(environment())
    +      cat("parent env: ")
    +      print(parent.frame(1))
    +      cat("grandparent env: ")
    +      print(parent.frame(2))
    +      cat("great-grandparent env: ")
    +      print(parent.frame(3))
    +      invisible(NULL)
    +    }
    +    cat("inside function g:\n")
    +    cat("environment of h: ")
    +    print(environment(h))
    +    cat("current env: ")
    +    print(environment())
    +    cat("parent env: ")
    +    print(parent.frame(1))
    +    cat("grandparent env: ")
    +    print(parent.frame(2))
    +    h()
    +  }
    +  cat("inside function f:\n")
    +  cat("environment of g: ")
    +  print(environment(g))
    +  cat("current env: ")
    +  print(environment())
    +  cat("parent env: ")
    +  print(parent.frame(1))
    +  g()
    +}
    +cat("environment of f: "); print(environment(f))
    +cat("global env: "); print(environment())
    +f()

    output

    environment of f: <environment: R_GlobalEnv>
     global env: <environment: R_GlobalEnv>
     inside function f:
    -environment of g: <environment: 0x55ffbfb892b0>
    -current env: <environment: 0x55ffbfb892b0>
    +environment of g: <environment: 0x55b266ba0018>
    +current env: <environment: 0x55b266ba0018>
     parent env: <environment: R_GlobalEnv>
     inside function g:
    -environment of h: <environment: 0x55ffbce116f8>
    -current env: <environment: 0x55ffbce116f8>
    -parent env: <environment: 0x55ffbfb892b0>
    +environment of h: <environment: 0x55b2632c3e20>
    +current env: <environment: 0x55b2632c3e20>
    +parent env: <environment: 0x55b266ba0018>
     grandparent env: <environment: R_GlobalEnv>
     inside function h:
    -current env: <environment: 0x55ffbce1bb90>
    -parent env: <environment: 0x55ffbce116f8>
    -grandparent env: <environment: 0x55ffbfb892b0>
    +current env: <environment: 0x55b2632b2aa8>
    +parent env: <environment: 0x55b2632c3e20>
    +grandparent env: <environment: 0x55b266ba0018>
     great-grandparent env: <environment: R_GlobalEnv>

    Each variable referenced in the function’s body is bound, first, to a formal argument if possible. If a local variable of that name has previously been created (via one of the assignment operators <-, ->, or =, this is the variable that is affected by any subsequent assignments. If the variable is neither a formal parameter nor a local variable, then the parent environment of the function is searched for that variable. If the variable has not been found in the parent environment, then the grand-parent environment is searched, and so on.

    @@ -4971,9 +4970,9 @@

    The apply family of functions

    List apply: lapply

    lapply applies a function to each element of a list or vector, returning a list.

    -
    x <- list("teenage","mutant","ninja","turtle",
    -          "hamster","plumber","pickle","baby")
    -lapply(x,nchar)
    +
    x <- list("teenage","mutant","ninja","turtle",
    +          "hamster","plumber","pickle","baby")
    +lapply(x,nchar)

    output

    [[1]]
    @@ -5000,9 +4999,9 @@ 

    List apply: lapply

    [[8]] [1] 4
    -
    y <- c("teenage","mutant","ninja","turtle",
    -       "hamster","plumber","pickle","baby")
    -lapply(y,nchar)
    +
    y <- c("teenage","mutant","ninja","turtle",
    +       "hamster","plumber","pickle","baby")
    +lapply(y,nchar)

    output

    [[1]]
    @@ -5033,16 +5032,16 @@ 

    List apply: lapply

    Sloppy list apply: sapply

    sapply isn’t content to always return a list: it attempts to simplify the results into a non-list vector if possible.

    -
    x <- list("pizza","monster","jigsaw","puddle",
    -          "hamster","plumber","pickle","baby")
    -sapply(x,nchar)
    +
    x <- list("pizza","monster","jigsaw","puddle",
    +          "hamster","plumber","pickle","baby")
    +sapply(x,nchar)

    output

    [1] 5 7 6 6 7 7 6 4
    -
    y <- c("pizza","monster","jigsaw","puddle",
    -       "hamster","plumber","pickle","baby")
    -sapply(y,nchar)
    +
    y <- c("pizza","monster","jigsaw","puddle",
    +       "hamster","plumber","pickle","baby")
    +sapply(y,nchar)

    output

      pizza monster  jigsaw  puddle hamster plumber  pickle 
    @@ -5054,9 +5053,9 @@ 

    Sloppy list apply: sapply

    Multiple-list apply: mapply

    mapply is a multiple-argument version of sapply:

    -
    x <- c("pizza","monster","jigsaw","puddle")
    -y <- c("cowboy","barbie","slumber","party")
    -mapply(paste,x,y,sep="/")
    +
    x <- c("pizza","monster","jigsaw","puddle")
    +y <- c("cowboy","barbie","slumber","party")
    +mapply(paste,x,y,sep="/")

    output

               pizza          monster           jigsaw 
    @@ -5065,7 +5064,7 @@ 

    Multiple-list apply: mapply

    "puddle/party"

    As usual, the recycling rule applies:

    -
    mapply(paste,x,y[2:3])
    +
    mapply(paste,x,y[2:3])

    output

                pizza           monster            jigsaw 
    @@ -5073,7 +5072,7 @@ 

    Multiple-list apply: mapply

    puddle "puddle slumber"
    -
    mapply(paste,x[c(1,3)],y)
    +
    mapply(paste,x[c(1,3)],y)

    output

              pizza          jigsaw            <NA> 
    @@ -5085,7 +5084,7 @@ 

    Multiple-list apply: mapply

    Array apply: apply

    apply is very powerful and a bit more complex. It allows an arbitrary function to applied to each slice of an array, where the slices can be defined in all possible ways. Let’s create a matrix:

    -
    A <- array(data=seq_len(15),dim=c(3,5)); A
    +
    A <- array(data=seq_len(15),dim=c(3,5)); A

    output

         [,1] [,2] [,3] [,4] [,5]
    @@ -5094,19 +5093,19 @@ 

    Array apply: apply

    [3,] 3 6 9 12 15

    To apply an operation to each row, we marginalize over the first dimension (rows). For example, to sum the rows, we’d do

    -
    apply(A,1,sum)
    +
    apply(A,1,sum)

    output

    [1] 35 40 45

    To sum the columns (the second dimension), we’d do

    -
    apply(A,2,sum)
    +
    apply(A,2,sum)

    output

    [1]  6 15 24 33 42

    Now suppose we have a 3-dimensional array:

    -
    A <- array(data=seq_len(30),dim=c(3,5,2)); A
    +
    A <- array(data=seq_len(30),dim=c(3,5,2)); A

    output

    , , 1
    @@ -5124,7 +5123,7 @@ 

    Array apply: apply

    [3,] 18 21 24 27 30

    To sum the rows within each slice, we’d do

    -
    apply(A,c(1,3),sum)
    +
    apply(A,c(1,3),sum)

    output

         [,1] [,2]
    @@ -5133,14 +5132,14 @@ 

    Array apply: apply

    [3,] 45 120

    while to sum the slices, we’d do

    -
    apply(A,3,sum)
    +
    apply(A,3,sum)

    output

    [1] 120 345

    For each of the above, make sure you understand exactly what has happened.

    Of course, we can apply an anonymous function wherever we apply a named function:

    -
    apply(A,c(2,3),function (x) sd(x)/sqrt(length(x)))
    +
    apply(A,c(2,3),function (x) sd(x)/sqrt(length(x)))

    output

              [,1]      [,2]
    @@ -5151,7 +5150,7 @@ 

    Array apply: apply

    [5,] 0.5773503 0.5773503

    If any additional arguments are given to apply, these are passed to the function. For example, make sure you understand what is happening in the lines:

    -
    apply(A,c(1,2),function (x, y) sum(x>y),y=8)
    +
    apply(A,c(1,2),function (x, y) sum(x>y),y=8)

    output

         [,1] [,2] [,3] [,4] [,5]
    @@ -5159,7 +5158,7 @@ 

    Array apply: apply

    [2,] 1 1 1 2 2 [3,] 1 1 2 2 2
    -
    apply(A,c(1,2),function (x, y) sum(x>y),y=-1)
    +
    apply(A,c(1,2),function (x, y) sum(x>y),y=-1)

    output

         [,1] [,2] [,3] [,4] [,5]
    @@ -5171,9 +5170,9 @@ 

    Array apply: apply

    Table apply: tapply

    tapply is, in a way, an extension of table. The syntax is tapply(X,INDEX,FUN,...), where X is a vector, INDEX is a list of one or more factors, each the same length as X, and FUN is a function. The vector X will be split into subvectors according to INDEX, and FUN will be applied to each of the subvectors. By default, the result is simplified into an array if possible. Some examples:

    -
    x <- seq(1,30,by=1)
    -b <- rep(letters[1:10],times=3)
    -data.frame(x,b)
    +
    x <- seq(1,30,by=1)
    +b <- rep(letters[1:10],times=3)
    +data.frame(x,b)

    output

      x b
    @@ -5208,14 +5207,14 @@ 

    Table apply: tapply

    29 i 30 j
    -
    tapply(x,b,sum)
    +
    tapply(x,b,sum)

    output

     a  b  c  d  e  f  g  h  i  j 
     33 36 39 42 45 48 51 54 57 60 
    -
    b <- rep(letters[1:10],each=3)
    -data.frame(x,b)
    +
    b <- rep(letters[1:10],each=3)
    +data.frame(x,b)

    output

      x b
    @@ -5250,18 +5249,18 @@ 

    Table apply: tapply

    29 j 30 j
    -
    tapply(x,b,sum)
    +
    tapply(x,b,sum)

    output

     a  b  c  d  e  f  g  h  i  j 
      6 15 24 33 42 51 60 69 78 87 

    The following uses the seed predation data described by Bolker (2008).

    -
    datafile <- "https://kingaa.github.io/R_Tutorial/data/seedpred.dat"
    -seeds <- read.table(datafile,header=TRUE,
    -                    colClasses=c(station='factor',dist='factor',date='Date'))
    -x <- subset(seeds,available>0)
    -with(x, tapply(tcum,list(dist,station),max,na.rm=TRUE))
    +
    datafile <- "https://kingaa.github.io/R_Tutorial/data/seedpred.dat"
    +seeds <- read.table(datafile,header=TRUE,
    +                    colClasses=c(station='factor',dist='factor',date='Date'))
    +x <- subset(seeds,available>0)
    +with(x, tapply(tcum,list(dist,station),max,na.rm=TRUE))

    output

        1  10 100 101 102 103 104 105 106 107 108 109 11 110
    @@ -5302,14 +5301,14 @@ 

    Table apply: tapply

    sapply with expected result: vapply

    When we could use sapply and we know exactly what the size and class of the value of the function will be, it is sometimes faster to use vapply. The syntax is like that of sapply: vapply(X,FUN,FUN.VALUE,...), where X and FUN are as in sapply, but we specify the size and class of the value of FUN via the FUN.VALUE argument. For example, suppose we define a function that, given a number between 1 and 26 will return the corresponding letter of the alphabet:

    -
    alph <- function (x) {
    -  stopifnot(x >= 1 && x <= 26)
    -  LETTERS[as.integer(x)]
    -}
    +
    alph <- function (x) {
    +  stopifnot(x >= 1 && x <= 26)
    +  LETTERS[as.integer(x)]
    +}

    This function will return a vector of length 1 and class character. To apply it to a randomly sampled set of integers, we might do

    -
    x <- sample(1:26,50,replace=TRUE)
    -y <- vapply(x,alph,character(1))
    -paste(y,collapse="")
    +
    x <- sample(1:26,50,replace=TRUE)
    +y <- vapply(x,alph,character(1))
    +paste(y,collapse="")

    output

    [1] "ORIFWKBWAMEJBZUMLKRJVEOIKQXEYRNTGAXXXUVQTZIHQGANTI"
    @@ -5321,30 +5320,30 @@

    sapply with expected result: vapply

    Vectorized functions vs loops

    As Ligges & Fox (Ligges and Fox 2008) point out, the idea that one should avoid loops wherever possible in R, using instead vectorized functions like those in the apply family, is quite widespread in some quarters. The belief, which probably dates back to infelicities in early versions of S and Splus but is remarkably persistent, is that loops are very slow in R. Let’s have a critical look at this.

    Consider the following loop code that can be vectorized:

    -
    x <- runif(n=1e6,min=0,max=2*pi)
    -y <- numeric(length(x))
    -for (k in seq_along(x)) {
    -  y[k] <- sin(x[k])
    -}
    -

    To time this, we can wrap the vectorizable parts in a call to system.time:

    x <- runif(n=1e6,min=0,max=2*pi)
    -system.time({
    -  y <- numeric(length(x))
    -  for (k in seq_along(x)) {
    -    y[k] <- sin(x[k])
    -  }
    -})
    +y <- numeric(length(x)) +for (k in seq_along(x)) { + y[k] <- sin(x[k]) +}
    +

    To time this, we can wrap the vectorizable parts in a call to system.time:

    +
    x <- runif(n=1e6,min=0,max=2*pi)
    +system.time({
    +  y <- numeric(length(x))
    +  for (k in seq_along(x)) {
    +    y[k] <- sin(x[k])
    +  }
    +})

    output

       user  system elapsed 
    -  0.092   0.000   0.093 
    + 0.095 0.000 0.095

    We can compare this with a simple call to sin (which is vectorized):

    -
    system.time(z <- sin(x))
    +
    system.time(z <- sin(x))

    output

       user  system elapsed 
    -  0.020   0.000   0.021 
    + 0.023 0.000 0.023

    Clearly, calling sin directly is much faster. What about using sapply?


    @@ -5353,55 +5352,55 @@
    Exercise

    Compare the time spent on the equivalent calculation, using sapply. What does this result tell you about where the computational cost is incurred?


    The above example is very simple in that there is a builtin function (sin in this case) which is capable of the fully vectorized computation. In such a case, it is clearly preferable to use it. Frequently, however, no such builtin function exists, i.e., we have a custom function of our own we want to apply to a set of data. Let’s compare the relative speeds of loops and sapply in this case.

    -
    x <- seq.int(from=20,to=1e6,by=10)
    -f <- function (x) {
    -  (((x+1)*x+1)*x+1)*x+1
    -}
    -system.time({
    -  res1 <- numeric(length(x))
    -  for (k in seq_along(x)) {
    -    res1[k] <- f(x[k])
    -  }
    -})
    +
    x <- seq.int(from=20,to=1e6,by=10)
    +f <- function (x) {
    +  (((x+1)*x+1)*x+1)*x+1
    +}
    +system.time({
    +  res1 <- numeric(length(x))
    +  for (k in seq_along(x)) {
    +    res1[k] <- f(x[k])
    +  }
    +})

    output

       user  system elapsed 
    -  0.069   0.000   0.069 
    + 0.071 0.000 0.071
    -
    system.time(res2 <- sapply(x,f))
    +
    system.time(res2 <- sapply(x,f))

    output

       user  system elapsed 
    -  0.072   0.000   0.071 
    + 0.078 0.001 0.078

    [Actually, in this case, f is vectorized automatically. Why is this?]

    -
    system.time(f(x))
    +
    system.time(f(x))

    output

       user  system elapsed 
    -  0.000   0.000   0.001 
    + 0.001 0.000 0.000

    Another example: in this case function g is not vectorized.

    -
    g <- function (x) {
    -  if ((x[1] > 30) && (x[1] < 5000)) 1 else 0
    -}
    -
    -system.time({
    -  res1 <- numeric(length(x))
    -  for (k in seq_along(x)) {
    -    res1[k] <- g(x[k])
    -  }
    -})
    +
    g <- function (x) {
    +  if ((x[1] > 30) && (x[1] < 5000)) 1 else 0
    +}
    +
    +system.time({
    +  res1 <- numeric(length(x))
    +  for (k in seq_along(x)) {
    +    res1[k] <- g(x[k])
    +  }
    +})

    output

       user  system elapsed 
    -  0.066   0.000   0.065 
    + 0.068 0.000 0.069
    -
    system.time(res2 <- sapply(x,g))
    +
    system.time(res2 <- sapply(x,g))

    output

       user  system elapsed 
    -   0.07    0.00    0.07 
    + 0.075 0.000 0.075