Note that in R, there are usually multiple ways to tackle the same problem. Subsetting, filtering, renaming, etc., can also be achieved via base R.
- R for data science (Wickham&Grolemund, 2017): a good guide to the tidyverse. Equally importantly, it can teach you about the phylosphy of tidy data, and the general “know-how” of data processing and analysis. The book has plenty of exercises for you to build “coding muscle”. You can find unofficial solutions tot he exercises here.
- Cheatsheets covering tidyverse topics, which you can find online, or in RStudio itself (Help > Cheatsheets).
- Crawley (2013): complete reference for basci coding, mathematics, statistics, etc.
We get messages related to conflicts and masking, since there are
functions names that are repeated across packages. Writing the package
that contains our function of interest, like this:
package::function()
, can make you more mindful about the specific
package and functions involved. Alternatively, you can load the
conflicted
library, which will throw an error message if you use a
function with a name that is repated across packages.
library(microbenchmark) # compare time of different functions
library(multcomp) # to get a compact letter display (cld) when microbenchmarking
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
I will demonstrate most of the tidyverse functions using a simplified
version of the starwars
dataset, containing characteristics of 87
characters from the Star Wars franchise. Type ?starwars
if you want to
learn more about this tibble.
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Let’s simplify the dataset, removing the columns that contain lists (it
reduces the length of summary()
or str()
calls), changing the type
into a data frame, and keeping only the 10 first observations
starwars2 <-
starwars %>%
# where() is a selection helper to select columns based on functions that return TRUE or FALSE
# in this case, a function that returns TRUE or FALSE depending on whether the column contains lists
dplyr::select(!where(is.list)) %>%
data.frame() %>%
# no need to keep all the columns
dplyr::slice(1:10)
We can explore the data frame with the summary()
and str()
function.
summary(starwars2)
## name height mass hair_color
## Length:10 Min. : 96.0 Min. : 32.00 Length:10
## Class :character 1st Qu.:153.8 1st Qu.: 55.50 Class :character
## Mode :character Median :169.5 Median : 76.00 Mode :character
## Mean :159.2 Mean : 75.70
## 3rd Qu.:181.0 3rd Qu.: 82.25
## Max. :202.0 Max. :136.00
##
## skin_color eye_color birth_year sex
## Length:10 Length:10 Min. : 19.00 Length:10
## Class :character Class :character 1st Qu.: 24.00 Class :character
## Mode :character Mode :character Median : 41.90 Mode :character
## Mean : 44.99
## 3rd Qu.: 52.00
## Max. :112.00
## NA's :1
## gender homeworld species
## Length:10 Length:10 Length:10
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
str(starwars2)
## 'data.frame': 10 obs. of 11 variables:
## $ name : chr "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int 172 167 96 202 150 178 165 97 183 182
## $ mass : num 77 75 32 136 49 120 75 32 84 77
## $ hair_color: chr "blond" NA NA "none" ...
## $ skin_color: chr "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr "blue" "yellow" "red" "yellow" ...
## $ birth_year: num 19 112 33 41.9 19 52 47 NA 24 57
## $ sex : chr "male" "none" "none" "male" ...
## $ gender : chr "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr "Human" "Droid" "Droid" "Human" ...
We can explore specific properties of the data, such as the presence of
missing values. Although the summary()
function counts the number of
NA’s, it appears that it only does so for numerical columns.
We can use the apply(X, MARGIN, FUN)
function, which applies a
function (FUN
) over the rows (MARGIN = 1
) or columns (MARGIN = 2
)
of a matrix or data frame (X
). In this case, we can apply a
user-written function that counts the number of missing values per
column.
# it contains multiple na
apply(starwars2, MARGIN = 2, FUN = function(x) sum(is.na(x)))
## name height mass hair_color skin_color eye_color birth_year
## 0 0 0 3 0 0 1
## sex gender homeworld species
## 0 0 0 0
I will demonstrate functions using pipes: %>%
. The pipe %>%
passes
the object on the left to the first argument of the function on the
right. The dot .
can be used to specify the argument to which the
object is passed.
Throughout the script, I will demonstrate the effect of the different
functions by using the structure()
function, or str()
for short.
Instead of writing: str(starwars2)
, I will use the pipe:
starwars2 %>%
str()
## 'data.frame': 10 obs. of 11 variables:
## $ name : chr "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int 172 167 96 202 150 178 165 97 183 182
## $ mass : num 77 75 32 136 49 120 75 32 84 77
## $ hair_color: chr "blond" NA NA "none" ...
## $ skin_color: chr "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr "blue" "yellow" "red" "yellow" ...
## $ birth_year: num 19 112 33 41.9 19 52 47 NA 24 57
## $ sex : chr "male" "none" "none" "male" ...
## $ gender : chr "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr "Human" "Droid" "Droid" "Human" ...
Also note, the following functions will be demonstrated with a data frame, and there are additional formats for tabular data, such as tibble and data.table.
The filter()
takes a data frame, and returns a data frame with the
rows filtered by a set of logical conditions on a set of columns.
# keep only characters whose homeworld is Tatooine
starwars2 %>%
filter(homeworld == "Tatooine")
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 C-3PO 167 75 <NA> gold yellow 112.0
## 3 Darth Vader 202 136 none white yellow 41.9
## 4 Owen Lars 178 120 brown, grey light blue 52.0
## 5 Beru Whitesun lars 165 75 brown light blue 47.0
## 6 R5-D4 97 32 <NA> white, red red NA
## 7 Biggs Darklighter 183 84 black light brown 24.0
## sex gender homeworld species
## 1 male masculine Tatooine Human
## 2 none masculine Tatooine Droid
## 3 male masculine Tatooine Human
## 4 male masculine Tatooine Human
## 5 female feminine Tatooine Human
## 6 none masculine Tatooine Droid
## 7 male masculine Tatooine Human
# Keep only characters from Tatooine and that are above a certain height
starwars2 %>%
filter(homeworld == "Tatooine" & height > 170)
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 Darth Vader 202 136 none white yellow 41.9
## 3 Owen Lars 178 120 brown, grey light blue 52.0
## 4 Biggs Darklighter 183 84 black light brown 24.0
## sex gender homeworld species
## 1 male masculine Tatooine Human
## 2 male masculine Tatooine Human
## 3 male masculine Tatooine Human
## 4 male masculine Tatooine Human
When filtering, be mindful of missing values. See below different filtering approaches that take missing values into account.
nrow(starwars2)
## [1] 10
apply(starwars2, MARGIN = 2, FUN = function(x) sum(is.na(x)))
## name height mass hair_color skin_color eye_color birth_year
## 0 0 0 3 0 0 1
## sex gender homeworld species
## 0 0 0 0
# the table() function accepts an argument to specify whether you want data on NA
table(starwars2$hair_color, useNA = "always")
##
## auburn, white black blond brown brown, grey
## 1 1 1 2 1
## none <NA>
## 1 3
# we get only 6 rows because the NA have been filtered out
starwars2 %>%
filter(hair_color != "none")
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19
## 2 Leia Organa 150 49 brown light brown 19
## 3 Owen Lars 178 120 brown, grey light blue 52
## 4 Beru Whitesun lars 165 75 brown light blue 47
## 5 Biggs Darklighter 183 84 black light brown 24
## 6 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57
## sex gender homeworld species
## 1 male masculine Tatooine Human
## 2 female feminine Alderaan Human
## 3 male masculine Tatooine Human
## 4 female feminine Tatooine Human
## 5 male masculine Tatooine Human
## 6 male masculine Stewjon Human
# this code yields the same result
starwars2 %>%
filter(hair_color != "none" & !is.na(hair_color))
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19
## 2 Leia Organa 150 49 brown light brown 19
## 3 Owen Lars 178 120 brown, grey light blue 52
## 4 Beru Whitesun lars 165 75 brown light blue 47
## 5 Biggs Darklighter 183 84 black light brown 24
## 6 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57
## sex gender homeworld species
## 1 male masculine Tatooine Human
## 2 female feminine Alderaan Human
## 3 male masculine Tatooine Human
## 4 female feminine Tatooine Human
## 5 male masculine Tatooine Human
## 6 male masculine Stewjon Human
# use the OR logical operator (|) to keep characters that don't have hair or no data on hair
starwars2 %>%
filter(hair_color != "none" | is.na(hair_color))
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19
## 2 C-3PO 167 75 <NA> gold yellow 112
## 3 R2-D2 96 32 <NA> white, blue red 33
## 4 Leia Organa 150 49 brown light brown 19
## 5 Owen Lars 178 120 brown, grey light blue 52
## 6 Beru Whitesun lars 165 75 brown light blue 47
## 7 R5-D4 97 32 <NA> white, red red NA
## 8 Biggs Darklighter 183 84 black light brown 24
## 9 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57
## sex gender homeworld species
## 1 male masculine Tatooine Human
## 2 none masculine Tatooine Droid
## 3 none masculine Naboo Droid
## 4 female feminine Alderaan Human
## 5 male masculine Tatooine Human
## 6 female feminine Tatooine Human
## 7 none masculine Tatooine Droid
## 8 male masculine Tatooine Human
## 9 male masculine Stewjon Human
The summarise()
function takes a data frame, and returns a data frame
with fewer rows. If there is no grouping variable (declared through
group_by()
), it will return a data frame with a single row summarizing
all observations. summarize()
is a synonym of summarise
.
mean(starwars2$height, na.rm = T)
## [1] 159.2
# we get the same value as in the previous base code approach, but in data frame format
starwars2 %>%
summarise(mean.height = mean(height, na.rm = T))
## mean.height
## 1 159.2
The group_by()
function changes the type of the data that enters into
summarise()
, as well as the behavior of summarise()
.
str(starwars2)
## 'data.frame': 10 obs. of 11 variables:
## $ name : chr "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int 172 167 96 202 150 178 165 97 183 182
## $ mass : num 77 75 32 136 49 120 75 32 84 77
## $ hair_color: chr "blond" NA NA "none" ...
## $ skin_color: chr "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr "blue" "yellow" "red" "yellow" ...
## $ birth_year: num 19 112 33 41.9 19 52 47 NA 24 57
## $ sex : chr "male" "none" "none" "male" ...
## $ gender : chr "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr "Human" "Droid" "Droid" "Human" ...
# Note that the type of object has changed
starwars2 %>%
group_by(homeworld) %>%
str()
## gropd_df [10 × 11] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ name : chr [1:10] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int [1:10] 172 167 96 202 150 178 165 97 183 182
## $ mass : num [1:10] 77 75 32 136 49 120 75 32 84 77
## $ hair_color: chr [1:10] "blond" NA NA "none" ...
## $ skin_color: chr [1:10] "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr [1:10] "blue" "yellow" "red" "yellow" ...
## $ birth_year: num [1:10] 19 112 33 41.9 19 52 47 NA 24 57
## $ sex : chr [1:10] "male" "none" "none" "male" ...
## $ gender : chr [1:10] "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr [1:10] "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr [1:10] "Human" "Droid" "Droid" "Human" ...
## - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ homeworld: chr [1:4] "Alderaan" "Naboo" "Stewjon" "Tatooine"
## ..$ .rows : list<int> [1:4]
## .. ..$ : int 5
## .. ..$ : int 3
## .. ..$ : int 10
## .. ..$ : int [1:7] 1 2 4 6 7 8 9
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
The group_by()
function changes the behavior of summarise()
: instead
of providing a single summary value, it provides a summary value per
category of the grouping variable.
When using such functions, it is a good idea to think ahead: how many
rows should we expect? You can use n_distinct()
to check how many
different groups are there, but beware of missing values.
n_distinct(starwars$homeworld)
## [1] 49
n_distinct(starwars$homeworld, na.rm = TRUE)
## [1] 48
starwars2 %>%
group_by(homeworld) %>%
summarise(mean.height = mean(height, na.rm = TRUE))
## # A tibble: 4 × 2
## homeworld mean.height
## <chr> <dbl>
## 1 Alderaan 150
## 2 Naboo 96
## 3 Stewjon 182
## 4 Tatooine 166.
The summarized dataset can be piped to ggplot()
for visualization.
Remember to use the plus symbol (+
) rather than the pipe (%>%
)
inside the ggplot()
call. A couple of functions to visualize the
datasets are: arrange()
(orders the rows of a data frame by the values
of selected columns), and desc()
(sorts in descending order: from
greatest to lowest).
# Calculate the number of characters per planet, and sort the data frame in descending order
(star.char <-
starwars2 %>%
filter(!is.na(homeworld)) %>%
group_by(homeworld) %>%
summarise(n.characters = n_distinct(name))) %>%
arrange(desc(n.characters))
## # A tibble: 4 × 2
## homeworld n.characters
## <chr> <int>
## 1 Tatooine 7
## 2 Alderaan 1
## 3 Naboo 1
## 4 Stewjon 1
You might be interested in using the results of a summary to filter the original data. For instance, keep only data from the planets that have 2 characters.
starwars2 %>%
filter(!is.na(homeworld)) %>%
group_by(homeworld) %>%
filter(n_distinct(name) == 2)
## # A tibble: 0 × 11
## # Groups: homeworld [0]
## # … with 11 variables: name <chr>, height <int>, mass <dbl>, hair_color <chr>,
## # skin_color <chr>, eye_color <chr>, birth_year <dbl>, sex <chr>,
## # gender <chr>, homeworld <chr>, species <chr>
We can combine mutate()
with if_else()
or case_when()
to create a
new column, whose values depend on some logical conditions. if_else()
handles a single logical condition (although you can evaluate multiple
conditions by writing if_else()
statements inside other if_else()
statements); while case_when()
can handle many logical conditions.
We can combine if_else()
with mutate()
to generate a new variable,
with values that depend on the values of other variables.
# Tatooine is a planet covered in deserts (there are others, but this code is to illustrate the function)
starwars2 %>%
dplyr::select(homeworld) %>%
# keep the first 10 rows
dplyr::slice(1:10) %>%
dplyr::mutate(biome = if_else(
condition = homeworld == "Tatooine",
true = "desert",
false = "other"
))
## homeworld biome
## 1 Tatooine desert
## 2 Tatooine desert
## 3 Naboo other
## 4 Tatooine desert
## 5 Alderaan other
## 6 Tatooine desert
## 7 Tatooine desert
## 8 Tatooine desert
## 9 Tatooine desert
## 10 Stewjon other
One could combine multiple if_else()
statements to evaluate multiple
logical conditions.
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = if_else(hair_color == "blond", true = "blond", false = if_else(hair_color == "black", true = "black", false = "other")))
## hair_color hair
## 1 blond blond
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 none other
## 5 brown other
## 6 brown, grey other
## 7 brown other
## 8 <NA> <NA>
## 9 black black
## 10 auburn, white other
Luckily for us, the case_when()
function allows to vectorise multiple
if_else()
statements. It consists of two-sided formulas: the left hand
side determines the logical conditions to be evaluated, the right hand
side of the formula determines the replacement values, which have to be
of the same type (all of them character, all of them numeric, etc). If
there is no match, the default is to return NA
, but you can change it
through TRUE ~ your_desired_output
# we can re-write the previous nested if_else like this
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(
hair = case_when(
hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black",
!is.na(hair_color) & !(hair_color %in% c("black", "blond")) ~ "other"
# the default behaviour in case of no match is to return NA (or NA_character_ to be consistent with the other vectors)
#TRUE ~ NA_character_
)
)
## hair_color hair
## 1 blond blond
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 none other
## 5 brown other
## 6 brown, grey other
## 7 brown other
## 8 <NA> <NA>
## 9 black black
## 10 auburn, white other
# if there is no match, NA is returned
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = case_when(hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black"))
## hair_color hair
## 1 blond blond
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 none <NA>
## 5 brown <NA>
## 6 brown, grey <NA>
## 7 brown <NA>
## 8 <NA> <NA>
## 9 black black
## 10 auburn, white <NA>
# this lumps together no data with different hair color
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = case_when(hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black",
TRUE ~ "other"))
## hair_color hair
## 1 blond blond
## 2 <NA> other
## 3 <NA> other
## 4 none other
## 5 brown other
## 6 brown, grey other
## 7 brown other
## 8 <NA> other
## 9 black black
## 10 auburn, white other
# this makes a differnece between no data and different hair color
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = case_when(hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black",
is.na(hair_color) ~ "no data",
TRUE ~ "other"))
## hair_color hair
## 1 blond blond
## 2 <NA> no data
## 3 <NA> no data
## 4 none other
## 5 brown other
## 6 brown, grey other
## 7 brown other
## 8 <NA> no data
## 9 black black
## 10 auburn, white other
But what if we want to specify NA as the output in case there is no
data? Mind you that NA
is a logical, so you need to specify
NA_character_
if you want strings as the otuput.
class(NA)
## [1] "logical"
class(NA_character_)
## [1] "character"
# The right hand side of the formula needs to be of the same type, therefore, this won't work
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = case_when(hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black",
is.na(hair_color) ~ NA,
TRUE ~ "other"))
## Error in `mutate()`:
## ! Problem while computing `hair = case_when(...)`.
## Caused by error in `` names(message) <- `*vtmp*` ``:
## ! el atributo 'names' [1] debe tener la misma longitud que el vector [0]
# we can specify NA_character_ so the right hand side of the formula is of consistent type
starwars2 %>%
dplyr::select(hair_color) %>%
dplyr::slice(1:10) %>%
mutate(hair = case_when(hair_color == "blond" ~ "blond",
hair_color == "black" ~ "black",
is.na(hair_color) ~ NA_character_,
TRUE ~ "other"))
## hair_color hair
## 1 blond blond
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 none other
## 5 brown other
## 6 brown, grey other
## 7 brown other
## 8 <NA> <NA>
## 9 black black
## 10 auburn, white other
The rowwise()
function allows to compute on a data frame one row at a
time. Similarly to group_by()
, it changes how verbs operate on the
data frame, and it also changes the type of the object. See some
examples below for a made-up data frame containing data on temperature
and precipitation recorded on 2-4 different days, across 4 sites.
(data.row <-
data.frame(Site = c("Site1", "Site2", "Site3", "Site4"),
Temp1 = c(10, 5, 8, 6),
Temp2 = c(11, 6, 5, 4),
Temp3 = c(12, 14, 15, 16),
Temp4 = c(10, 4, 7, 12),
Prec1 = c(3, 4, 5, 6),
Prec2 = c(4, 2, 2, 8)))
## Site Temp1 Temp2 Temp3 Temp4 Prec1 Prec2
## 1 Site1 10 11 12 10 3 4
## 2 Site2 5 6 14 4 4 2
## 3 Site3 8 5 15 7 5 2
## 4 Site4 6 4 16 12 6 8
str(data.row)
## 'data.frame': 4 obs. of 7 variables:
## $ Site : chr "Site1" "Site2" "Site3" "Site4"
## $ Temp1: num 10 5 8 6
## $ Temp2: num 11 6 5 4
## $ Temp3: num 12 14 15 16
## $ Temp4: num 10 4 7 12
## $ Prec1: num 3 4 5 6
## $ Prec2: num 4 2 2 8
# the object type has changed
data.row %>%
rowwise() %>%
str()
## rowws_df [4 × 7] (S3: rowwise_df/tbl_df/tbl/data.frame)
## $ Site : chr [1:4] "Site1" "Site2" "Site3" "Site4"
## $ Temp1: num [1:4] 10 5 8 6
## $ Temp2: num [1:4] 11 6 5 4
## $ Temp3: num [1:4] 12 14 15 16
## $ Temp4: num [1:4] 10 4 7 12
## $ Prec1: num [1:4] 3 4 5 6
## $ Prec2: num [1:4] 4 2 2 8
## - attr(*, "groups")= tibble [4 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:4]
## .. ..$ : int 1
## .. ..$ : int 2
## .. ..$ : int 3
## .. ..$ : int 4
## .. ..@ ptype: int(0)
The grouping can be undone with ungroup
(you might need an additional
function to remove the tibble class in case you don’t want it).
data.row %>%
rowwise() %>%
ungroup() %>%
str()
## tibble [4 × 7] (S3: tbl_df/tbl/data.frame)
## $ Site : chr [1:4] "Site1" "Site2" "Site3" "Site4"
## $ Temp1: num [1:4] 10 5 8 6
## $ Temp2: num [1:4] 11 6 5 4
## $ Temp3: num [1:4] 12 14 15 16
## $ Temp4: num [1:4] 10 4 7 12
## $ Prec1: num [1:4] 3 4 5 6
## $ Prec2: num [1:4] 4 2 2 8
To specify which columns should be used, you may use c()
or
c_across()
inside the function you are calling. c_across
can be
combined with tidy selection, which is always helpful.
# Calculate the minimum value per row, mind to remove the Site column which is a factor
data.row %>%
rowwise() %>%
mutate(min = min(c_across(-starts_with("Site"))))
## # A tibble: 4 × 8
## # Rowwise:
## Site Temp1 Temp2 Temp3 Temp4 Prec1 Prec2 min
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Site1 10 11 12 10 3 4 3
## 2 Site2 5 6 14 4 4 2 2
## 3 Site3 8 5 15 7 5 2 2
## 4 Site4 6 4 16 12 6 8 4
# you may specify the variables of interest individually
data.row %>%
rowwise() %>%
mutate(min = min(c(Temp1, Temp2, Temp3)))
## # A tibble: 4 × 8
## # Rowwise:
## Site Temp1 Temp2 Temp3 Temp4 Prec1 Prec2 min
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Site1 10 11 12 10 3 4 10
## 2 Site2 5 6 14 4 4 2 5
## 3 Site3 8 5 15 7 5 2 5
## 4 Site4 6 4 16 12 6 8 4
# Calculate the minimum value per row, but only for temperature
data.row %>%
rowwise() %>%
mutate(min = min(c_across(starts_with("Temp"))))
## # A tibble: 4 × 8
## # Rowwise:
## Site Temp1 Temp2 Temp3 Temp4 Prec1 Prec2 min
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Site1 10 11 12 10 3 4 10
## 2 Site2 5 6 14 4 4 2 4
## 3 Site3 8 5 15 7 5 2 5
## 4 Site4 6 4 16 12 6 8 4
# You may also obtain a reduced data frame by calling summarise()
data.row %>%
rowwise() %>%
summarise(min.temp = min(c_across(starts_with("Temp"))),
min.prec = min(c(Prec1, Prec2)))
## # A tibble: 4 × 2
## min.temp min.prec
## <dbl> <dbl>
## 1 10 3
## 2 4 2
## 3 5 2
## 4 4 6
# Alternatively, you can keep only an identifier column, so it is preserved when you call summarise()
data.row %>%
rowwise(Site) %>%
summarise(min.temp = min(c_across(starts_with("Temp"))),
min.prec = min(c(Prec1, Prec2)))
## `summarise()` has grouped output by 'Site'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: Site [4]
## Site min.temp min.prec
## <chr> <dbl> <dbl>
## 1 Site1 10 3
## 2 Site2 4 2
## 3 Site3 5 2
## 4 Site4 4 6
This is reminiscent of the apply()
function, which can be applied on
rows (1
) and/or columns (2
), through the MARGIN
argument.
# remove the first column, as it is a factor
apply(data.row[, -1], MARGIN = 1, min)
## [1] 3 2 2 4
apply(data.row[, -1], MARGIN = 2, min)
## Temp1 Temp2 Temp3 Temp4 Prec1 Prec2
## 5 4 12 4 3 2
In the preceding examples we have performed summaries of temperature of
precipitation on a row-by-row manner (rowwise
). This is because we
have the same variable spread across multiple columns. Temperature is
spread across 3 columns, precipitation across 2 columns. Moreover, the
day in which the measure was taken is also spread, in a somewhat cryptic
way.
But what if we could re-arrange the data so that the calculation of summaries would be more straightforward? Moreover, imagine you are interested in a graph of how temperature had fluctuated across sites during those 4 days. This is difficult given the current structure of the data. What would help us is a way to re-arrange the data, such that information of a same kind (day of measurement and type of environmental variable) are in a single column.
Introducing pivot longer
, a function that increases the number of rows
and decreases the number of columns. You need to specify cols
: columns
that contain data to be pivotted to longer format (tidy selection
available), names_to
: the name of the new column that will store the
column names of the selected columns, values_to
: name of the column
that will store the values of the pivotted columns. In case you are
wondering, yes, there is a function that does the opposite:
pivot_wider()
.
You can see nice animations of the pivoting process
here.
(data.long <-
data.row %>%
pivot_longer(cols = starts_with(c("Temp", "Prec")),
values_to = "Value",
names_to = "temporary"))
## # A tibble: 24 × 3
## Site temporary Value
## <chr> <chr> <dbl>
## 1 Site1 Temp1 10
## 2 Site1 Temp2 11
## 3 Site1 Temp3 12
## 4 Site1 Temp4 10
## 5 Site1 Prec1 3
## 6 Site1 Prec2 4
## 7 Site2 Temp1 5
## 8 Site2 Temp2 6
## 9 Site2 Temp3 14
## 10 Site2 Temp4 4
## # … with 14 more rows
We still have a problem: the day in which the measure was taken is not
separated from the environmental variable. We can use separate()
,
which splits one column into multiple columns, by splitting wherever a
separator character appears, or by position (through positive or
negative numerical values). This function takes the name of the column
to separate, and the names of the columns to separate into.
(data.long2 <-
data.long %>%
# use numerical values: negative values start on the far right of the strings
separate(col = temporary, into = c("Variable", "Day"), sep = -1))
## # A tibble: 24 × 4
## Site Variable Day Value
## <chr> <chr> <chr> <dbl>
## 1 Site1 Temp 1 10
## 2 Site1 Temp 2 11
## 3 Site1 Temp 3 12
## 4 Site1 Temp 4 10
## 5 Site1 Prec 1 3
## 6 Site1 Prec 2 4
## 7 Site2 Temp 1 5
## 8 Site2 Temp 2 6
## 9 Site2 Temp 3 14
## 10 Site2 Temp 4 4
## # … with 14 more rows
In a long format, obtaining summaries per site or day is quite straightfoward.
# find lowest temperature across sites
data.long2 %>%
filter(Variable == "Temp") %>%
group_by(Site) %>%
summarise(min = min(Value))
## # A tibble: 4 × 2
## Site min
## <chr> <dbl>
## 1 Site1 10
## 2 Site2 4
## 3 Site3 5
## 4 Site4 4
# find lowest temperature across days
data.long2 %>%
filter(Variable == "Temp") %>%
group_by(Day) %>%
summarise(min = min(Value))
## # A tibble: 4 × 2
## Day min
## <chr> <dbl>
## 1 1 5
## 2 2 4
## 3 3 12
## 4 4 4
The long format is also better for plotting. The following graph, showing how temperature had fluctuated across sites during those 4 days, would be difficult to draw based on the original data.
ggplot(data = filter(data.long2, Variable == "Temp"), aes(x = Day, y = Value, group = Site, col = Site)) +
geom_point() +
geom_line()
A database might contain empty rows that represent that the previous
entry hasn’t changed. For instance, the following made-up data frame
records the abundance of some trees across locations, and the blank
space indicates that the species hasn’t changed. The tidyr::fill()
function takes a data.frame, a selection of columns to fill (you may use
tidy selection). The .direction
argument specifies how the missing, in
this case we should specify down
, but there are other options.
# Wrapping the code in parentheses prints the created object
(data.fill <-
data.frame(species = c("Ailanthus altissima", NA, NA, "Robinia pseudoacacia", NA, "Sambucus nigra", NA),
site = c("Site1", "Site2", "Site3", "Site1", "Site4", "Site2", "Site3"),
abundance = c(1:7)))
## species site abundance
## 1 Ailanthus altissima Site1 1
## 2 <NA> Site2 2
## 3 <NA> Site3 3
## 4 Robinia pseudoacacia Site1 4
## 5 <NA> Site4 5
## 6 Sambucus nigra Site2 6
## 7 <NA> Site3 7
# fill the missing values downwards (the last non-missing values is carried down)
data.fill %>%
tidyr::fill(species, .direction = "down")
## species site abundance
## 1 Ailanthus altissima Site1 1
## 2 Ailanthus altissima Site2 2
## 3 Ailanthus altissima Site3 3
## 4 Robinia pseudoacacia Site1 4
## 5 Robinia pseudoacacia Site4 5
## 6 Sambucus nigra Site2 6
## 7 Sambucus nigra Site3 7
Depending on the type of data operations and analyses you are performing, you might have come across situations in which you are concerned about how much time it takes to run some code. Maybe you are dealing with large databases, or have to perform the same operation many times. Or maybe you’ve read about a packge or function that claims to be faster than the package or funciton you are currently using.
How to test for yourself whether you can benefit from that alleged
greater speed? You can use the microbenchmark
package to compare the
time it takes to run a set of expressions. If the package multcomp
is
available, a statistical ranking based on pairwise differences is
displayed in the cld
column (compact letter display).
Let’s compare the running time of if_else
(tidyverse) and ifelse()
(base). According to the documentation, the tidyverse function is more
strict, checking that the true
and false
are the same type. As a
result, it is faster. Let’s test this claim with some randomly generated
vectors.
# create vectors of different length to compare the running time between ifelse() and if_else()
# runif: random uniform distribution, which you can use to generate a vector of random numbers of a specified length
short.vector <- runif(n = 100, min = 0, max = 10^6)
long.vector <- runif(n = 10^6, min = 0, max = 10^6)
microbenchmark::microbenchmark(
ifelse(
test = short.vector >= 1000,
yes = "greater",
no = "lower"
),
if_else(
condition = short.vector >= 1000,
true = "greater",
false = "lower"
),
ifelse(
test = long.vector >= 1000,
yes = "greater",
no = "lower"
),
if_else(
condition = long.vector >= 1000,
true = "greater",
false = "lower"
),
# Number of times to evaluate the expression
times = 20,
# Specify the unit in which the running time will printed. I think the default units are nanoseconds, though I am not sure
unit = "s"
)
## Unit: seconds
## expr
## ifelse(test = short.vector >= 1000, yes = "greater", no = "lower")
## if_else(condition = short.vector >= 1000, true = "greater", false = "lower")
## ifelse(test = long.vector >= 1000, yes = "greater", no = "lower")
## if_else(condition = long.vector >= 1000, true = "greater", false = "lower")
## min lq mean median uq max
## 0.000025701 0.0000321010 4.316605e-05 0.0000408020 0.0000553005 0.000062100
## 0.000025102 0.0000494515 6.209130e-05 0.0000647015 0.0000704015 0.000142401
## 0.202750001 0.2064859005 2.205115e-01 0.2090442010 0.2148032505 0.323478501
## 0.034671602 0.0410708505 6.129297e-02 0.0452847010 0.0492832005 0.165058601
## neval cld
## 20 a
## 20 a
## 20 c
## 20 b
To assess the consistency of our results, let’s compare the two
functions again, this time using letters instead of numbers. The
letters
object contains the 26 lower-case letters of the Roman
alphabet.
short.letters <- rep(letters, 100)
long.letters <- rep(letters, 10^5)
microbenchmark::microbenchmark(
ifelse(
test = short.letters == "a",
yes = "first letter",
no = "other"
),
if_else(
condition = short.letters == "a",
true = "first letter",
false = "other"
),
ifelse(
test = long.letters == "a",
yes = "first letter",
no = "other"
),
if_else(
condition = long.letters == "a",
true = "first letter",
false = "other"
),
times = 20,
unit = "s"
)
## Unit: seconds
## expr
## ifelse(test = short.letters == "a", yes = "first letter", no = "other")
## if_else(condition = short.letters == "a", true = "first letter", false = "other")
## ifelse(test = long.letters == "a", yes = "first letter", no = "other")
## if_else(condition = long.letters == "a", true = "first letter", false = "other")
## min lq mean median uq max
## 0.000511301 0.0005368505 0.0005543260 0.0005479005 0.000574001 0.000606101
## 0.000099800 0.0001222010 0.0006600707 0.0001526510 0.000165051 0.010432401
## 0.542050801 0.5535287510 0.5831043410 0.5604768010 0.578221551 0.695208601
## 0.106805101 0.1091542520 0.1390867409 0.1214569015 0.126815701 0.237438200
## neval cld
## 20 a
## 20 a
## 20 c
## 20 b
https://stackoverflow.com/questions/68460102/mutate-inside-a-function-object-not-found https://stackoverflow.com/questions/26003574/use-dynamic-variable-names-in-dplyr https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html#functions-that-take-columns-of-data-as-arguments https://b-rodrigues.github.io/modern_R/defining-your-own-functions.html#functions-that-take-columns-of-data-as-arguments