library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Note that in R, there are usually multiple ways to tackle the same problem. Subsetting, filtering, renaming, etc., can also be achieved via base R.
- R for data science (Wickham&Grolemund, 2017): a good guide to the tidyverse. Equally importantly, it can teach you about the phylosophy of tidy data, and the general “know-how” of data processing and analysis. The book has plenty of exercises for you to build “coding muscle”. You can find unofficial solutions tot he exercises here.
- Cheatsheets covering tidyverse topics, which you can find online, or in RStudio itself (Help > Cheatsheets).
- Crawley (2013): complete reference for basic coding, mathematics, statistics, etc.
I will demonstrate most of the tidyverse functions using the mtcars
dataset, containing 11 aspects of automobile design and performance for
32 automobiles. Type ?mtcars
if you want to learn more about this data
frame.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Note that the mtcars
dataset is a well-behaved, tidy dataset. In the
real world, datasets are usually messy (sometimes through the fault of
others, sometimes through our own). They might contain missing cases
and/or duplicated observations and/or weird formats and/or whatever
problem you can imagine.
Since untidy datasets are so prevalent, we will work with a more
realistic dataset: I will add NA
, NaN
and Inf
. In order to
demonstrate column selection based on the type of information stored in
the column, I add a column with characters (the original dataset
contains only columns with numbers).
# For simplicity and clarity, let's keep only a few columns, and a few rows
data <- mutate(mtcars, something = "a") %>%
dplyr::select(mpg, cyl, disp, carb, something) %>%
dplyr::slice(1:10)
data[c(1:2, 5), 1] <- NA
data[c(2:3, 6), 2] <- NaN
data[4:6, 3] <- Inf
data
## mpg cyl disp carb something
## Mazda RX4 NA 6 160.0 4 a
## Mazda RX4 Wag NA NaN 160.0 4 a
## Datsun 710 22.8 NaN 108.0 1 a
## Hornet 4 Drive 21.4 6 Inf 1 a
## Hornet Sportabout NA 8 Inf 2 a
## Valiant 18.1 NaN Inf 1 a
## Duster 360 14.3 8 360.0 4 a
## Merc 240D 24.4 4 146.7 2 a
## Merc 230 22.8 4 140.8 2 a
## Merc 280 19.2 6 167.6 4 a
I will demonstrate functions using pipes: %>%
. The pipe %>%
passes
the object on the left to the first argument of the function on the
right. The dot .
can be used to specify the argument to which the
object is passed. Check the pull()
function lower in the script for an
example of the use of the dot as placeholder.
Throughout the script, I will demonstrate the effect of the different
functions by using the structure()
function, or str()
for short.
Instead of writing: str(data)
, I will use the pipe:
data %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
Also note, the following functions will be demonstrated with a data frame, and there are additional formats for tabular data, such as tibble and data.table.
The dplyr::select()
takes a data frame and returns a data frame (or
takes a tibble and returns a tibble, etc), allowing to select columns by
name and/or numerical position. If you select a singe column, you will
get a data frame with a single column, not a vector (see the pull()
function below to get vectors from data frames). Note that the order in
which variables are selected is preserved in the resulting data frame.
data %>%
dplyr::select(mpg) %>%
str()
## 'data.frame': 10 obs. of 1 variable:
## $ mpg: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
# use the minus sign to remove variables
data %>%
dplyr::select(- mpg) %>%
str()
## 'data.frame': 10 obs. of 4 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# use c() to (de)select multiple columns
data %>%
dplyr::select(-c(mpg, cyl)) %>%
str()
## 'data.frame': 10 obs. of 3 variables:
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# Use numerical position of columns. 1:3 is interpreted as range of consecutive values: from 1 to 3 including all integers in between
data %>%
dplyr::select(1:3) %>%
str()
## 'data.frame': 10 obs. of 3 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp: num 160 160 108 Inf Inf ...
# Remove columns base on position
data %>%
dplyr::select(-c(1:3)) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# Combine numbers and names
data %>%
dplyr::select(1, disp) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ disp: num 160 160 108 Inf Inf ...
The use of dplyr::select()
can be enhanced through selection helpers,
of which you can see a non-exhaustive selection below. To learn more
about selection of variables: ?dplyr_tidy_select
Note that the default
setting is ignore.case = TRUE
, which means the default behavior is not
case-sensitive.
data %>%
dplyr::select(everything()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
data %>%
dplyr::select(starts_with("c")) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ carb: num 4 4 1 1 2 1 4 2 2 4
data %>%
dplyr::select(ends_with("c")) %>%
str()
## 'data.frame': 10 obs. of 0 variables
data %>%
dplyr::select(contains("c")) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ carb: num 4 4 1 1 2 1 4 2 2 4
data %>%
dplyr::select(contains("C")) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ carb: num 4 4 1 1 2 1 4 2 2 4
# use the ignore.case argument to specify a case-sensitive behaviour
data %>%
dplyr::select(contains("C", ignore.case = FALSE)) %>%
str()
## 'data.frame': 10 obs. of 0 variables
In some cases, you might store the variables of interest in a character
vector. The functions all_of()
to any_of()
, get the work done, but
have slighlty different properties. The all_of()
throws error if at
least one element in the character vector is absent from the dataset
my_selection <- c("mpg", "disp")
my_other_selection <- c("mpg", "gangman_style")
# all_of() for strict selection
data %>%
dplyr::select(all_of(my_selection)) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ disp: num 160 160 108 Inf Inf ...
# use the minus sign to remove a selection
data %>%
dplyr::select(-all_of(my_selection)) %>%
str()
## 'data.frame': 10 obs. of 3 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# we get an error message if the character vector contains names that are absent from the column names
data %>%
dplyr::select(-all_of(my_other_selection)) %>%
str()
## Error in `dplyr::select()`:
## ! Can't subset columns that don't exist.
## ✖ Column `gangman_style` doesn't exist.
On the other hand, any_of()
doesn’t check for missing variables.
data %>%
dplyr::select(any_of(my_selection)) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ disp: num 160 160 108 Inf Inf ...
# Won't give error message if a name is missing
data %>%
dplyr::select(any_of(my_other_selection)) %>%
str()
## 'data.frame': 10 obs. of 1 variable:
## $ mpg: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
# use minus sign to remove variables
data %>%
dplyr::select(-any_of(my_other_selection)) %>%
str()
## 'data.frame': 10 obs. of 4 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
You can also select columns using functions: the where()
selection
helper selects the variables for which a function returns TRUE
.
Specifically, where()
takes as argument a function that returns TRUE
or FALSE
, or purr-like formula (read the documentation to find out
more). Mind that since we are dealing with logical operations, you
should use the exclamation mark !
rather than the minus -
if you
want to use where()
to remove (= de-select) columns. You could use
that to select only columns belonging to a certain type, or that have a
mean above a threshold (see documentation for an example).
# keep only columns of type character
data %>%
dplyr::select(where(is.character)) %>%
str()
## 'data.frame': 10 obs. of 1 variable:
## $ something: chr "a" "a" "a" "a" ...
# keep columns that are not character
data %>%
dplyr::select(!where(is.character)) %>%
str()
## 'data.frame': 10 obs. of 4 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp: num 160 160 108 Inf Inf ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4
# keep columns that are character and another variable (selected by name in this instance)
data %>%
dplyr::select(where(is.character), mpg) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ something: chr "a" "a" "a" "a" ...
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
# Note that if we reverse the order of the selection, the order of the columns changes
data %>%
dplyr::select(mpg, where(is.character)) %>%
str()
## 'data.frame': 10 obs. of 2 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ something: chr "a" "a" "a" "a" ...
While the dplyr::select()
function takes a data frame and returns a
data frame, the pull()
function takes a data frame and returns a
vector. This makes it quite similar to the dollar operator $
. You can
also use the dollar operator in pipes, relying on the dot .
as
placeholder. Remember, in R, the same thing can be achieved in many
different ways.
# 4 different ways to get the same vector from a data.frame
data %>%
pull(mpg)
## [1] NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
data$mpg
## [1] NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
data %>%
.$mpg
## [1] NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
data %>%
dplyr::select(mpg) %>%
unlist() %>%
as.numeric()
## [1] NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
The mutate function will be one of your best friends. To be brief, it
creates new variables (i.e. columns). To be less brief, it takes a data
frame, and returns a data frame with columns that are transformed in
some way. Note that the examples below are concerned with creating new
variables operating on the columns, to create new variables operating on
rows (such as creating summaries per group such as the mean or the
number of observations, among others), you might want to look into
group_by
, summarise
(summarize
is a synonym), rowwise
, etc.
# Sum numerical variables
data %>%
mutate(pointless_sum = mpg + cyl + disp + carb) %>%
print()
## mpg cyl disp carb something pointless_sum
## Mazda RX4 NA 6 160.0 4 a NA
## Mazda RX4 Wag NA NaN 160.0 4 a NaN
## Datsun 710 22.8 NaN 108.0 1 a NaN
## Hornet 4 Drive 21.4 6 Inf 1 a Inf
## Hornet Sportabout NA 8 Inf 2 a NA
## Valiant 18.1 NaN Inf 1 a NaN
## Duster 360 14.3 8 360.0 4 a 386.3
## Merc 240D 24.4 4 146.7 2 a 177.1
## Merc 230 22.8 4 140.8 2 a 169.6
## Merc 280 19.2 6 167.6 4 a 196.8
# Don't be shy to put transformations in the sum!
data %>%
mutate(even_more_pointless_sum = mpg + log10(cyl) + sqrt(disp)) %>%
print()
## mpg cyl disp carb something even_more_pointless_sum
## Mazda RX4 NA 6 160.0 4 a NA
## Mazda RX4 Wag NA NaN 160.0 4 a NaN
## Datsun 710 22.8 NaN 108.0 1 a NaN
## Hornet 4 Drive 21.4 6 Inf 1 a Inf
## Hornet Sportabout NA 8 Inf 2 a NA
## Valiant 18.1 NaN Inf 1 a NaN
## Duster 360 14.3 8 360.0 4 a 34.17676
## Merc 240D 24.4 4 146.7 2 a 37.11404
## Merc 230 22.8 4 140.8 2 a 35.26798
## Merc 280 19.2 6 167.6 4 a 32.92419
# Apply a function, such as a logarithm
data %>%
mutate(mpg_log10 = log10(disp)) %>%
print()
## mpg cyl disp carb something mpg_log10
## Mazda RX4 NA 6 160.0 4 a 2.204120
## Mazda RX4 Wag NA NaN 160.0 4 a 2.204120
## Datsun 710 22.8 NaN 108.0 1 a 2.033424
## Hornet 4 Drive 21.4 6 Inf 1 a Inf
## Hornet Sportabout NA 8 Inf 2 a Inf
## Valiant 18.1 NaN Inf 1 a Inf
## Duster 360 14.3 8 360.0 4 a 2.556303
## Merc 240D 24.4 4 146.7 2 a 2.166430
## Merc 230 22.8 4 140.8 2 a 2.148603
## Merc 280 19.2 6 167.6 4 a 2.224274
# Change the type of variable
data %>%
mutate(mpg = as.character(mpg)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : chr NA NA "22.8" "21.4" ...
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# Add row numbers
data %>%
mutate(row.n = row_number()) %>%
str()
## 'data.frame': 10 obs. of 6 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
## $ row.n : int 1 2 3 4 5 6 7 8 9 10
But what if we wanted to apply a function to many columns at once?
across(.cols = everything(), .fns)
to the rescue! You can combine with
the helpers for tidy selection you have seen in the previous section.
# apply a function to all columns (default behavior)
data %>%
mutate(across(.fns = as.character)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : chr NA NA "22.8" "21.4" ...
## $ cyl : chr "6" "NaN" "NaN" "6" ...
## $ disp : chr "160" "160" "108" "Inf" ...
## $ carb : chr "4" "4" "1" "1" ...
## $ something: chr "a" "a" "a" "a" ...
# apply a function to the columns whose name meets a condition
data %>%
mutate(across(.cols = starts_with("c"), .fns = as.character)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : chr "6" "NaN" "NaN" "6" ...
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : chr "4" "4" "1" "1" ...
## $ something: chr "a" "a" "a" "a" ...
# apply a function to columns selected by name
data %>%
mutate(across(.cols = c(mpg, cyl, disp), .fns = sqrt)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 4.77 4.63 NA ...
## $ cyl : num 2.45 NaN NaN 2.45 2.83 ...
## $ disp : num 12.6 12.6 10.4 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# write a function yourself
data %>%
mutate(across(.cols = !where(is.character), .fns = function(x) x/10 + 5)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 7.28 7.14 NA 6.81 6.43 7.44 7.28 6.92
## $ cyl : num 5.6 NaN NaN 5.6 5.8 NaN 5.8 5.4 5.4 5.6
## $ disp : num 21 21 15.8 Inf Inf ...
## $ carb : num 5.4 5.4 5.1 5.1 5.2 5.1 5.4 5.2 5.2 5.4
## $ something: chr "a" "a" "a" "a" ...
data %>%
mutate(across(.cols = !where(is.character), .fns = function(x) round(x/10 + 5, digits = 0))) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 7 7 NA 7 6 7 7 7
## $ cyl : num 6 NaN NaN 6 6 NaN 6 5 5 6
## $ disp : num 21 21 16 Inf Inf ...
## $ carb : num 5 5 5 5 5 5 5 5 5 5
## $ something: chr "a" "a" "a" "a" ...
# apply a function and provide a regular name for the resulting variables
data %>%
mutate(across(.cols = c(mpg, cyl, disp), .fns = sqrt, .names = "{.col}.sqrt")) %>%
str()
## 'data.frame': 10 obs. of 8 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
## $ mpg.sqrt : num NA NA 4.77 4.63 NA ...
## $ cyl.sqrt : num 2.45 NaN NaN 2.45 2.83 ...
## $ disp.sqrt: num 12.6 12.6 10.4 Inf Inf ...
Sometimes we might wish to transform a certain value into NA
. You may
use na_if(x, y)
for that, which comes with two arguments: a vector to
modify (x
) and the value to replace with NA
(y
).
# check the variables cyl and carb: the number 4 has been replaced by NA
# the second argument to na_if (y) is separated by a comma, although you could avoid that by writing it as a function
data %>%
mutate(across(.cols = starts_with("c"), .fns = na_if, y = 4)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 NA NA 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num NA NA 1 1 2 1 NA 2 2 NA
## $ something: chr "a" "a" "a" "a" ...
data %>%
mutate(across(.cols = starts_with("c"), .fns = function(whatev) na_if(x = whatev, y = 4))) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 NA NA 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num NA NA 1 1 2 1 NA 2 2 NA
## $ something: chr "a" "a" "a" "a" ...
In some other cases, we might want to replace NA
with a certain value.
You may use na_replace(data, replace)
for that, which comes with two
arguments: data frame or vector (data
) and a list of values or a
single value to replace the NA
, depending on whether the input is a
data frame or vector (replace
).
# note that there might be easier, more elegant, and more tidy-versy ways to calculate how many values are NA per column
data %>%
apply(2, is.na) %>%
apply(2, sum)
## mpg cyl disp carb something
## 3 3 0 0 0
# replace those NA with 9000. Note the second argument is separated by a comma
data %>%
mutate(across(.cols = 1:3, .fns = replace_na, 9000)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num 9000 9000 22.8 21.4 9000 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 9000 9000 6 8 9000 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
The arrange()
function takes a data frame, and returns a data frame in
which the rows are sorted in ascending order (from small to big) based
on some column(s). To sort in descending order (from big to small),
combine arrange()
with the desc()
function:
arrange(data, desc(column))
.
Note that NA
are sorted to the end!
The base function sort()
has a slightly different behavior
# sort data frame in ascending order of miles per gallon (mpg)
data %>%
arrange(mpg)
## mpg cyl disp carb something
## Duster 360 14.3 8 360.0 4 a
## Valiant 18.1 NaN Inf 1 a
## Merc 280 19.2 6 167.6 4 a
## Hornet 4 Drive 21.4 6 Inf 1 a
## Datsun 710 22.8 NaN 108.0 1 a
## Merc 230 22.8 4 140.8 2 a
## Merc 240D 24.4 4 146.7 2 a
## Mazda RX4 NA 6 160.0 4 a
## Mazda RX4 Wag NA NaN 160.0 4 a
## Hornet Sportabout NA 8 Inf 2 a
# sort data frame in descending order of miles per gallon (mpg)
data %>%
arrange(desc(mpg))
## mpg cyl disp carb something
## Merc 240D 24.4 4 146.7 2 a
## Datsun 710 22.8 NaN 108.0 1 a
## Merc 230 22.8 4 140.8 2 a
## Hornet 4 Drive 21.4 6 Inf 1 a
## Merc 280 19.2 6 167.6 4 a
## Valiant 18.1 NaN Inf 1 a
## Duster 360 14.3 8 360.0 4 a
## Mazda RX4 NA 6 160.0 4 a
## Mazda RX4 Wag NA NaN 160.0 4 a
## Hornet Sportabout NA 8 Inf 2 a
The dplyr::relocate()
function takes a data frame, and returns a data
frame with reordered columns. Use either .before
or .after
arguments
to indicate the position (relative to some column) where you want to
relocate your column(s). Combine .before
or .after
with last_col()
to specify a relocated position relative to the last column. You can use
tidy selection functions introduced in previous sections.
data %>%
dplyr::relocate(mpg, .after = cyl) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# Note the use of last_col() and the difference between .after and .before
data %>%
dplyr::relocate(mpg, .after = last_col()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
data %>%
dplyr::relocate(mpg, .before = last_col()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ something: chr "a" "a" "a" "a" ...
# combine this with tidy selection
data %>%
dplyr::relocate(starts_with("c"), .after = last_col()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ disp : num 160 160 108 Inf Inf ...
## $ something: chr "a" "a" "a" "a" ...
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ carb : num 4 4 1 1 2 1 4 2 2 4
The rename()
takes a data frame, and returns a data frame with changed
column names, like this: new_name = old_name
.
data %>%
rename(miles.per.gallon = mpg) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ miles.per.gallon: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something : chr "a" "a" "a" "a" ...
You may use quotations to handle names that contain blank spaces, although it is not advisable to include blank spaces in variables names
data %>%
rename("miles per gallon" = mpg) %>%
str()
# this won't work
data %>%
rename(miles per gallon = mpg) %>%
str()
## Error: <text>:7:16: unexpected symbol
## 6: data %>%
## 7: rename(miles per
## ^
This can be enhanced with rename_with(.cols = everything(), .fn)
and
tidy selection, and applying regular expressions for the manipulation of
strings.
# change all names to uppercase, .cols defaults to everything()
data %>%
rename_with(.fn = toupper) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ MPG : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ CYL : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ DISP : num 160 160 108 Inf Inf ...
## $ CARB : num 4 4 1 1 2 1 4 2 2 4
## $ SOMETHING: chr "a" "a" "a" "a" ...
# this alternative syntax returns the same result
data %>%
rename_with(.fn = ~toupper(.)) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ MPG : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ CYL : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ DISP : num 160 160 108 Inf Inf ...
## $ CARB : num 4 4 1 1 2 1 4 2 2 4
## $ SOMETHING: chr "a" "a" "a" "a" ...
# change a few selected names to uppercase
data %>%
rename_with(.cols = starts_with("c"), .fn = toupper) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ CYL : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ CARB : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# use regular expressions
data %>%
rename_with(.cols = everything(), .fn = ~ str_replace(string = ., pattern = "a", replacement = "AAA")) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ cAAArb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
# alternative syntax with the same result
data %>%
rename_with(.cols = everything(), .fn = str_replace, pattern = "a", replacement = "AAA") %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ mpg : num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ cAAArb : num 4 4 1 1 2 1 4 2 2 4
## $ something: chr "a" "a" "a" "a" ...
The select()
function can also be used to change the name of selected
columns, similarly to rename
, specifying new_name = old_name
.
data %>%
dplyr::select(miles.per.galon = mpg) %>%
str()
## 'data.frame': 10 obs. of 1 variable:
## $ miles.per.galon: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
data %>%
dplyr::select(miles.per.galon = mpg, everything()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ miles.per.galon: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something : chr "a" "a" "a" "a" ...
data %>%
dplyr::select("miles per galon" = mpg, everything()) %>%
str()
## 'data.frame': 10 obs. of 5 variables:
## $ miles per galon: num NA NA 22.8 21.4 NA 18.1 14.3 24.4 22.8 19.2
## $ cyl : num 6 NaN NaN 6 8 NaN 8 4 4 6
## $ disp : num 160 160 108 Inf Inf ...
## $ carb : num 4 4 1 1 2 1 4 2 2 4
## $ something : chr "a" "a" "a" "a" ...
For a complete list of shortcuts: Tools > Keyboard Shortcuts Help. From the whole range of shortcuts, I highlight the following:
Shortcut | Result |
---|---|
Control + Enter | Run code (line or selection) |
Control + Shift + r | Create and name section |
Control + Shift + a | Format code (line or selection) |
Control + Shift + c | Comment code (line or selection) |
Control + Shift + m | Write pipe %>% |
ts + Tabulator + Enter | Create a section with the date and hour |
Here’s an extra tip: when you create an object, wrap the line in
parentheses to print the created object. This removes the need to use an
additional line of code to print()
the object, since objects are not
printed when they are created.
sample.df <- data.frame(Var1 = c(1, 2, 3), Var2 = c(1, 2, 3))
(sample.df <- data.frame(Var1 = c(1, 2, 3), Var2 = c(1, 2, 3)))
## Var1 Var2
## 1 1 1
## 2 2 2
## 3 3 3
I like to start my scripts indicating when the script was created, as well as adding timestamps if the code has been revised substantially. Moreover, creating numbered sections with informative titles helps a lot to give order and order to the code.
- Create time stamps (
ts + Tabulator + Enter
). This will create a section with the date and hour - Create section (
Control + Shift + r
). Note that sections can be created by writing 5 or more #
Moreover, it can be a good idea to follow a consistent order in all
scripts. For instance, start the script by loading libraries and files.
In addition, you may store the libraries, and even paths to the
different files, in a separate scripts, and run them at the beginning of
the code through the source()
function.