-
Notifications
You must be signed in to change notification settings - Fork 32
/
index.Rmd
244 lines (202 loc) · 13.5 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
---
title: "Modern R with the tidyverse"
author: "Bruno Rodrigues"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
#documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
description: "This book will teach you how to use R to solve your statistical, data science and machine learning problems. Importing data, computing descriptive statistics, running regressions (or more complex machine learning models) and generating reports are some of the topics covered. No previous experience with R is needed."
---
# Preface {-}
```{r, include=FALSE}
library("tidyverse")
library("broom")
library("ggthemes")
library("janitor")
library("mlbench")
library("mlrMBO")
library("lhs")
library("tidymodels")
library("glmnet")
library("margins")
library("marginaleffects")
library("ggfortify")
library("Ecdat")
library("stopwords")
library("rio")
```
## Note to the reader {-}
I have been working on this on and off for the past 4 years or so. In 2022, I have updated the
contents of the book to reflect updates introduced with R 4.1 and in several packages (especially
those from the `{tidyverse}`). I have also cut some content that I think is not that useful,
especially in later chapters.
This book is still being written. Chapters 1 to 8 are almost ready, but more content is being added
(especially to chapter 8). 9 and 10 are empty for now. Some exercises might be at the wrong place
too and more are coming.
You can purchase an ebook version of this book on [leanpub](https://leanpub.com/modern_tidyverse).
The version on leanpub is quite out of date, so if you buy it, it's really just to send some money
my way, so many thanks for that! You can also support me by [buying me a
coffee](https://www.buymeacoffee.com/brodriguesco) or
[paypal.me](https://www.paypal.me/brodriguesco).
## What is R? {-}
Read R's official answer to this question
[here](https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-is-R_003f). To make it short: R is a
multi-paradigm (procedural, imperative, object-oriented and functional)^[In this book we are going
to focus on R's functional programming capabilities] programming language that
focuses on applications in *statistics*. By *statistics* I mean any field that uses statistics such
as official statistics, economics, finance, data science, machine learning, etc. For the sake of
simplicity, I will use the word "statistics" as a general term that encompasses all these fields and
disciplines for the remainder of this book.
## Who is this book for? {-}
This book can be useful to different audiences. If you have never used R in your life, and want
to start, start with Chapter 1 of this book. Chapter 1 to 3 are the very basics, and should be
easy to follow up to Chapter 7.
Starting with Chapter 7, it gets more technical, and will be harder to follow. But I suggest
you keep on going, and do not hesitate to contact me for help if you struggle! Chapter 7
is also where you can start if you are already familiar with R **and** the `{tidyverse}`, but not
functional programming. If you are familiar with R but not the `{tidyverse}` (or have no clue
what the `{tidyverse}` is), then you can start with Chapter 4. If you are familiar with R, the
`{tidyverse}` and functional programming, you might still be interested in this book, especially
Chapter 9 and 10, which deal with package development and further advanced topics respectively.
## Why this book? {-}
This book is first and foremost for myself. This book is the result of years of using and teaching
R at university and then at my jobs. During my university time, I wrote some notes to help me
teach R and which I shared with my students. These are still the basis of Chapter 2. Then, once
I had left university, and continued using R at my first "real" job, I wrote another book that
dealt mostly with package development and functional programming. This book is now merged to this
one and is the basis of Chapters 9 and 10. During these years at my first
job, I was also tasked with teaching R. By that time, I was already quite familiar with the
`{tidyverse}` so I wrote a lot of notes that were internal and adapted for the audience of my
first job. These are now the basis of Chapters 3 to 8.
Then, during all these years, I kept blogging about R, and reading blogs and further books. All
this knowledge is condensed here, so if you are familiar with my blog, you'll definitely recognize
a lot of my blog posts in here. So this book is first and foremost for me, because I need to write
all of this down in a central place. So because my target audience is myself, this book is free. If
you find it useful, and are in the mood of buying me a coffee, you can, but if this book is not
useful to you, no harm done (unless you paid for it before reading it, in which case, I am sorry
to have wasted your time). But I am quite sure you'll find some of the things written here useful,
regardless of your current experience level with R.
## Why *modern* R? {-}
*Modern* R instead of "just" R because we are going to learn how to use modern packages (mostly
those from the [tidyverse](https://www.tidyverse.org/)) and concepts, such as functional
programming (which is quite an old concept actually, but one that came into fashion recently). R is
derived from S, which is a programming language that has roots in FORTRAN and other languages too.
If you learned R at university, you've probably learned to use it as you would have used FORTRAN;
very long scripts where data are represented as matrices and where row-wise (or column-wise)
operations are implemented with `for` loops. There's nothing wrong with that, mind you, but R
was also influenced by Scheme and Common Lisp, which are functional programming languages.
In my opinion, functional programming is a programming paradigm that works really well when dealing
with statistical problems. This is because programming in a functional style is just like
writing math. For instance, suppose you want to sum all the elements of a vector. In mathematical
notation, you would write something like:
\[
\sum_{i = 1}^{100} x_{i}
\]
where $x$ is a vector of length 100. Solving this using a loop would look something like this:
```{r, eval=FALSE}
res <- 0
for(i in 1:length(x)){
res <- x[i] + res
}
```
This does not look like the math notation at all! You have to define a variable that will hold
the result outside of the loop, and then you have to define `res` as something plus `res` inside
the body of the loop. This is really unnatural. The functional programming approach is much
easier:
```{r, eval=FALSE}
Reduce(`+`, x)
```
We will learn about `Reduce()` later (to be more precise, we will learn about `purrr::reduce()`,
the "tidy" version of `Reduce()`), but already you see that the notation looks a lot more
like the mathematical notation.
At its core, functional programming uses functions, and functions are so-called *first
class* objects in R, which means that there is nothing special about them... you can pass them to
other functions, create functions that return functions and do any kind of operation on them just as
with any other object. This means that functions in R are extremely powerful and flexible tools.
In the first part of the book, we are going to use functions that are already available in R, and
then use those available in packages, mostly those from the `tidyverse`. The `tidyverse` is a
collection of packages developed by [Hadley Wickham](http://hadley.nz/), and several of his colleagues
at RStudio, Inc. By using the packages from the `tidyverse` and R's built-in functional programming
capabilities, we can write code that is faster and easier to explain to colleagues, and also easier
to maintain. This also means that you might have to change your expectations and what you know
already from R, if you learned it at University but haven't touched it in a long time. For example
for and while loops, are relegated to chapter 8. This does not mean that you will have to wait for
8 chapter to know how to repeat instructions *N* times, but that *for* and *while* loops are tools that
are very useful for very specific situations that will be discussed at that point.
In the second part of the book, we are going to move from using R to solve statistical problems to
developing with R. We are going to learn about creating your own package. If you do not know what
packages are, don't worry, this will be discussed just below.
## What is RStudio? {-}
RStudio is a modern IDE that makes writing R code easier. The first thing we are going to learn is
how to use it.
R and RStudio are both open source: this means that the source code is freely available on
the internet and contributions by anyone are welcome and integrated; provided they are meaningful
and useful.
## What to expect from this book? {-}
The idea of Chapters 1 to 7 is to make you efficient with R as quickly as possible, especially if
you already have prior programming knowledge. Starting with Chapter 8 you will learn more advanced
topics, especially programming with R. R is a programming language, and you can't write
"programming language" without "language". And just as you wouldn't expect to learn
French, Portuguese or Icelandic by reading a single book, you shouldn't expect to become fluent in R
by reading a single book, not even by reading 10 books. Programming is an art which requires a lot of
practice. [Teach yourself programming in 10 years](http://www.norvig.com/21-days.html) is a blog
post written by Peter Norvig which explains that just as with any craft, mastering programming
takes time. And even if you don't need or want to become an expert in R, if you wish to use R
effectively and in a way that ultimately saves you time, you need to have some fluency in it, and
this only comes by continuing to learn about the language, and most importantly practicing. If you
keep using R every day, you'll definitely become very fluent. To stay informed about developments of
the language, and the latest news, I advise you read blogs, especially
[R-bloggers](https://www.r-bloggers.com/) which aggregates blog posts by more than 750 blogs
discussing R.
So what you can expect from this book is that this book is not the only one you should read.
## Prerequisites {-}
R and RStudio are the two main pieces of software that we are going to use. R is the programming
language and RStudio is a modern IDE for it. You can use R without RStudio; but you cannot use
RStudio without R.
If you wish to install R and RStudio at home to follow the examples in this book you can do it as
both pieces of software are available free of charge (paid options for RStudio exist, for companies
that need technical support). Installation is simple, but operating system dependent. To download
and install R for Windows, follow [this link](https://cloud.r-project.org/bin/windows/base/).
For macOS, follow [this one](https://cloud.r-project.org/bin/macosx/). If you run a GNU+Linux
distribution, you can install R using the system's package manager. If you're running Ubuntu, you
might want to take a look at [r2u](https://github.com/eddelbuettel/r2u), which provides very
fast installation of packages, full integration with `apt` (so dependencies get solved automatically)
and covers the entirety of CRAN.
For RStudio, look for your operating system [here](https://www.rstudio.com/products/rstudio/download/#download).
## What are packages? {-}
There is one more step; we are going to install some packages. Packages are additional pieces of
code that can be installed from within R with the following function: `install.packages()`. These
packages extend R's capabilities significantly, and are probably one of the main reasons R is so
popular. As of November 2018, R has over 13000 packages.
To install the packages we need, first open RStudio and then copy and paste this line in the console:
```{r, eval=FALSE}
install.packages(c("tidyverse", "rsample", "recipes", "blogdown" ,"yardstick", "parsnip", "plm", "pwt9",
"checkpoint", "Ecdat", "ggthemes", "ggfortify", "margins", "janitor", "rio", "stopwords",
"colourpicker", "glmnet", "lhs", "mlrMBO", "mlbench", "ranger"))
```
```{r, echo=FALSE}
knitr::include_graphics("pics/install_packages.png")
```
or go to the **Packages** pane and then click on *Install*:
```{r, echo=FALSE}
knitr::include_graphics("pics/rstudio_install_packages.gif")
```
## The author {-}
My name is Bruno Rodrigues and I program almost exclusively in R and have been teaching some R
courses for a few years now. I first started teaching for students at the University of Strasbourg
while working on my PhD. I hold a PhD in economics, with a focus on quantitative methods.
I'm currently head of the statistics department of the Ministry of Higher education and Research
in Luxembourg, and before that worked as a manager in the data science team of PWC Luxembourg.
This book is an adaptation of notes I've used in the past during my time as a teacher, but also
a lot of things I've learned about R since I left academia.
In my free time I like cooking, working out and [blogging](https://www.brodrigues.co), while listening to
[Fip](http://www.fipradio.fr/player) or
[Chillsky Radio](https://chillsky.com/listen/).
I also like to get my butt handed to me by playing roguelikes
such as [NetHack](http://nethack.wikia.com/wiki/NetHack), for which I wrote a
[package](https://github.com/b-rodrigues/nethack) that contains functions to analyze the data that
is saved on your computer after you win or lose (it will be lose 99% of the time) the game.
You can follow me on [twitter](https://www.twitter.com/brodriguesco), I tweet mostly about R or
what's happening in Luxembourg.