# Defining your own functions
In this chapter we are going to learn some advanced concepts that will turn you into a
full-fledged R programmer. Until now, you have only used the functions that come with R, as well as the
functions contained in packages. We did define some functions ourselves in Chapter 6 already, but
without going into much detail. In this chapter, we will learn how to build our own functions,
and do so in greater detail than before.
## Control flow
Knowing about control flow is essential to build your own functions. Without control flow statements,
such as if-else statements or loops (or, in the case of pure functional programming languages, recursion),
programming languages would be very limited.
### If-else
Imagine you want a variable to be equal to a certain value if a condition is met. This is a typical
problem that requires the `if ... else ...` construct. For instance:
```{r}
a <- 4
b <- 5
```
Suppose that if `a > b` then `f` should be equal to 20, else `f` should be equal to 10. Using
`if ... else ...` you can achieve this like so:
```{r}
if (a > b) {
  f <- 20
} else {
  f <- 10
}
```
Obviously, here `f = 10`. Another way to achieve this is by using the `ifelse()` function:
```{r}
f <- ifelse(a > b, 20, 10)
```
`if...else...` and `ifelse()` might seem interchangeable, but they're not. `ifelse()` is vectorized, while
`if...else...` is not. Let's try the following:
```{r}
ifelse(c(1,2,4) > c(3, 1, 0), "yes", "no")
```
The result is a vector. Now, let's see what happens if we use `if...else...` instead of `ifelse()`:
```{r, eval = F}
if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no")
```
```{r, eval = F}
> Error in if (c(1, 2, 4) > c(3, 1, 0)) print("yes") else print("no") :
the condition has length > 1
```
This results in an error (in R versions before 4.2.0, only the first element of the vector would get used, with a warning).
We have already discussed this in Chapter 2, remember? If you want to check that the condition
holds for every element of the vector, then you need to use `all()`:
```{r}
ifelse(all(c(1,2,4) > c(3, 1, 0)), "all elements are greater", "not all elements are greater")
```
You may also remember the `any()` function:
```{r}
ifelse(any(c(1,2,4) > c(3, 1, 0)), "at least one element is greater", "no element greater")
```
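Because `all()` and `any()` return a single `TRUE` or `FALSE`, they can also be used as the condition of a regular `if...else...`. The following is only a small sketch; the names `x`, `y` and `comparison` are made up for the illustration:

```{r}
x <- c(1, 2, 4)
y <- c(3, 1, 0)

# all() collapses the element-wise comparison into a single TRUE or FALSE,
# which is what if expects as a condition
if (all(x > y)) {
  comparison <- "all elements are greater"
} else {
  comparison <- "not all elements are greater"
}
comparison
```

Here the first element of `x` is smaller than the first element of `y`, so the `else` branch is the one that runs.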
These are the basics. But sometimes, you might need to test for more complex conditions, which can
lead to using nested `if...else...` constructs. These, however, can get messy:
```{r}
if (10 %% 3 == 0) {
  print("10 is divisible by 3")
} else if (10 %% 2 == 0) {
  print("10 is divisible by 2")
}
```
10 being obviously divisible by 2 and not by 3, it is the second sentence that will be printed. The
`%%` operator is the modulo operator, which gives the remainder of a division: `10 %% 3` is 1 and
`10 %% 2` is 0. In such cases, it is easier to use `dplyr::case_when()`:
```{r}
case_when(10 %% 3 == 0 ~ "10 is divisible by 3",
          10 %% 2 == 0 ~ "10 is divisible by 2")
```
We have already encountered this function in Chapter 4, inside a `dplyr::mutate()` call to create a new column.
Let's now discuss loops.
### For loops
For loops make it possible to repeat a set of instructions a given number of times, with `i` taking each value of a sequence in turn. For example, try the following:
```{r}
for (i in 1:10){
  print("hello")
}
```
It is also possible to do computations using for loops. Let's compute the sum of the first
100 integers:
```{r}
result <- 0
for (i in 1:100){
  result <- result + i
}
print(result)
```
`result` is equal to 5050, the expected result. What happened in that loop? First, we defined a
variable called `result` and set it to 0. Then, when the loop starts, `i` equals 1, so we add
`i` to `result`, which gives 1. Then, `i` equals 2, and again, we add `i` to `result`. But this time,
`result` equals 1 and `i` equals 2, so now `result` equals 3, and we repeat this until `i`
equals 100. If you know a programming language like C, this probably looks familiar. However, R is
not C, and you should, if possible, avoid writing code that looks like this. You should always
ask yourself the following questions:
- Is there a built-in function to achieve what I need? In this case we have `sum()`, so we could use `sum(seq(1, 100))`.
- Is there a way to use matrix algebra? This can sometimes make things easier, but it depends how comfortable
you are with matrix algebra. This would be the solution with matrix algebra: `rep(1, 100) %*% seq(1, 100)`.
- Is there a way to use building blocks that are already available? For instance, suppose that `sum()`
were not available in R. Another way to solve this issue would be to use the following
building blocks: `+`, which computes the sum of two numbers and `Reduce()`, which *reduces* a list
of elements using an operator. Sounds complicated? Let's see how `Reduce()` works. First, let me show you how
I combine these two functions to achieve the same result as when using `sum()`:
```{r}
Reduce(`+`, seq(1, 100))
```
We will see how `Reduce()` works in greater detail in the next chapter, but what happened was something like this:
```
Reduce(`+`, seq(1, 100)) =
1 + Reduce(`+`, seq(2, 100)) =
1 + 2 + Reduce(`+`, seq(3, 100)) =
1 + 2 + 3 + Reduce(`+`, seq(4, 100)) =
....
```
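To see the intermediate steps for yourself, you can ask `Reduce()` to keep every partial result with its `accumulate` argument. This is only a small sketch on the first 5 integers; the name `running_sums` is made up for the illustration:

```{r}
# accumulate = TRUE keeps every intermediate sum instead of only the last one
running_sums <- Reduce(`+`, seq(1, 5), accumulate = TRUE)
running_sums
```

Each element is the sum of all the integers up to that position, and the last element is the value that `Reduce()` returns when `accumulate` is left at its default.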
If you ask yourself these questions, it turns out that you only rarely need to write loops. Loops are
still important, though, because sometimes there simply isn't an alternative. There are also situations where a loop
is the clearer choice, so I refer you to the following [section](http://adv-r.had.co.nz/Functionals.html#functionals-not)
of Hadley Wickham's *Advanced R* for an in-depth discussion on situations where loops make more
sense than using functions such as `Reduce()`.
### While loops
While loops are very similar to for loops. The instructions inside a while loop are repeated while a
certain condition holds true. Let's consider the sum of the first 100 integers again:
```{r}
result <- 0
i <- 1
while (i <= 100){
  result <- result + i
  i <- i + 1
}
print(result)
```
Here, we first set `result` to 0 and `i` to 1. Then, while `i` is less than, or equal to 100, we add `i`
to `result`. Notice that there is one more line than in the for loop version of this code: we need
to increment the value of `i` at each iteration, if not, `i` would stay equal to 1, and the
condition would always be fulfilled, and the loop would run forever (not really, only until your
computer runs out of memory, or until the heat death of the universe, whichever comes first).
Now that we know how to write loops, and know about `if...else...` constructs, we have (almost) all
the ingredients to write our own functions.
## Writing your own functions
As you have seen by now, R includes a very large number of built-in functions, and many
more functions are available in packages. However, there will be a lot of situations where you will
need to write your own. In this section we are going to learn how to write our own functions.
### Declaring functions in R
Suppose you want to create the following function: $f(x) = \dfrac{1}{\sqrt{x}}$.
Writing this in R is quite simple:
```{r}
my_function <- function(x){
  1/sqrt(x)
}
```
The argument of the function, `x`, gets passed to the `function()` function and the *body* of
the function (more on that in the next Chapter) contains the function definition. Of course,
you could define functions that use more than one input:
```{r}
my_function <- function(x, y){
  1/sqrt(x + y)
}
```
or inputs with names longer than one character:
```{r}
my_function <- function(argument1, argument2){
  1/sqrt(argument1 + argument2)
}
```
Functions written by the user get called just the same way as functions included in R:
```{r}
my_function(1, 10)
```
It is also possible to provide default values to the function's arguments, which are values that are used
if the user omits them:
```{r}
my_function <- function(argument1, argument2 = 10){
  1/sqrt(argument1 + argument2)
}
```
```{r}
my_function(1)
```
This is especially useful for functions with many arguments. Consider also the following example,
where the function has a default method:
```{r}
my_function <- function(argument1, argument2, method = "foo"){
  x <- argument1 + argument2
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}
my_function(10, 11)
my_function(10, 11, "bar")
```
As you see, depending on the "method" chosen, the returned result is either a numeric, or a string.
What happens if the user provides a "method" that is neither "foo" nor "bar"?
```{r}
my_function(10, 11, "spam")
```
As you can see nothing is printed: neither condition is met, so the function silently returns `NULL`.
It is possible to add safeguards to your function to avoid such
situations:
```{r}
my_function <- function(argument1, argument2, method = "foo"){
  if(!(method %in% c("foo", "bar"))){
    return("Method must be either 'foo' or 'bar'")
  }
  x <- argument1 + argument2
  if(method == "foo"){
    1/sqrt(x)
  } else if (method == "bar"){
    "this is a string"
  }
}
my_function(10, 11)
my_function(10, 11, "bar")
my_function(10, 11, "foobar")
```
Notice that I have used `return()` inside my first `if` statement. This is to immediately stop
evaluation of the function and return a value. If I had omitted it, evaluation would have
continued, because a function returns the value of the last expression that it evaluates. Remove `return()` and run the
function again, and see what happens. Later, we are going to learn how to add better safeguards to
your functions and to avoid runtime errors.
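As a small preview, one common alternative to returning a message as a string is to raise an error with base R's `stop()`. The sketch below is only an illustration: the name `my_function_strict()` is made up, and `tryCatch()` is used here just to show the error message without interrupting the script:

```{r}
my_function_strict <- function(argument1, argument2, method = "foo"){
  # stop() raises an error instead of returning the message as a value
  if(!(method %in% c("foo", "bar"))){
    stop("Method must be either 'foo' or 'bar'")
  }
  x <- argument1 + argument2
  if(method == "foo"){
    1/sqrt(x)
  } else {
    "this is a string"
  }
}

my_function_strict(10, 11)

# catch the error and only keep its message
tryCatch(my_function_strict(10, 11, "foobar"),
         error = function(e) conditionMessage(e))
```

The difference with the previous version is that the caller cannot mistake the message for a valid result: the function either returns a result or fails.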
While in general, it is a good idea to add comments to your functions to explain what they do, I
would avoid adding comments to functions that do things that are very obvious, such as with this
one. Function names should be of the form: `function_name()`. Always give your functions very
explicit names! In mathematics it is standard to give functions just one letter as a name, but I
would advise against doing that in your code. Functions that you write are not special in any way;
this means that R will treat them the same way as any other function, and they will work in conjunction with any other
function just as if they were built into R.
They have one limitation though (which is shared with R's native functions): just like in math,
they can only return one value. However, sometimes, you may need to return more than one value.
To be able to do this, you must put your values in a single object, such as a vector or a list, and return that object. For example:
```{r}
average_and_sd <- function(x){
  c(mean(x), sd(x))
}
average_and_sd(c(1, 3, 8, 9, 10, 12))
```
You're still returning a single object, but it's a vector. You can also return a named list:
```{r}
average_and_sd <- function(x){
  list("mean_x" = mean(x), "sd_x" = sd(x))
}
average_and_sd(c(1, 3, 8, 9, 10, 12))
```
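The advantage of a named list is that each element can then be retrieved by its name. A minimal sketch, where `average_and_sd_list()` and `stats` are names chosen only for this illustration:

```{r}
average_and_sd_list <- function(x){
  list("mean_x" = mean(x), "sd_x" = sd(x))
}

stats <- average_and_sd_list(c(1, 3, 8, 9, 10, 12))

# both $ and [[ ]] extract a single element of the list
stats$mean_x
stats[["sd_x"]]
```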
As described before, you can use `return()` at the end of your functions:
```{r}
average_and_sd <- function(x){
  result <- c(mean(x), sd(x))
  return(result)
}
average_and_sd(c(1, 3, 8, 9, 10, 12))
```
But this is only needed if you need to return a value early:
```{r}
average_and_sd <- function(x){
  if(any(is.na(x))){
    return(NA)
  } else {
    c(mean(x), sd(x))
  }
}
average_and_sd(c(1, 3, 8, 9, 10, 12))
average_and_sd(c(1, 3, NA, 9, 10, 12))
```
If you need to use a function from a package inside your function, use `::`:
```{r}
my_sum <- function(a_vector){
  purrr::reduce(a_vector, `+`)
}
```
However, if you need to use more than one function, this can become tedious. A quick and dirty
way of avoiding that is to use `library(package_name)` inside the function:
```{r}
my_sum <- function(a_vector){
  library(purrr)
  reduce(a_vector, `+`)
}
```
Loading the library inside the function has the advantage that you will be sure that the package
upon which your function depends will be loaded. If the package is already loaded, it will not be
loaded again, and thus will not impact performance, but if you forgot to load it at the beginning of your
script, then, no worries, your function will load it the first time you use it! However, you should
avoid doing this, because the resulting function is now not pure. It has a side effect, which is
loading a library. This could result in problems, especially if several functions load several
different packages that have functions with the same name. Depending on which function runs first,
a function with the same name but coming from a different package will be the one that is available in the global
environment. The very best way would be to write your own package and declare the packages upon
which your functions depend as dependencies. This is something we are going to explore in Chapter
9.
You can put a lot of instructions inside a function, such as loops. Let's create a function that
returns Fibonacci numbers.
### Fibonacci numbers
The Fibonacci sequence is the following:
$$1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...$$
Each subsequent number is the sum of the two preceding ones. In R, it is possible to define a function that returns the $n^{th}$ Fibonacci number:
```{r}
my_fibo <- function(n){
  a <- 0
  b <- 1
  # seq_len(n) is empty when n = 0, so my_fibo(0) returns 0
  for (i in seq_len(n)){
    temp <- b
    b <- a
    a <- a + temp
  }
  a
}
```
Inside the loop, we defined a variable called `temp`. Defining temporary variables is usually very
useful. Let's try to understand what happens inside this loop:
* First, we assign the value 0 to variable `a` and value 1 to variable `b`.
* We start a loop, that goes from 1 to `n`.
* We assign the value inside of `b` to a temporary variable, called `temp`.
* `b` becomes `a`.
* We assign the sum of `a` and `temp` to `a`.
* When the loop is finished, we return `a`.
What happens if we want the 3rd Fibonacci number? Before the loop starts, `a = 0` and `b = 1`.
At `i = 1`, `temp = 1`, `b = 0` and `a = 0 + 1`. Then `i = 2`: `temp = 0` (because `b = 0`), the previous
result `a = 1` is assigned to `b`, so `b = 1`, and `a = 1 + 0`. Finally, `i = 3`: `temp
= 1` (because `b = 1`), the previous result `a = 1` is assigned to `b` and finally, `a = 1 + 1`. So
the third Fibonacci number equals 2. Reading this might be a bit confusing; I strongly advise you
to run the algorithm on a sheet of paper, step by step.
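If you prefer to let R do the bookkeeping, you can print the state of the variables at the end of each iteration. The function below is only a sketch of the same algorithm with one extra line; the name `my_fibo_trace()` is made up for the illustration:

```{r}
my_fibo_trace <- function(n){
  a <- 0
  b <- 1
  for (i in seq_len(n)){
    temp <- b
    b <- a
    a <- a + temp
    # show the state of the variables at the end of each iteration
    cat("i =", i, " temp =", temp, " b =", b, " a =", a, "\n")
  }
  a
}

my_fibo_trace(3)
```

Each printed line corresponds to one pass through the loop, which you can compare with the steps you wrote down on paper.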
The above algorithm is called an iterative algorithm, because it uses a loop to compute the result.
Let's look at another way to think about the problem, with a so-called recursive function:
```{r}
fibo_recur <- function(n){
  if (n == 0 || n == 1){
    return(n)
  } else {
    fibo_recur(n-1) + fibo_recur(n-2)
  }
}
```
This algorithm should be easier to understand: if `n = 0` or `n = 1` the function should return `n`
(0 or 1). If `n` is strictly bigger than `1`, `fibo_recur()` should return the sum of
`fibo_recur(n-1)` and `fibo_recur(n-2)`. This version of the function is very much the same as the
mathematical definition of the Fibonacci sequence. So why not use only recursive algorithms
then? Try to run the following:
```{r}
system.time(my_fibo(30))
```
The result should be printed very fast (the `system.time()` function returns the time that it took
to execute `my_fibo(30)`). Let's try with the recursive version:
```{r}
system.time(fibo_recur(30))
```
It takes much longer to execute! This naive recursive version computes the same values over and
over again (`fibo_recur(n-2)` is computed once directly and once more inside `fibo_recur(n-1)`, and
so on), so the number of function calls grows very quickly with `n`. If speed is
critical, it's best to avoid this kind of recursive algorithm. Also, in `fibo_recur()` try to remove this line:
`if (n == 0 || n == 1)` and try to run `fibo_recur(5)` and see what happens. You should
get an error: this is because for recursive algorithms you need a stopping condition, or else,
it would run forever. This is not the case for iterative algorithms, because the stopping
condition is the last step of the loop.
So as you can see, for recursive relationships, for or while loops are the way to go in R, whether
you're writing these loops inside functions or not.
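To get a feeling for why the recursive version is so slow, you can count how many times the function calls itself. The following is only a sketch: `count_fibo_calls()` and `fibo_counted()` are made-up names, and the counter is updated with the `<<-` operator, which modifies a variable defined in the enclosing function:

```{r}
count_fibo_calls <- function(n){
  calls <- 0
  # same logic as the recursive function above, plus a counter
  fibo_counted <- function(n){
    calls <<- calls + 1
    if (n == 0 || n == 1){
      n
    } else {
      fibo_counted(n - 1) + fibo_counted(n - 2)
    }
  }
  value <- fibo_counted(n)
  list("value" = value, "calls" = calls)
}

count_fibo_calls(10)
count_fibo_calls(20)
```

For `n = 10` the recursive function is already called 177 times, whereas the loop in the iterative version only runs 10 times.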
## Exercises
### Exercise 1 {-}
In this exercise, you will write a function to compute the sum of the first n integers. Combine the
algorithm we saw in the section about while loops and what you learned about functions
in this section.
```{r, include=FALSE}
MySum <- function(n){
  result <- 0
  i <- 1
  while (i <= n){
    result <- result + i
    i <- i + 1
  }
  result
}
```
### Exercise 2 {-}
Write a function called `my_fact()` that computes the factorial of a number `n`. Do it using a
loop, using a recursive function, and using a functional:
```{r, include=FALSE}
my_fact_iter <- function(n){
  result <- 1
  # seq_len(n) is empty when n = 0, so my_fact_iter(0) returns 1
  for(i in seq_len(n)){
    result <- result * i
  }
  result
}
my_fact_recur <- function(n){
  if(n == 0 || n == 1){
    1
  } else {
    n * my_fact_recur(n - 1)
  }
}
my_fact_reduce <- function(n){
  # .init = 1 covers the case n = 0
  purrr::reduce(seq_len(n), `*`, .init = 1)
}
```
### Exercise 3 {-}
Write a function to find the roots of quadratic functions. Your function should take 3 arguments,
`a`, `b` and `c` and return the two roots. Only consider the case where there are two real roots
(delta > 0).
```{r, include=FALSE}
quad_root <- function(a, b, c){
  # function that returns the roots of a quadratic function
  # very basic, doesn't cover the case where delta < 0
  delta <- b^2 - 4 * a * c
  x1 <- (-b + sqrt(delta)) / (2 * a)
  x2 <- (-b - sqrt(delta)) / (2 * a)
  c(x1, x2)
}
quad_root(1, -4, 3) # should return 3 and 1
```
## Functions that take functions as arguments: writing your own higher-order functions
Functions that take functions as arguments are very powerful and useful tools.
Two very important functions, that we will discuss in chapter 8 are `purrr::map()`
and `purrr::reduce()`. But you can also write your own! A very simple example
would be the following:
```{r}
my_func <- function(x, func){
  func(x)
}
```
`my_func()` is a very simple function that takes `x` and `func()` as arguments and that simply
executes `func(x)`. This might not seem very useful (after all, you could simply use `func(x)`!) but
this is just for illustration purposes; in practice, your functions would be more useful than that!
Let's try to use `my_func()`:
```{r}
my_func(c(1, 8, 1, 0, 8), mean)
```
As expected, this returns the mean of the given vector. But now suppose the following:
```{r}
my_func(c(1, 8, 1, NA, 8), mean)
```
Because one element of the vector is `NA`, the whole mean is `NA`. `mean()` has a `na.rm` argument
that you can set to `TRUE` to ignore the `NA`s in the vector. However, here, there is no way to
provide this argument to the function `mean()`! Let's see what happens when we try to:
```{r, eval=FALSE}
my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
```
```
Error in my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE) :
unused argument (na.rm = TRUE)
```
So what you could do is pass the value `TRUE` to the `na.rm` argument of `mean()` from your own
function:
```{r}
my_func <- function(x, func, remove_na){
  func(x, na.rm = remove_na)
}
my_func(c(1, 8, 1, NA, 8), mean, remove_na = TRUE)
```
This is one solution, but `mean()` also has another argument called `trim`. What if some other
user needs this argument? Should you also add it to your function? Surely there's a way to avoid
this problem? Yes, there is, and it is by using the *dots*. The `...` simply mean "any other
argument as needed", and they are very easy to use:
```{r}
my_func <- function(x, func, ...){
  func(x, ...)
}
my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE)
```
or, now, if you need the `trim` argument:
```{r}
my_func(c(1, 8, 1, NA, 8), mean, na.rm = TRUE, trim = 0.1)
```
The `...` are very useful when writing higher-order functions such as `my_func()`, because they allow
you to pass arguments *down* to the underlying functions.
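The same wrapper works with any function, whatever the names of its extra arguments. Here is a small sketch using two base R functions, `quantile()` and `round()`; the name `my_func_dots()` is made up so that this block can be run on its own:

```{r}
my_func_dots <- function(x, func, ...){
  # everything captured by ... is handed over to func() untouched
  func(x, ...)
}

# probs and na.rm are arguments of quantile()
my_func_dots(c(1, 8, 1, NA, 8), quantile, probs = 0.5, na.rm = TRUE)

# digits is an argument of round()
my_func_dots(c(1.234, 5.678), round, digits = 1)
```

`my_func_dots()` does not need to know anything about `probs`, `na.rm` or `digits`: it simply forwards them.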
## Functions that return functions
The example from before, `my_func()` took three arguments, some `x`, a function `func`, and `...` (dots). `my_func()`
was a kind of wrapper that evaluated `func` on its arguments `x` and `...`. But sometimes this is not quite what you
need or want. It is sometimes useful to write a function that returns a modified function. This type of function
is called a function factory, as it *builds* functions. For instance, suppose that we want to time how long functions
take to run. An idea would be to proceed like this:
```{r, eval = FALSE}
tic <- Sys.time()
very_slow_function(x)
toc <- Sys.time()
running_time <- toc - tic
```
but if you want to time several functions, this gets very tedious. It would be much easier if functions would
time *themselves*. We could achieve this by writing a wrapper, like this:
```{r, eval = FALSE}
timed_very_slow_function <- function(...){
  tic <- Sys.time()
  result <- very_slow_function(...)
  toc <- Sys.time()
  running_time <- toc - tic
  list("result" = result,
       "running_time" = running_time)
}
```
The problem here is that we have to change each function we need to time. But thanks to the concept of function
factories, we can write a function that does this for us:
```{r}
time_f <- function(.f, ...){
  function(...){
    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()
    running_time <- toc - tic
    list("result" = result,
         "running_time" = running_time)
  }
}
```
`time_f()` is a function that returns a function, a function factory. Calling it on a function returns, as expected,
a function:
```{r}
t_mean <- time_f(mean)
t_mean
```
This function can now be used like any other function:
```{r}
output <- t_mean(seq(-500000, 500000))
```
`output` is a list of two elements, the first being simply the result of `mean(seq(-500000, 500000))`, and the other
being the running time.
This approach is super flexible. For instance, imagine that there is an `NA` in the vector. This would result in
the mean of this vector being `NA`:
```{r}
t_mean(c(NA, seq(-500000, 500000)))
```
But because we use the `...` in the definition of `time_f()`, we can now simply pass `mean()`'s option down to it:
```{r}
t_mean(c(NA, seq(-500000, 500000)), na.rm = TRUE)
```
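The point of a function factory is that it can be reused on any function without rewriting the timing code. The sketch below repeats the same idea under a made-up name, `time_f_sketch()`, so that it can be run on its own, and applies it to two base R functions:

```{r}
time_f_sketch <- function(.f){
  # the returned function remembers .f and times each call to it
  function(...){
    tic <- Sys.time()
    result <- .f(...)
    toc <- Sys.time()
    list("result" = result,
         "running_time" = toc - tic)
  }
}

t_sort <- time_f_sketch(sort)
t_sqrt <- time_f_sketch(sqrt)

t_sort(c(3, 1, 2), decreasing = TRUE)$result
t_sqrt(16)$result
```

Both `t_sort()` and `t_sqrt()` accept the same arguments as the functions they wrap, because these arguments are passed down through the `...`.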
## Functions that take columns of data as arguments
### The `enquo() - !!()` approach
In many situations, you will want to write functions that look similar to this:
```{r, eval=FALSE}
my_function(my_data, one_column_inside_data)
```
Such a function would be useful in situations where you have to apply a certain number of operations
to columns of different data frames. For example if you need to create tables of descriptive
statistics or graphs periodically, it might be very useful to put these operations inside a
function and then call the function whenever you need it, on the fresh batch of data.
However, if you try to write something like that, something that might seem unexpected, at first,
will happen:
```{r, eval=FALSE}
data(cars)
simple_function <- function(dataset, col_name){
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}
simple_function(cars, "dist")
```
```
Error: unknown variable to group by : col_name
```
The variable `col_name` is passed to `simple_function()` as a string, but `group_by()` requires a
variable name. So why not try to convert `col_name` to a name?
```{r, eval=FALSE}
simple_function <- function(dataset, col_name){
  col_name <- as.name(col_name)
  dataset %>%
    group_by(col_name) %>%
    summarise(mean_speed = mean(speed))
}
simple_function(cars, "dist")
```
```
Error: unknown variable to group by : col_name
```
This is because `group_by()` takes `col_name` literally: it looks for a column called `col_name`
inside the dataset. R does not understand that you are referring to the
column `dist`, whose name is stored in `col_name`. So how can we make R understand what you mean?
To be able to do that, we need to use a framework that was introduced in the `{tidyverse}`,
called *tidy evaluation*. This framework can be used by installing the `{rlang}` package.
`{rlang}` is quite a technical package, so I will spare you the details. But you should at
the very least take a look at the following documents
[here](http://dplyr.tidyverse.org/articles/programming.html) and
[here](https://rlang.r-lib.org/reference/topic-data-mask.html). The
discussion can get complicated, but you don't need to know everything about `{rlang}`.
As you will see, knowing some of the capabilities `{rlang}` provides can be incredibly useful.
Take a look at the code below:
```{r}
simple_function <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg))
}
simple_function(mtcars, cyl)
```
As you can see, the previous idea we had, which was using `as.name()` was not very far away from
the solution. The solution, with `{rlang}`, consists in using `enquo()`, which (for our purposes),
does something similar to `as.name()`. Now that `col_name` is (R programmers call it) quoted, or
*defused*, we need to tell `group_by()` to evaluate the input as is. This is done with `!!()`,
called the [injection operator](https://rlang.r-lib.org/reference/injection-operator.html), which
is another `{rlang}` function. I say it again; don't worry if you don't understand everything. Just
remember to use `enquo()` on your column names and then `!!()` inside the `{dplyr}` function you
want to use.
Let's see some other examples:
```{r}
simple_function <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl))
}
simple_function(mtcars, am, 1)
```
Notice that I’ve written:
```{r, eval=FALSE}
filter((!!col_name) == value)
```
and not:
```{r, eval=FALSE}
filter(!!col_name == value)
```
I have enclosed `!!col_name` inside parentheses. This is because operators such as `==` have
precedence over `!!`, so you have to be explicit. Also, notice that I didn't have to quote `1`.
This is because it's a *standard* value, not a column inside the dataset. Let’s make this function
a bit more general. I hard-coded the variable `cyl` inside the body of the function, but maybe you’d
like the mean of another variable?
```{r}
simple_function <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(mean((!!mean_col)))
}
simple_function(mtcars, am, cyl, 1)
```
Notice that I had to quote `mean_col` too.
Using the `...` that we discovered in the previous section, we can pass more than one column:
```{r}
simple_function <- function(dataset, ...){
  col_vars <- quos(...)
  dataset %>%
    summarise_at(vars(!!!col_vars), funs(mean, sd))
}
```
Because these *dots* contain more than one variable, you have to use `quos()` instead of `enquo()`.
This will put the arguments provided via the dots in a list. Then, because we have a list of
columns, we have to use `summarise_at()`, which you should know if you did the exercises of
Chapter 4. So if you didn't do them, go back to them and finish them first. Doing the exercises will
also teach you what `vars()` and `funs()` are. The last thing you have to pay attention to is to
use `!!!()` if you used `quos()`. So 3 `!` instead of only 2. This allows you to then do things
like this:
```{r}
simple_function(mtcars, am, cyl, mpg)
```
Using `...` with `!!!()` allows you to write very flexible functions.
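If you are using a recent version of `{dplyr}` (1.0.0 or later, which is an assumption about your installation), the same kind of function can also be sketched with `across()`, which takes the columns and a named list of functions. The name `simple_function_across()` is made up for this illustration:

```{r}
library(dplyr)

simple_function_across <- function(dataset, ...){
  dataset %>%
    # c(...) forwards the columns given through the dots to across();
    # the names of the list are used as suffixes for the new columns
    summarise(across(c(...), list(mean = mean, sd = sd)))
}

simple_function_across(mtcars, am, cyl, mpg)
```

The resulting columns are named after the original column and the function, for instance `mpg_mean` and `mpg_sd`.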
If you need to be even more general, you can also provide the summary functions as arguments of
your function, but you have to rewrite your function a little bit:
```{r}
simple_function <- function(dataset, cols, funcs){
  dataset %>%
    summarise_at(vars(!!!cols), funs(!!!funcs))
}
```
You might be wondering where the `quos()` went? Well because now we are passing two lists, a list of
columns that we have to quote, and a list of functions, that we also have to quote, we need to use `quos()`
when calling the function:
```{r}
simple_function(mtcars, quos(am, cyl, mpg), quos(mean, sd, sum))
```
This works, but I don't think you'll need to have that much flexibility; either the columns
are variables, or the functions, but rarely both at the same time.
To conclude this section, I should also talk about `as_label()`, which converts a quoted column name
into a string. This allows you to build the name of a new variable, for instance if you want to call
the resulting column `mean_mpg` when you compute the mean of the `mpg` column:
```{r}
simple_function <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", as_label(mean_col))
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!(mean_name) := mean((!!mean_col)))
}
```
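Pay attention to the `:=` operator in the last line. It is needed, instead of `=`, whenever the name on the left-hand side is injected with `!!`, as is the case here with the name built using `as_label()`.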
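Here is how such a function can be called, and what the resulting column is called. This sketch repeats the function under a made-up name, `mean_by_filter()`, and calls `as_label()` through `rlang::` so that the block can be run on its own:

```{r}
library(dplyr)

mean_by_filter <- function(dataset, filter_col, mean_col, value){
  filter_col <- enquo(filter_col)
  mean_col <- enquo(mean_col)
  # as_label() turns the quoted column into the string "cyl", "mpg", ...
  mean_name <- paste0("mean_", rlang::as_label(mean_col))
  dataset %>%
    filter((!!filter_col) == value) %>%
    summarise(!!(mean_name) := mean((!!mean_col)))
}

mean_by_filter(mtcars, am, cyl, 1)
mean_by_filter(mtcars, am, mpg, 1)
```

The first call returns a column named `mean_cyl` and the second one a column named `mean_mpg`, without any change to the body of the function.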
### Curly Curly, a simplified approach to `enquo()` and `!!()`
The previous section might have been a bit difficult to grasp, but there is a simplified way of doing it,
which consists in using `{{}}`, introduced in `{rlang}` version 0.4.0.
The suggested pronunciation of `{{}}` is *curly-curly*, but there is no
[consensus yet](https://twitter.com/JonTheGeek/status/1144815369766547456).
Let's suppose that I need to write a function that takes a data frame, as well as a column from
this data frame as arguments, just like before:
```{r}
how_many_na <- function(dataframe, column_name){
  dataframe %>%
    filter(is.na(column_name)) %>%
    count()
}
```
Let's try this function out on the `starwars` data:
```{r}
data(starwars)
head(starwars)
```
As you can see, there are missing values in the `hair_color` column. Let's try to count how many
missing values are in this column:
```{r, eval=FALSE}
how_many_na(starwars, hair_color)
```
```
Error: object 'hair_color' not found
```
Just as expected, this does not work. The issue is that the column is inside the dataframe,
but when calling the function with `hair_color` as the second argument, R is looking for a
variable called `hair_color` that does not exist. What about trying with `"hair_color"`?
```{r}
how_many_na(starwars, "hair_color")
```
Now we get something, but something wrong! `is.na("hair_color")` tests the string itself, which is never missing, so the count is 0.
One way to solve this issue, is to not use the `filter()` function, and instead rely on base R:
```{r}
how_many_na_base <- function(dataframe, column_name){
  # [[ ]] extracts the column as a vector, for data frames and tibbles alike
  na_index <- is.na(dataframe[[column_name]])
  sum(na_index)
}
how_many_na_base(starwars, "hair_color")
```
This works, but not using the `{tidyverse}` at all is not always an option. For instance,
the next function, which uses a grouping variable, would be difficult to implement without the
`{tidyverse}`:
```{r}
summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by(grouping_var) %>%
    summarise(mean(column_name, na.rm = TRUE))
}
```
Calling this function results in the following error message, as expected:
```
Error: Column `grouping_var` is unknown
```
In the previous section, we solved the issue like so:
```{r}
summarise_groups <- function(dataframe, grouping_var, column_name){
  grouping_var <- enquo(grouping_var)
  column_name <- enquo(column_name)
  mean_name <- paste0("mean_", as_label(column_name))
  dataframe %>%
    group_by(!!grouping_var) %>%
    summarise(!!(mean_name) := mean(!!column_name, na.rm = TRUE))
}
```
The core of the function remained very similar to the version from before, but now one has to
use the `enquo()`-`!!` syntax.
Now this can be simplified using the new `{{}}` syntax:
```{r}
summarise_groups <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by({{grouping_var}}) %>%
    summarise({{column_name}} := mean({{column_name}}, na.rm = TRUE))
}
```
Much easier and cleaner! You still have to use the `:=` operator instead of `=` for the column name
however, and if you want to modify the column names, for instance in this
case return `"mean_height"` instead of `height`, you can keep using the `enquo()`-`!!` syntax.
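The same syntax also fixes the `how_many_na()` function from the beginning of this section. The sketch below uses made-up names (`how_many_na_curly()` and `summarise_groups_named()`) so that it does not overwrite the functions defined above. The second function relies on glue-style names on the left of `:=`, which I assume are available in your installation (they were added to `{rlang}` after version 0.4.0):

```{r}
library(dplyr)

how_many_na_curly <- function(dataframe, column_name){
  dataframe %>%
    # {{ }} tells filter() to look for the column inside the data frame
    filter(is.na({{ column_name }})) %>%
    count()
}

how_many_na_curly(starwars, hair_color)

summarise_groups_named <- function(dataframe, grouping_var, column_name){
  dataframe %>%
    group_by({{ grouping_var }}) %>%
    # the string on the left is filled in with the name of the column
    summarise("mean_{{ column_name }}" := mean({{ column_name }}, na.rm = TRUE))
}

summarise_groups_named(starwars, species, height)
```

The column name is now passed unquoted, just like in any other `{dplyr}` function, and the summary column of the second function is called `mean_height`.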
## Functions that use loops