```{r, include=FALSE}
source("common.R")
```
# (PART) Miscellaneous {-}
# Deprecated
## Conditions
__[Q1]{.Q}__: What does `options(error = recover)` do? Why might you use it?
__[A]{.solved}__: With `options(error = recover)`, `utils::recover()` is called (without arguments) whenever an error occurs. It prints the list of calls that preceded the error and lets you browse (via `browser()`) in the environment of any of these calls, which makes it a practical tool for interactive debugging.
__[Q2]{.Q}__: What does `options(error = quote(dump.frames(to.file = TRUE)))` do? Why might you use it?
__[A]{.solved}__: This option writes a dump of the evaluation environments at the point of an error into a file ending in `.rda`. When this option is set, R will continue to run after the first error. To stop R at the first error, use `quote({dump.frames(to.file = TRUE); q()})`. These options are especially useful for debugging non-interactive R scripts afterwards ("post-mortem debugging").
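A minimal sketch of this post-mortem workflow (`last.dump.rda` is the default file name produced by `dump.frames(to.file = TRUE)`):

```{r, eval = FALSE}
# In the non-interactive script: dump all frames to a file and quit on error.
options(error = quote({dump.frames(to.file = TRUE); q(status = 1)}))

# Later, in an interactive session: load the dump and browse the frames.
load("last.dump.rda")
debugger(last.dump)
```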
## Expressions (new)
1. __[Q]{.Q}__: `base::alist()` is useful for creating pairlists to be used for function arguments:
```{r}
foo <- function() {}
formals(foo) <- alist(x = , y = 1)
foo
```
What makes `alist()` special compared to `list()`?
__[A]{.solved}__: From `?alist`:
> alist handles its arguments as if they described function arguments. So the values are not evaluated, and tagged arguments with no value are allowed whereas list simply ignores them. alist is most often used in conjunction with formals.
## Functionals
### My first functional: `lapply()`
1. __[Q]{.Q}__: Why are the following two invocations of `lapply()` equivalent?
```{r, eval = FALSE}
trims <- c(0, 0.1, 0.2, 0.5)
x <- rcauchy(100)
lapply(trims, function(trim) mean(x, trim = trim))
lapply(trims, mean, x = x)
```
__[A]{.solved}__: In the first statement, each element of `trims` is explicitly supplied to `mean()`'s `trim` argument inside an anonymous function. In the second statement this happens via positional matching: since `mean()`'s first argument `x` is supplied by name through `lapply()`'s `...`, each element of `trims` is matched to the first remaining argument of `mean()`, which is `trim`.
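We can also verify the equivalence empirically; both calls produce identical output for the same `x`:

```{r}
set.seed(42)
x <- rcauchy(100)
trims <- c(0, 0.1, 0.2, 0.5)
out1 <- lapply(trims, function(trim) mean(x, trim = trim))
out2 <- lapply(trims, mean, x = x)
identical(out1, out2)  # TRUE
```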
2. __[Q]{.Q}__: The function below scales a vector so it falls in the range [0, 1]. How
would you apply it to every column of a data frame? How would you apply it
to every numeric column in a data frame?
```{r}
scale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
```
__[A]{.solved}__: Since this function needs numeric input, we can check for it with an `if` clause. To return the non-numeric input columns unchanged as well, we supply them to the `else` branch of the `if()` "function":
```{r, eval = FALSE}
data.frame(lapply(mtcars, function(x) if (is.numeric(x)) scale01(x) else x))
```
3. __[Q]{.Q}__: Use both for loops and `lapply()` to fit linear models to the
`mtcars` using the formulas stored in this list:
```{r}
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)
```
__[A]{.solved}__: Like in the first exercise, we can create two `lapply()` versions:
```{r, eval = TRUE}
# lapply (2 versions)
la1 <- lapply(formulas, lm, data = mtcars)
la2 <- lapply(formulas, function(x) lm(formula = x, data = mtcars))
# for loop
lf1 <- vector("list", length(formulas))
for (i in seq_along(formulas)) {
  lf1[[i]] <- lm(formulas[[i]], data = mtcars)
}
```
Note that all versions return the same content, but they are not identical, since the value of the "call" element differs between the versions.
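A quick self-contained check of this point: the coefficients agree, but the stored calls differ.

```{r}
m_lapply <- lapply(list(mpg ~ disp), lm, data = mtcars)[[1]]
m_direct <- lm(mpg ~ disp, data = mtcars)
all.equal(coef(m_lapply), coef(m_direct))  # TRUE
identical(m_lapply$call, m_direct$call)    # FALSE
```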
4. __[Q]{.Q}__: Fit the model `mpg ~ disp` to each of the bootstrap replicates of `mtcars`
in the list below by using a for loop and `lapply()`. Can you do it
without an anonymous function?
```{r, eval = TRUE}
bootstraps <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(mtcars), replace = TRUE)
  mtcars[rows, ]
})
```
__[A]{.solved}__:
```{r, eval = TRUE}
# lapply without anonymous function
la <- lapply(bootstraps, lm, formula = mpg ~ disp)
# for loop
lf <- vector("list", length(bootstraps))
for (i in seq_along(bootstraps)) {
  lf[[i]] <- lm(mpg ~ disp, data = bootstraps[[i]])
}
```
5. __[Q]{.Q}__: For each model in the previous two exercises, extract $R^2$ using the
function below.
```{r, eval = TRUE}
rsq <- function(mod) summary(mod)$r.squared
```
__[A]{.solved}__: For the models in exercise 3:
```{r, eval = TRUE}
sapply(la1, rsq)
sapply(la2, rsq)
sapply(lf1, rsq)
```
And the models in exercise 4:
```{r, eval = TRUE}
sapply(la, rsq)
sapply(lf, rsq)
```
### For loop functionals: friends of `lapply()`
1. __[Q]{.Q}__: Use `vapply()` to:
a) Compute the standard deviation of every column in a numeric data frame.
a) Compute the standard deviation of every numeric column in a mixed data
frame. (Hint: you'll need to use `vapply()` twice.)
__[A]{.solved}__: As a numeric `data.frame` we choose `cars`:
```{r, eval = FALSE}
vapply(cars, sd, numeric(1))
```
And as a mixed `data.frame` we choose `mtcars`:
```{r, eval = FALSE}
vapply(mtcars[vapply(mtcars, is.numeric, logical(1))],
       sd,
       numeric(1))
```
2. __[Q]{.Q}__: Why is using `sapply()` to get the `class()` of each element in
a data frame dangerous?
__[A]{.solved}__: Columns of data frames might have more than one class, so the class of `sapply()`'s output may differ (silently) depending on the input. If ...
* all columns have one class: `sapply()` returns a character vector
* one column has more classes than the others: `sapply()` returns a list
* all columns have the same number of classes, which is more than one: `sapply()` returns a matrix
For example:
```{r}
a <- letters[1:3]
class(a) <- c("class1", "class2")
df <- data.frame(a = character(3))
df$a <- a
df$b <- a
class(sapply(df, class))
```
Note that this case often appears while working with the POSIXt types, POSIXct and POSIXlt.
3. __[Q]{.Q}__: The following code simulates the performance of a t-test for non-normal
data. Use `sapply()` and an anonymous function to extract the p-value from
every trial.
```{r}
trials <- replicate(
  100,
  t.test(rpois(10, 10), rpois(7, 10)),
  simplify = FALSE
)
```
Extra challenge: get rid of the anonymous function by using `[[` directly.
__[A]{.solved}__:
```{r, eval = FALSE}
# anonymous function:
sapply(trials, function(x) x[["p.value"]])
# without anonymous function:
sapply(trials, "[[", "p.value")
```
4. __[Q]{.Q}__: What does `replicate()` do? What sort of for loop does it eliminate? Why
do its arguments differ from `lapply()` and friends?
__[A]{.solved}__: As stated in `?replicate`:
> replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).
We can see this clearly in the source code:
```{r, echo = FALSE}
replicate
```
Like `sapply()`, `replicate()` eliminates a for loop. As explained for `Map()` in the textbook, every use of `replicate()` could also have been written with `lapply()`, but `replicate()` is more concise and indicates more clearly what you're trying to do. Its arguments differ because it takes an expression to be re-evaluated and a count `n`, rather than a function and data to iterate over.
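Since both formulations evaluate the random expression once per iteration, they draw from the same RNG stream and give identical results under the same seed:

```{r}
set.seed(1)
a <- replicate(3, rnorm(2), simplify = FALSE)
set.seed(1)
b <- lapply(1:3, function(i) rnorm(2))
identical(a, b)  # TRUE
```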
5. __[Q]{.Q}__: Implement a version of `lapply()` that supplies `FUN` with both the name
and the value of each component.
__[A]{.solved}__:
```{r, eval = TRUE}
lapply_nms <- function(X, FUN, ...) {
  Map(FUN, X, names(X), ...)
}
lapply_nms(mtcars, function(x, y) c(class(x), y))
```
6. __[Q]{.Q}__: Implement a combination of `Map()` and `vapply()` to create an `lapply()`
variant that iterates in parallel over all of its inputs and stores its
outputs in a vector (or a matrix). What arguments should the function
take?
__[A]{.solved}__: As we understand this exercise, it is about working with a list of lists, as in the following example:
```{r}
testlist <- list(mtcars, mtcars, cars)
lapply(testlist, function(x) vapply(x, mean, numeric(1)))
```
So we can get the same result with a more specialized function:
```{r}
lmapply <- function(X, FUN, FUN.VALUE, simplify = FALSE) {
  out <- Map(function(x) vapply(x, FUN, FUN.VALUE), X)
  if (simplify == TRUE) {return(simplify2array(out))}
  out
}
lmapply(testlist, mean, numeric(1))
```
7. __[Q]{.Q}__: Implement `mcsapply()`, a multi-core version of `sapply()`. Can you
implement `mcvapply()`, a parallel version of `vapply()`? Why or why not?
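One possible sketch of `mcsapply()` builds on `parallel::mclapply()` (the function name and the conservative `mc.cores = 1L` default are our own choices; on Windows, `mclapply()` only supports `mc.cores = 1`). A true `mcvapply()` is harder: `vapply()` checks the type and length of each result while the results are generated inside C code, and this check cannot be distributed across worker processes; one could only validate after collecting all results, which loses part of `vapply()`'s safety benefit.

```{r}
library(parallel)

# A sketch: parallel lapply(), then simplify the result like sapply() does.
mcsapply <- function(X, FUN, ..., mc.cores = 1L, simplify = TRUE) {
  out <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  if (isTRUE(simplify)) simplify2array(out) else out
}

mcsapply(1:3, function(x) x ^ 2)  # 1 4 9
```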
### Manipulating matrices and data frames
1. __[Q]{.Q}__: How does `apply()` arrange the output? Read the documentation and perform
some experiments.
__[A]{.solved}__:
`apply()` arranges its output columns (or list elements) according to the order of the margin. The rows are ordered by the other dimensions, starting with the "last" dimension of the input object. What this means should become clear by looking at the three- and four-dimensional cases of the following example:
```{r, eval = FALSE}
# For two-dimensional cases everything is sorted by the other dimension.
arr2 <- array(1:9, dim = c(3, 3),
              dimnames = list(paste0("row", 1:3), paste0("col", 1:3)))
arr2
apply(arr2, 1, head, 1)  # margin is row
apply(arr2, 1, head, 9)  # sorts by col
apply(arr2, 2, head, 1)  # margin is col
apply(arr2, 2, head, 9)  # sorts by row

# 3-dimensional
arr3 <- array(1:27, dim = c(3, 3, 3),
              dimnames = list(paste0("row", 1:3), paste0("col", 1:3),
                              paste0("time", 1:3)))
arr3
apply(arr3, 1, head, 1)   # margin is row
apply(arr3, 1, head, 27)  # sorts by time and col
apply(arr3, 2, head, 1)   # margin is col
apply(arr3, 2, head, 27)  # sorts by time and row
apply(arr3, 3, head, 1)   # margin is time
apply(arr3, 3, head, 27)  # sorts by col and row

# 4-dimensional
arr4 <- array(1:81, dim = c(3, 3, 3, 3),
              dimnames = list(paste0("row", 1:3), paste0("col", 1:3),
                              paste0("time", 1:3), paste0("var", 1:3)))
arr4
apply(arr4, 1, head, 1)   # margin is row
apply(arr4, 1, head, 81)  # sorts by var, time, col
apply(arr4, 2, head, 1)   # margin is col
apply(arr4, 2, head, 81)  # sorts by var, time, row
apply(arr4, 3, head, 1)   # margin is time
apply(arr4, 3, head, 81)  # sorts by var, col, row
apply(arr4, 4, head, 1)   # margin is var
apply(arr4, 4, head, 81)  # sorts by time, col, row
```
2. __[Q]{.Q}__: There's no equivalent to `split()` + `vapply()`. Should there be? When
would it be useful? Implement one yourself.
__[A]{.solved}__: We can modify the `tapply2()` approach from the book, where `split()` and `sapply()` were combined:
```{r, eval = FALSE}
v_tapply <- function(x, group, f, FUN.VALUE, ..., USE.NAMES = TRUE) {
  pieces <- split(x, group)
  vapply(pieces, f, FUN.VALUE, ..., USE.NAMES = USE.NAMES)
}
```
`tapply()` has a `simplify` argument. When you set it to `FALSE`, `tapply()` will always return a list. It is easy to create cases where the length and the types/classes of the list elements vary depending on the input. The `vapply()` version could be useful if you want to control the structure of the output, so that an error is raised according to the logic of a specific use case, or if you want type-stable output on which to build other functions.
3. __[Q]{.Q}__: Implement a pure R version of `split()`. (Hint: use `unique()` and
subsetting.) Can you do it without a for loop?
__[A]{.solved}__:
```{r, eval = FALSE}
split2 <- function(x, f, drop = FALSE, ...) {
  # There are three relevant cases for f: f is a character vector, f is a
  # factor and all levels occur, or f is a factor and some levels don't occur.
  # First we check if f is a factor.
  fact <- is.factor(f)
  # If drop is set to TRUE, we drop the non-occurring levels.
  # (If f is a character vector, this has no effect.)
  if (drop) {f <- f[, drop = TRUE]}
  # Now we want all unique elements/levels of f.
  levs <- if (fact) {levels(f)} else {as.character(unique(f))}
  # We use these levels to subset x and supply names for the resulting output.
  setNames(lapply(levs, function(lv) x[f == lv, , drop = FALSE]), levs)
}
```
4. __[Q]{.Q}__: What other types of input and output are missing? Brainstorm before you look up some answers in the [plyr paper](http://www.jstatsoft.org/v40/i01/).
__[A]{.solved}__: From the suggested plyr paper we can extract many possible combinations and list them in a table. Sean C. Anderson has already done this, based on a presentation by Hadley Wickham, and provides the result [here](http://seananderson.ca/2013/12/01/plyr.html).
| object type | array | data frame | list | nothing |
|--------------------|-------------|--------------|-------------|-----------|
| array | `apply` | `.` | `.` | `.` |
| data frame | `.` | `aggregate` | `by` | `.` |
| list | `sapply` | `.` | `lapply` | `.` |
| n replicates | `replicate` | `.` | `replicate` | `.` |
| function arguments | `mapply` | `.` | `mapply` | `.` |
Note the column _nothing_, which is specifically for use cases where side effects such as plotting or writing data are intended.
### Manipulating lists
1. __[Q]{.Q}__: Why isn't `is.na()` a predicate function? What base R function is closest
to being a predicate version of `is.na()`?
__[A]{.solved}__: Because a predicate function always returns a single `TRUE` or `FALSE`, while `is.na(NULL)` returns `logical(0)`, which excludes `is.na()` from being a predicate function. The closest base R function that we are aware of is `anyNA()`, if applied elementwise.
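The difference is easy to see on `NULL` input:

```{r}
is.na(NULL)      # logical(0), not a single TRUE or FALSE
anyNA(NULL)      # FALSE
anyNA(c(1, NA))  # TRUE
```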
2. __[Q]{.Q}__: Use `Filter()` and `vapply()` to create a function that applies a summary
statistic to every numeric column in a data frame.
__[A]{.solved}__:
```{r, eval = FALSE}
vapply_num <- function(X, FUN, FUN.VALUE) {
  vapply(Filter(is.numeric, X), FUN, FUN.VALUE)
}
```
3. __[Q]{.Q}__: What's the relationship between `which()` and `Position()`? What's
the relationship between `where()` and `Filter()`?
__[A]{.solved}__: `which()` returns all indices of `TRUE` entries in a logical vector. `Position()` returns only the first (by default) or the last integer index among all elements for which a predicate function returns `TRUE`. So the default relation is `Position(f, x) <=> min(which(f(x)))`.
`where()`, defined in the book as:
```{r, eval = FALSE}
where <- function(f, x) {
  vapply(x, f, logical(1))
}
```
returns a logical vector obtained by applying a predicate function to the elements of a list or a data frame. `Filter(f, x)` returns all elements of a list or a data frame for which the supplied predicate function returns `TRUE`. So the relation is
`Filter(f, x) <=> x[where(f, x)]`.
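Both relations can be checked on a small example (we repeat the `where()` definition so the chunk is self-contained):

```{r}
where <- function(f, x) vapply(x, f, logical(1))
x <- list(1, "a", "b", 2)

Position(is.character, x)           # 2
min(which(where(is.character, x)))  # 2
identical(Filter(is.character, x), x[where(is.character, x)])  # TRUE
```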
4. __[Q]{.Q}__: Implement `Any()`, a function that takes a list and a predicate function,
and returns `TRUE` if the predicate function returns `TRUE` for any of
the inputs. Implement `All()` similarly.
__[A]{.solved}__: `Any()`:
```{r, eval = FALSE}
Any <- function(l, pred) {
  stopifnot(is.list(l))
  for (i in seq_along(l)) {
    if (pred(l[[i]])) return(TRUE)
  }
  return(FALSE)
}
```
`All()`:
```{r, eval = FALSE}
All <- function(l, pred) {
  stopifnot(is.list(l))
  for (i in seq_along(l)) {
    if (!pred(l[[i]])) return(FALSE)
  }
  return(TRUE)
}
```
5. __[Q]{.Q}__: Implement the `span()` function from Haskell: given a list `x` and a
predicate function `f`, `span` returns the location of the longest
sequential run of elements where the predicate is true. (Hint: you
might find `rle()` helpful.)
__[A]{.solved}__: Our `span_r()` function returns the first index of the longest sequential run of elements where the predicate is true. If there is more than one run of maximal length, the first index of each of them is returned.
```{r, eval = FALSE}
span_r <- function(l, pred) {
  # We test if l is a list.
  stopifnot(is.list(l))
  # We preallocate a logical vector and save the result
  # of the predicate function applied to each element of the list.
  test <- vector("logical", length(l))
  for (i in seq_along(l)) {
    test[i] <- pred(l[[i]])
  }
  # We return NA if the output of pred is always FALSE.
  if (!any(test)) return(NA_integer_)
  # Otherwise we look at the run length encoding of the TRUE and FALSE values.
  rle_test <- rle(test)
  # Since more than one maximal series of TRUEs might appear, we have to
  # implement some logic, which is easier if we save the rle output in a
  # data.frame.
  rle_test <- data.frame(lengths = rle_test[["lengths"]],
                         values = rle_test[["values"]],
                         cumsum = cumsum(rle_test[["lengths"]]))
  # The first index in the original list for every run:
  rle_test[["first_index"]] <- rle_test[["cumsum"]] - rle_test[["lengths"]] + 1
  # Next we calculate a column giving the maximum run length among all runs
  # with the value TRUE.
  rle_test[["max"]] <- max(rle_test[rle_test[, "values"] == TRUE, ][, "lengths"])
  # Finally we subset for the maximal length among all TRUE runs and return the
  # corresponding "first index":
  rle_test[rle_test$lengths == rle_test$max & rle_test$values == TRUE, ]$first_index
}
```
### List of functions
1. __[Q]{.Q}__: Implement a summary function that works like `base::summary()`, but uses a
list of functions. Modify the function so it returns a closure, making it
possible to use it as a function factory.
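A minimal sketch of the function-factory variant could look as follows (the names `make_summary` and `my_summary` are our own choices):

```{r}
# Function factory: given a named list of functions, return a closure
# that applies each of them to its input.
make_summary <- function(funs) {
  function(x, na.rm = TRUE) {
    vapply(funs, function(f) f(x, na.rm = na.rm), numeric(1))
  }
}

my_summary <- make_summary(list(min = min, mean = mean, max = max))
my_summary(c(1, 5, NA, 9))  # min: 1, mean: 5, max: 9
```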
1. __[Q]{.Q}__: Which of the following commands is equivalent to `with(x, f(z))`?
(a) `x$f(x$z)`.
(b) `f(x$z)`.
(c) `x$f(z)`.
(d) `f(z)`.
(e) It depends.
### Mathematical functionals
1. __[Q]{.Q}__: Implement `arg_max()`. It should take a function and a vector of inputs,
and return the elements of the input where the function returns the highest
value. For example, `arg_max(-10:5, function(x) x ^ 2)` should return -10.
`arg_max(-5:5, function(x) x ^ 2)` should return `c(-5, 5)`.
Also implement the matching `arg_min()` function.
__[A]{.solved}__: `arg_max()`:
```{r, eval = FALSE}
arg_max <- function(x, f) {
  x[f(x) == max(f(x))]
}
```
`arg_min()`:
```{r, eval = FALSE}
arg_min <- function(x, f) {
  x[f(x) == min(f(x))]
}
```
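A quick check against the expected outputs stated in the question (the definitions are repeated so the chunk runs standalone):

```{r}
arg_max <- function(x, f) x[f(x) == max(f(x))]
arg_min <- function(x, f) x[f(x) == min(f(x))]

arg_max(-10:5, function(x) x ^ 2)  # -10
arg_max(-5:5, function(x) x ^ 2)   # -5  5
arg_min(-5:5, function(x) x ^ 2)   # 0
```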
2. __[Q]{.Q}__: Challenge: read about the
[fixed point algorithm](https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book-Z-H-12.html#%25_idx_1096).
Complete the exercises using R.
### A family of functions
1. __[Q]{.Q}__: Implement `smaller` and `larger` functions that, given two inputs, return
either the smaller or the larger value. Implement `na.rm = TRUE`: what
should the identity be? (Hint:
`smaller(x, smaller(NA, NA, na.rm = TRUE), na.rm = TRUE)` must be `x`, so
`smaller(NA, NA, na.rm = TRUE)` must be bigger than any other value of x.)
Use `smaller` and `larger` to implement equivalents of `min()`, `max()`,
`pmin()`, `pmax()`, and new functions `row_min()` and `row_max()`.
__[A]{.solved}__: We can do almost everything as shown in the case study in the textbook. First we define the functions `smaller_()` and `larger_()`. We use the underscore suffix to build up non-suffixed versions on top, which will include the `na.rm` parameter. In contrast to the `add()` example from the book, we change two things at this step: we don't include error checking, since this is done later at the top level, and we return `NA_integer_` if any of the arguments is `NA`. (This matters when `na.rm` is set to `FALSE`; it wasn't needed in the `add()` example, since `+` already returns `NA` in this case.)
```{r}
smaller_ <- function(x, y) {
  if (anyNA(c(x, y))) {return(NA_integer_)}
  out <- x
  if (y < x) {out <- y}
  out
}
larger_ <- function(x, y) {
  if (anyNA(c(x, y))) {return(NA_integer_)}
  out <- x
  if (y > x) {out <- y}
  out
}
```
We can take `rm_na()` from the book:
```{r}
rm_na <- function(x, y, identity) {
  if (is.na(x) && is.na(y)) {
    identity
  } else if (is.na(x)) {
    y
  } else {
    x
  }
}
```
To find the identity value, we can apply the same argument as in the textbook, hence our functions are also associative and the following equation should hold:
```
3 = smaller(smaller(3, NA), NA) = smaller(3, smaller(NA, NA)) = 3
```
So the identity has to be greater than 3. When we generalise from 3 to any real number, this means that the identity has to be greater than every number, which leads us to infinity. Hence the identity has to be `Inf` for `smaller()` (and `-Inf` for `larger()`), which we implement next:
```{r}
smaller <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == 1, length(y) == 1,
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (na.rm && (is.na(x) || is.na(y))) rm_na(x, y, Inf) else smaller_(x, y)
}
larger <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == 1, length(y) == 1,
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (na.rm && (is.na(x) || is.na(y))) rm_na(x, y, -Inf) else larger_(x, y)
}
```
Just as `min()` and `max()` can act on vectors, we can implement this easily for our new functions. As shown in the book, we also have to set the `init` parameter to the identity value.
```{r}
r_smaller <- function(xs, na.rm = TRUE) {
  Reduce(function(x, y) smaller(x, y, na.rm = na.rm), xs, init = Inf)
}
# some tests
r_smaller(c(1:3, 4:(-1)))
r_smaller(NA, na.rm = TRUE)
r_smaller(numeric())

r_larger <- function(xs, na.rm = TRUE) {
  Reduce(function(x, y) larger(x, y, na.rm = na.rm), xs, init = -Inf)
}
# some tests
r_larger(c(1:3, 4:1))
r_larger(NA, na.rm = TRUE)
r_larger(numeric())
```
We can also create vectorised versions as shown in the book. To avoid being too verbose, we only show the `smaller()` case:
```{r}
v_smaller1 <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == length(y),
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  if (length(x) == 0) return(numeric())
  simplify2array(
    Map(function(x, y) smaller(x, y, na.rm = na.rm), x, y)
  )
}
v_smaller2 <- function(x, y, na.rm = FALSE) {
  stopifnot(length(x) == length(y),
            is.numeric(x) | is.logical(x), is.numeric(y) | is.logical(y))
  vapply(seq_along(x), function(i) smaller(x[i], y[i], na.rm = na.rm),
         numeric(1))
}
# Both versions give the same results
v_smaller1(1:10, c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9))
v_smaller2(1:10, c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9))
v_smaller1(numeric(), numeric())
v_smaller2(numeric(), numeric())
v_smaller1(c(1, NA), c(1, NA), na.rm = FALSE)
v_smaller2(c(1, NA), c(1, NA), na.rm = FALSE)
v_smaller1(NA, NA)
v_smaller2(NA, NA)
```
Of course, we can also copy and paste the rest from the textbook to solve the last part of the exercise:
```{r}
row_min <- function(x, na.rm = FALSE) {
  apply(x, 1, r_smaller, na.rm = na.rm)
}
col_min <- function(x, na.rm = FALSE) {
  apply(x, 2, r_smaller, na.rm = na.rm)
}
arr_min <- function(x, dim, na.rm = FALSE) {
  apply(x, dim, r_smaller, na.rm = na.rm)
}
```
2. __[Q]{.Q}__: Create a table that has _and_, _or_, _add_, _multiply_, _smaller_, and
_larger_ in the columns and _binary operator_, _reducing variant_,
_vectorised variant_, and _array variants_ in the rows.
a) Fill in the cells with the names of base R functions that perform each of
the roles.
a) Compare the names and arguments of the existing R functions. How
consistent are they? How could you improve them?
a) Complete the matrix by implementing any missing functions.
__[A]{.solved}__: In the following table we list the requested base R functions that we are aware of:
| | and | or | add | multiply | smaller | larger |
|------------|----------|----------|----------|----------|----------|----------|
| binary | `&&` | `||` | | | | |
| reducing | `all` | `any` | `sum` | `prod` | `min` | `max` |
| vectorised | `&` | `|` | `+` | `*` | `pmin` | `pmax` |
| array | | | | | | |
Notice that we were relatively strict about the _binary_ row. Since the _vectorised_ and _reducing_ versions are more general than the _binary_ versions, we could have used them twice. However, this doesn't seem to be the intention of this exercise.
The last part of this exercise can be solved by copying and pasting from the book and from the last exercise for the _binary_ row, and by combining `apply()` with the _reducing_ versions for the _array_ row. We think the array functions just need a dimension and an `na.rm` argument. We don't know how we would name them, but something like `sum_array(1, na.rm = TRUE)` could be ok.
The second part of the exercise is hard to solve completely. In our opinion there are two important points: the behaviour for special inputs like `NA`, `NaN`, `NULL` and zero-length atomics should be consistent, and all versions should have an `na.rm` argument for which the functions also behave consistently. In the following table we show the output of `` `f`(x, 1) ``, where `f` is the function in the first column and `x` is the special input in the header (the named functions also have an `na.rm` argument, which is `FALSE` by default). The order of the arguments is important because of lazy evaluation.
| | `NA` | `NaN` | `NULL` | `logical(0)` | `integer(0)` |
|---------|----------|----------|--------------|--------------|--------------|
| `&&` | `NA` | `NA` | `error` | `NA` | `NA` |
| `all` | `NA` | `NA` | `TRUE` | `TRUE` | `TRUE` |
| `&` | `NA` | `NA` | `error` | `logical(0)` | `logical(0)` |
| `||` | `TRUE` | `TRUE` | `error` | `TRUE` | `TRUE` |
| `any` | `TRUE` | `TRUE` | `TRUE` | `TRUE` | `TRUE` |
| `|` | `TRUE` | `TRUE` | `error` | `logical(0)` | `logical(0)` |
| `sum` | `NA` | `NaN` | `1` | `1` | `1` |
| `+` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` | `numeric(0)` |
| `prod` | `NA` | `NaN` | `1` | `1` | `1` |
| `*` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` | `numeric(0)` |
| `min` | `NA` | `NaN` | `1` | `1` | `1` |
| `pmin` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` | `numeric(0)` |
| `max` | `NA` | `NaN` | `1` | `1` | `1` |
| `pmax` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` | `numeric(0)` |
We can see that the vectorised and reducing numerical functions are all consistent. The logical functions are not: the first three return `NA` for `NA` and `NaN`, while the fourth to sixth return `TRUE`. It would be more consistent if the first three returned `FALSE`, or if all of them returned `NA` and gained an extra `na.rm` argument. It seems relatively hard to find a simple rule for all cases, and especially the different behaviour for `NULL` is confusing. Another good way to organise the functions would be to differentiate between "numerical" and "logical" operators first, and then between binary, reducing and vectorised variants, as below (we left out the last column, which is redundant because of coercion, as intended):
| `` `f(x,1)` `` | `NA` | `NaN` | `NULL` | `logical(0)` |
|----------------|----------|----------|--------------|--------------|
| `&&` | `NA` | `NA` | error | `NA` |
| `||` | `TRUE` | `TRUE` | error | `TRUE` |
| `all` | `NA` | `NA` | `TRUE` | `TRUE` |
| `any` | `TRUE` | `TRUE` | `TRUE` | `TRUE` |
| `&` | `NA` | `NA` | error | `logical(0)` |
| `|` | `TRUE` | `TRUE` | error | `logical(0)` |
| `sum` | `NA` | `NaN` | 1 | 1 |
| `prod` | `NA` | `NaN` | 1 | 1 |
| `min` | `NA` | `NaN` | 1 | 1 |
| `max` | `NA` | `NaN` | 1 | 1 |
| `+` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` |
| `*` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` |
| `pmin` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` |
| `pmax` | `NA` | `NaN` | `numeric(0)` | `numeric(0)` |
The other point is the naming conventions. We think they are clear, but it could be useful to provide the missing binary operators and name them, for example, `++`, `**`, `<>`, `><` to be consistent.
3. __[Q]{.Q}__: How does `paste()` fit into this structure? What is the scalar binary
function that underlies `paste()`? What are the `sep` and `collapse`
arguments to `paste()` equivalent to? Are there any `paste` variants
that don't have existing R implementations?
__[A]{.solved}__: `paste()` behaves like a mix of both. If you supply only length-one arguments, it behaves like a reducing function:
```{r}
paste("a", "b", sep = "")
paste("a", "b","", sep = "")
```
If you supply at least one argument of length greater than one, it behaves like a vectorised function:
```{r}
paste(1:3)
paste(1:3, 1:2)
paste(1:3, 1:2, 1)
```
We think it should be possible to implement a new `paste()` starting from
```{r}
p_binary <- function(x, y = "") {
  stopifnot(length(x) == 1, length(y) == 1)
  paste0(x, y)
}
```
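From this binary building block, a reducing variant follows the same `Reduce()` pattern as in the book (the name `p_reduce` and the `""` identity are our own choices; `p_binary()` is repeated so the chunk is self-contained):

```{r}
p_binary <- function(x, y = "") {
  stopifnot(length(x) == 1, length(y) == 1)
  paste0(x, y)
}

# Reduce the binary function over all inputs; "" is the identity for pasting.
p_reduce <- function(...) Reduce(p_binary, list(...), init = "")
p_reduce("a", "b", "c")  # "abc"
```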
The `sep` argument is equivalent to appending `sep` to every `...` input supplied to `paste()` except the last one, and then binding these results together. In relations:
```
paste(n1, n2, ...,nm , sep = sep) <=>
paste0(paste0(n1, sep), paste(n2, n3, ..., nm, sep = sep)) <=>
paste0(paste0(n1, sep), paste0(n2, sep), ..., paste0(nn, sep), paste0(nm))
```
We can check this for scalar and non-scalar input:
```{r}
# scalar:
paste("a", "b", "c", sep = "_")
paste0(paste0("a", "_"), paste("b", "c", sep = "_"))
paste0(paste0("a", "_"), paste0("b", "_"), paste0("c"))
# non scalar
paste(1:2, "b", "c", sep = "_")
paste0(paste0(1:2, "_"), paste("b", "c", sep = "_"))
paste0(paste0(1:2, "_"), paste0("b", "_"), paste0("c"))
```
`collapse` just binds the outputs for non-scalar input together with the `collapse` input.
In relations:
```
for input A1, ..., An, where Ai = (a1i, ..., ami),
paste(A1, A2, ..., An, collapse = collapse)
<=>
paste0(
  paste0(paste(a11, a12, ..., a1n), collapse),
  paste0(paste(a21, a22, ..., a2n), collapse),
  ...,
  paste0(paste(a(m-1)1, a(m-1)2, ..., a(m-1)n), collapse),
  paste(am1, am2, ..., amn)
)
```
One can see this easily from examples:
```{r}
paste(1:5, 1:5, 6, sep = "", collapse = "_x_")
paste(1, 2, 3, 4, collapse = "_x_")
paste(1:2, 1:2, 2:3, 3:4, collapse = "_x_")
```
We think the only `paste()` variant that is not implemented in base R is an array version.
At least we are not aware of something like `row_paste()` or `paste_apply()`.
## Function Factories
### Closures
__[Q1]{.Q}__: Why are functions created by other functions called closures?
__[A]{.solved}__: As stated in the book:
> because they enclose the environment of the parent function and can access all its variables.
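A minimal illustration (our own toy example): the function returned by `make_counter()` keeps access to the variable `i` defined in its enclosing (parent) environment, even after `make_counter()` has returned:

```{r}
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1  # modifies i in the enclosing environment
    i
  }
}

count <- make_counter()
count()  # 1
count()  # 2
```

Each call to `make_counter()` creates a fresh environment, so independent counters do not interfere with each other.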
__[Q2]{.Q}__: What does the following statistical function do? What would be a better
name for it? (The existing name is a bit of a hint.)
```{r}
bc <- function(lambda) {
if (lambda == 0) {
function(x) log(x)
} else {
function(x) (x ^ lambda - 1) / lambda
}
}
```
__[A]{.solved}__: It is the logarithm when `lambda` equals zero and `(x ^ lambda - 1) / lambda` otherwise. A better name might be `box_cox_transformation` (one-parametric); you can read about it [here](https://en.wikipedia.org/wiki/Power_transform).
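A quick sanity check (our own) that the factory returns the expected transformations:

```{r}
bc <- function(lambda) {
  if (lambda == 0) {
    function(x) log(x)
  } else {
    function(x) (x ^ lambda - 1) / lambda
  }
}

x <- c(0.5, 1, 2)
all.equal(bc(0)(x), log(x))            # lambda == 0: plain logarithm
all.equal(bc(2)(x), (x ^ 2 - 1) / 2)   # lambda == 2: power transform
```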
__[Q3]{.Q}__: What does `approxfun()` do? What does it return?
__[A]{.solved}__: `approxfun()` takes a set of 2-dimensional data points plus some extra specifications as arguments and returns a piecewise linear or constant interpolation function (defined on the range of the given x-values, by default).
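For example, a small demonstration of both interpolation methods:

```{r}
# Linear interpolation (the default method).
f <- approxfun(x = c(1, 2, 3), y = c(0, 10, 5))
f(1.5)  # halfway between (1, 0) and (2, 10): 5
f(4)    # outside the range of the given x-values: NA by default

# Constant interpolation: takes the y-value of the node to the left.
g <- approxfun(c(1, 2, 3), c(0, 10, 5), method = "constant")
g(1.5)  # 0
```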
__[Q4]{.Q}__: What does `ecdf()` do? What does it return?
__[A]{.solved}__: "ecdf" means empirical cumulative distribution function. For a numeric vector, `ecdf()` returns the corresponding distribution function (of class `ecdf`, which inherits from class `stepfun`). You can describe its behaviour in 2 steps: in the first part of its body, the `(x, y)` pairs for the nodes of the distribution function are calculated; in the second part these pairs are given to `approxfun()`.
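For example, the returned step function reports the proportion of observations less than or equal to its argument:

```{r}
F <- ecdf(c(1, 2, 3, 4))
F(2)      # 0.5: half of the observations are <= 2
F(2.5)    # still 0.5: the function is a right-continuous step function
F(4)      # 1: all observations are <= 4
class(F)  # "ecdf" "stepfun" "function"
```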
__[Q5]{.Q}__: Create a function that creates functions that compute the ith [central moment](http://en.wikipedia.org/wiki/Central_moment) of a numeric vector. You can test it by running the following code:
```{r, eval = FALSE}
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
stopifnot(all.equal(m1(x), 0))
stopifnot(all.equal(m2(x), var(x) * 99 / 100))
```
__[A]{.solved}__: For a discrete formulation, look [here](http://www.r-tutor.com/elementary-statistics/numerical-measures/moment):
```{r, eval = FALSE}
moment <- function(i) {
function(x) sum((x - mean(x)) ^ i) / length(x)
}
```
__[Q6]{.Q}__: Create a function `pick()` that takes an index, `i`, as an argument and returns a function with an argument `x` that subsets `x` with `i`.
```{r, eval = FALSE}
lapply(mtcars, pick(5))
# should do the same as this
lapply(mtcars, function(x) x[[5]])
```
__[A]{.solved}__:
```{r, eval = FALSE}
pick <- function(i) {
function(x) x[[i]]
}
stopifnot(identical(lapply(mtcars, pick(5)),
lapply(mtcars, function(x) x[[5]]))
)
```
### Case study: numerical integration
__[Q1]{.Q}__: Instead of creating individual functions (e.g. `midpoint()`,
`trapezoid()`, `simpson()`, etc.), we could store them in a list. If we
did that, how would that change the code? Can you create the list of
functions from a list of coefficients for the Newton-Cotes formulae?
__[A]{.started}__:
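A possible sketch (our own implementation, not necessarily the one intended by the chapter): a factory `newton_cotes()` turns a coefficient vector into an integration rule, so the list of functions follows directly from a list of coefficients via `lapply()`.

```{r}
# A closed Newton-Cotes rule evaluates f at equally spaced nodes on
# [a, b] and weights the values by the rule's coefficient vector.
newton_cotes <- function(coef) {
  force(coef)
  n <- length(coef) - 1  # number of subintervals
  function(f, a, b) {
    points <- seq(a, b, length.out = n + 1)
    (b - a) / sum(coef) * sum(f(points) * coef)
  }
}

# Standard closed Newton-Cotes weights.
coefs <- list(
  trapezoid = c(1, 1),
  simpson   = c(1, 4, 1),
  boole     = c(7, 32, 12, 32, 7)
)
rules <- lapply(coefs, newton_cotes)

rules$simpson(sin, 0, pi)  # close to the exact value 2
```

Storing the rules in a list means the integration code only has to change where a rule is looked up, e.g. `rules[[rule_name]](f, a, b)` instead of calling `simpson(f, a, b)` directly.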
__[Q2]{.Q}__: The trade-off between integration rules is that more complex rules are
slower to compute, but need fewer pieces. For `sin()` in the range
[0, $\pi$], determine the number of pieces needed so that each rule will
be equally accurate. Illustrate your results with a graph. How do they
change for different functions? `sin(1 / x^2)` is particularly challenging.
__[A]{.started}__:
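A possible starting point (our own sketch; `composite()` and `pieces_needed()` are hypothetical helpers, and `midpoint()`/`simpson()` are defined here for self-containment): wrap a rule into a composite version over `n` pieces and double `n` until the result is within a tolerance of the known value.

```{r}
# Apply an integration rule piecewise over n subintervals of [a, b].
composite <- function(f, a, b, n, rule) {
  points <- seq(a, b, length.out = n + 1)
  total <- 0
  for (i in seq_len(n)) {
    total <- total + rule(f, points[i], points[i + 1])
  }
  total
}

midpoint <- function(f, a, b) (b - a) * f((a + b) / 2)
simpson  <- function(f, a, b) {
  (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))
}

# Smallest power-of-two number of pieces reaching the given accuracy.
pieces_needed <- function(f, a, b, rule, exact, tol = 1e-6) {
  n <- 1
  while (abs(composite(f, a, b, n, rule) - exact) > tol) n <- n * 2
  n
}

# The exact value of the integral of sin over [0, pi] is 2;
# Simpson typically needs far fewer pieces than midpoint.
pieces_needed(sin, 0, pi, midpoint, exact = 2)
pieces_needed(sin, 0, pi, simpson, exact = 2)
```

Plotting the resulting piece counts per rule (e.g. on a log scale) would then illustrate the trade-off; for a highly oscillatory integrand such as `sin(1 / x^2)` all rules need far more pieces near zero.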
## S3
__[Q1]{.Q}__: The most important S3 objects in base R are factors, data frames, difftimes, and date/times (Dates, POSIXct, POSIXlt). You've already seen the attributes and base type that factors are built on. What base types and attributes are the others built on?
__[A]{.started}__: TODO: Add answer for difftime.
**data frame:** Data frames are built on top of (named) lists. Together with the `row.names` attribute and after setting the class to "data.frame", we get a classical data frame:
```{r}
df_build <- structure(list(1:2, 3:4),
names = c("a", "b"),
row.names = 1:2,
class = "data.frame")
df_classic <- data.frame(a = 1:2, b = 3:4)
identical(df_build, df_classic)
```
**date/times (Dates, POSIXct, POSIXlt):** A Date is just a double with the class attribute set to "Date":
```{r}
date_build <- structure(0, class = "Date")
date_classic <- as.Date("1970-01-01")
identical(date_build, date_classic)
```
POSIXct is a class for date/times that inherits from POSIXt and is built on doubles as well. The only attribute is `tzone` (for the timezone):
```{r}
POSIXct_build <- structure(1, class = c("POSIXct", "POSIXt"), tzone = "CET")
POSIXct_classic <- .POSIXct(1, tz = "CET") # note that tz's default is NULL
identical(POSIXct_build, POSIXct_classic)
```
POSIXlt is another date/time class that inherits from POSIXt. It is built on top of a named list and a tzone attribute. Differences between POSIXct and POSIXlt are described in `?DateTimeClasses`.
```{r}
POSIXlt_build <- structure(list(sec = 30,
min = 30L,
hour = 14L,
mday = 1L,
mon = 0L,
year = 70L,
wday = 4L,
yday = 0L,
isdst = 0L,
zone = "CET",
gmtoff = 3600L),
tzone = c("", "CET", "CEST"),
class = c("POSIXlt", "POSIXt"))
POSIXlt_classic <- as.POSIXlt(.POSIXct(13.5 * 3600 + 30))
identical(POSIXlt_build, POSIXlt_classic)
```
1. __[Q]{.Q}__: Draw a Venn diagram illustrating the relationships between functions, generics, and methods.
__[A]{.started}__: Functions don't have to be generics or methods, but both of the latter are functions. It is also possible for a function to be both a method and a generic at the same time, which seems relatively awkward and is definitely not recommended; see also `?pryr::ftype`:
> This function figures out whether the input function is a regular/primitive/internal function, a internal/S3/S4 generic, or a S3/S4/RC method. This is function is slightly simplified as it's possible for a method from one class to be a generic for another class, but that seems like such a bad idea that hopefully no one has done it.
2. __[Q]{.Q}__: Write a constructor for `difftime` objects. What base type are they built on? What attributes do they use? You'll need to consult the documentation, read some code, and perform some experiments.
__[A]{.solved}__: Our constructor should be named `new_class_name()`, have one argument for its base type and one for each attribute, and check the types of these arguments as well.
```{r}
new_difftime <- function(x, units = "auto") {
stopifnot(is.double(x), is.character(units))
structure(x, units = units, class = "difftime")
}
```
However, since the following result prints awkwardly,
```{r}
new_difftime(3)
```
we take a little more "inspiration" from the original `difftime()` function and make the corresponding changes. Basically we need to implement the logic for the `units` attribute in case it is set to `"auto"`, and convert the value of the underlying double from seconds to the corresponding unit, as commented in the following:
```{r}
new_difftime <- function(x, units = "auto") {
stopifnot(is.double(x), is.character(units))
# case units == "auto":
if (units == "auto")
# when all time differences are NA, units should be "secs"
units <- if (all(is.na(x))) {
"secs"
} else {
# otherwise set the units according to the minimal time difference
x_min <- min(abs(x), na.rm = TRUE)
if (!is.finite(x_min) || x_min < 60) {
"secs"
} else if (x_min < 3600) {
"mins"
} else if (x_min < 86400) {
"hours"
} else {
"days"
}
}
# we rescale the underlying double according to the units
x <- switch(units,
            secs = x,
            mins = x / 60,
            hours = x / 3600,
            days = x / 86400,
            weeks = x / (7 * 86400))
structure(x, units = units, class = "difftime")
}