New linter to recommend using %in%
#1875
Great idea! I find myself fixing this issue in code quite often. Small aside: for efficiency benchmarks, I would blow up the sample size substantially. At microsecond scale, it can be hard to discern true differences vs. overheads. Here's a comparison at
Note that efficiency is not actually better for random input when there's only two or three comparisons (somewhat surprisingly... my guess is that |
Here is another context which will fall under the purview of this linter (with an example inspired by R Inferno):
x <- 1:6
x == c(1, 3) # wrong answer, and no warning!
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE
x %in% c(1, 3)
#> [1] TRUE FALSE TRUE FALSE FALSE FALSE

x <- 1:5
x == c(1, 3) # wrong answer
#> Warning in x == c(1, 3): longer object length is not a multiple of shorter
#> object length
#> [1] TRUE FALSE FALSE FALSE FALSE
x %in% c(1, 3)
#> [1] TRUE FALSE TRUE FALSE FALSE
Created on 2023-01-08 with reprex v2.0.2
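To spell out why the two operators disagree in the example above, here is a minimal sketch. The variable names `targets` and `recycled` are mine, not from the thread. It relies on two base R facts: `==` recycles the shorter operand element by element, and `%in%` is a membership test built on `match()`.

```r
x <- 1:6
targets <- c(1, 3)

# `==` recycles `targets` to c(1, 3, 1, 3, 1, 3) and compares position by position,
# so only x[1] == 1 matches; x[3] is compared against 1, not 3.
recycled <- rep_len(targets, length(x))
stopifnot(identical(x == targets, x == recycled))

# `%in%` asks, for each element of x, whether it occurs anywhere in `targets`.
# In base R it is equivalent to a `match()` lookup with nomatch = 0.
in_result <- x %in% targets
stopifnot(identical(in_result, match(x, targets, nomatch = 0L) > 0L))
```

Because the length of `x` (6) is a multiple of the length of `targets` (2), the recycling is silent, which is what makes this case easy to miss.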
Note that e.g. |
Some more data on the performance aspect suggests that performance is better for OR up to 4 comparisons:
size <- 1e7
seed <- 1304L
max_cmp <- 5L
l <- withr::with_seed(seed, sample(letters, size = size, replace = TRUE))
for (n_compare in seq_len(max_cmp - 1L) + 1L) {
  cmps <- head(letters, n_compare)
  code_or <- parse(text = paste0("l == '", cmps, "'", collapse = " | "))
  code_in <- parse(text = paste0("l %in% ", deparse(cmps)))
  cat("## ", n_compare, " comparisons:\n", sep = "")
  print(microbenchmark::microbenchmark(
    eval(code_or),
    eval(code_in),
    times = 30L
  ))
}
Output on my machine:
|
The same seems to be true also for their memory footprints, but only for 2 comparisons:
size <- 1e7
seed <- 1304L
max_cmp <- 5L
l <- withr::with_seed(seed, sample(letters, size = size, replace = TRUE))
for (n_compare in seq_len(max_cmp - 1L) + 1L) {
  cmps <- head(letters, n_compare)
  code_or <- parse(text = paste0("l == '", cmps, "'", collapse = " | "))
  code_in <- parse(text = paste0("l %in% ", deparse(cmps)))
  cat("## ", n_compare, " comparisons: -------------------\n", sep = "")
  print(bench::mark(
    "or" = eval(code_or),
    "in" = eval(code_in),
    min_iterations = 30L,
    filter_gc = FALSE,
  )[1:5])
  cat("\n")
}
#> ## 2 comparisons: -------------------
#> # A tibble: 2 × 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 or           81.5ms   84.1ms     11.6      114MB
#> 2 in            156ms  162.8ms      6.01     153MB
#>
#> ## 3 comparisons: -------------------
#> # A tibble: 2 × 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 or            137ms    139ms      7.09     191MB
#> 2 in            143ms    146ms      6.72     153MB
#>
#> ## 4 comparisons: -------------------
#> # A tibble: 2 × 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 or            194ms    201ms      4.66     267MB
#> 2 in            167ms    174ms      5.59     153MB
#>
#> ## 5 comparisons: -------------------
#> # A tibble: 2 × 5
#>   expression      min   median `itr/sec` mem_alloc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 or            251ms    259ms      3.77     343MB
#> 2 in            148ms    156ms      6.20     153MB
Created on 2024-06-25 with reprex v2.1.0
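One plausible reading of the `mem_alloc` column for the OR variant, offered as a back-of-the-envelope sketch rather than a profiled result: chaining k comparisons creates k logical vectors from `==` plus k - 1 more from `|`, and R stores logicals in 4 bytes each. The assumption that `bench` reports sizes in units of 2^20 bytes is mine.

```r
n <- 1e7                  # length of `l` in the benchmark above
bytes_per_logical <- 4    # R stores each logical element in 4 bytes
k <- 2:5                  # number of comparisons chained with `|`

# k results of `==` plus (k - 1) results of `|`
n_vectors <- 2 * k - 1
est_mb <- n_vectors * n * bytes_per_logical / 2^20

estimate <- data.frame(
  comparisons = k,
  logical_vectors = n_vectors,
  estimated_mb = round(est_mb)
)
print(estimate)
```

Under these assumptions the estimates come out at 114, 191, 267 and 343 MB, which lines up with the reported figures for "or". The estimate says nothing about the roughly constant 153MB reported for `%in%`.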
Just to have a better view of this:
library(ggplot2)
size <- 1e7
seed <- 1304
max_cmp <- 10
l <- withr::with_seed(seed, sample(letters, size = size, replace = TRUE))
res <- bench::press(
  n_compare = 1:10,
  {
    cmps <- head(letters, n_compare)
    code_or <- parse(text = paste0("l == '", cmps, "'", collapse = " | "))
    code_in <- parse(text = paste0("l %in% ", deparse(cmps)))
    bench::mark(
      "or" = eval(code_or),
      "in" = eval(code_in),
      min_iterations = 20,
      filter_gc = FALSE,
    )
  }
)
#> Running with:
#>    n_compare
#>  1         1
#>  2         2
#>  3         3
#>  4         4
#>  5         5
#>  6         6
#>  7         7
#>  8         8
#>  9         9
#> 10        10

ggplot(res, aes(n_compare, median, color = as.character(expression))) +
  geom_line() +
  geom_point() +
  labs(y = "Seconds", color = "")

ggplot(res, aes(n_compare, mem_alloc/1e6, color = as.character(expression))) +
  geom_line() +
  geom_point() +
  labs(y = "Memory (MB)", color = "")

I sometimes see code like the following:
It would be good to recommend using %in% in this situation. %in% is more readable and more efficient.
Created on 2022-12-23 with reprex v2.0.2.9000
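For illustration only, here is a sketch of the kind of rewrite such a linter would suggest. The input vector is hypothetical and is not the snippet from the original report; the pattern mirrors the `l == 'a' | l == 'b'` expressions benchmarked in the comments. It also shows one behavioural difference worth keeping in mind when applying the suggestion: the two forms treat missing values differently.

```r
# Hypothetical input, chosen to include a missing value.
x <- c("a", "b", "c", NA)

# Pattern the linter would flag: chained equality tests joined by `|`.
or_result <- x == "a" | x == "b"

# Suggested replacement.
in_result <- x %in% c("a", "b")

or_result  # TRUE TRUE FALSE NA     (`==` propagates NA)
in_result  # TRUE TRUE FALSE FALSE  (`%in%` never returns NA)
```

When the result is used for subsetting, `x[or_result]` keeps an `NA` element while `x[in_result]` drops it, so the replacement is not always a pure refactor.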