# Wrangling data with `tidyverse`: Rows, Columns, and Groups
```{r setup, include = FALSE, purl = FALSE}
library(tidyverse)
berry_data <- read_csv("data/clt-berry-data.csv")
```
A common saying in data science is that about 90% of the effort in an analysis workflow is in getting data wrangled into the right format and shape, and 10% is actual analysis. In a point-and-click program like SPSS or XLSTAT we don't think about this as much, because the activity of reshaping the data--making it longer or wider as required, finding and cleaning missing values, selecting columns or rows, etc.--is often temporally and programmatically separated from the "actual" analysis.
In `R`, this can feel a bit different because we are using the same interface to manipulate our data and to analyze it. Sometimes we'll want to jump back out to a spreadsheet program like Excel or even the command line (the "shell" like `bash` or `zsh`) to make some changes. But in general the tools for manipulating data in `R` are both more powerful and more easily used than doing these activities by hand, and you will make yourself a much more effective analyst by mastering these basic tools.
Here, we are going to emphasize the set of tools from the [`tidyverse`](https://www.tidyverse.org/), which are extensively documented in Hadley Wickham and Garrett Grolemund's book [*R for Data Science*](https://r4ds.had.co.nz/). If you want to learn more, start there!
<center>
![The `tidyverse` is associated with this hexagonal iconography.](img/tidyverse-iconography.png)
</center>
Before we move on to actually learning the tools, let's make sure we've got our data loaded up.
```{r reload the berry data in case you started your session over}
library(tidyverse)
berry_data <- read_csv("data/clt-berry-data.csv")
```
## Subsetting data
R's system for **indexing** data frames is clear and sensible for those who are used to programming languages, but not necessarily easy to read.
A common situation in R is wanting to select some rows and some columns of our data--this is called "**subsetting**" our data. But this is less easy than it might be for the beginner in R. Happily, the `tidyverse` methods are much easier to read (and modeled after syntax from **SQL**, which may be helpful for some users). Starting with...
### `select()` for columns
The first thing we often want to do in a data analysis is to pick a subset of columns, which usually represent variables. If we take a look at our berry data, we see that, for example, we have some columns that contain the answers to survey questions, some columns that are about the test day or panelist, and some that identify the panelist or berry:
```{r use glimpse() to examine the berry data}
glimpse(berry_data)
```
So, for example, we might want to determine whether there are differences in liking between the berries, test days, or scales, in which case perhaps we only want the columns starting with `us_`, `lms_`, and `9pt_` along with `berry` and `test_day`. We learned previously that we can do this with numeric indexing:
```{r reminder of base R subsetting with column numbers}
berry_data[, c(10,25,28,45,56:64,81,89:91)]
```
However, this is both difficult for `R` novices and difficult to read for anyone not intimately familiar with the data. It is also rather fragile--what if someone else rearranged the data in your import file? You're just selecting columns by their order, so column 91 is not guaranteed to contain the berry type every time. Not to mention all of the counting was annoying!
Subsetting using names is a little more stable, but still not that readable and takes a lot of typing.
```{r reminder of base R subsetting with column names}
berry_data[, c("9pt_aroma","9pt_overall","9pt_taste","9pt_texture","9pt_appearance",
"lms_aroma","lms_overall","lms_taste","lms_texture","lms_appearance",
"us_aroma","us_overall","us_taste","us_texture","us_appearance",
"berry","test_day")]
```
The `select()` function in `tidyverse` (actually from the `dplyr` package) is the smarter, easier way to do this. It works on data frames, and it can be read as "`select()` the columns from \<data frame\> that meet the criteria we've set."
`select(<data frame>, <column 1>, <column 2>, ...)`
The simplest way to use `select()` is just to name the columns you want!
```{r using select() with explicit column names}
select(berry_data, berry, test_day,
lms_aroma, lms_overall, lms_taste, lms_texture, lms_appearance) # note the lack of quoting on the column names
```
This is much clearer to the reader.
Getting rid of the double quotes we previously saw around `"column names"` can have some consequences, though. R variable names can *only* contain letters, numbers, underscores (`_`), and periods (`.`)--no spaces ( ) or dashes (-)--and can't start with numbers, but the tidyverse *will* let you make column names that break these rules. So if you try to do the exact same thing as before to get the data from the 9-point hedonic scale instead of the labeled magnitude scale:
```{r errors from select() with unusual column names, eval = FALSE}
# this will cause an error
select(berry_data, berry, test_day,
9pt_aroma, 9pt_overall, 9pt_taste, 9pt_texture, 9pt_appearance)
```
You'll run into an error. The solution to this is a different kind of quote called a backtick (`) which is usually next to the number 1 key in the very top left of QWERTY (US and UK) keyboards. On QWERTZ and AZERTY keyboards, it may be next to the backspace key or one of the alternate characters made by the 7 key. Look [at this guide](https://superuser.com/questions/254076/how-do-i-type-the-tick-and-backtick-characters-on-windows) for help finding it in common international keyboard layouts.
```{r using select() with escaped column names}
select(berry_data, berry, test_day, # these "syntactic" column names don't need escaping
`9pt_aroma`, # only ones starting with numbers...
`9pt_overall`, `9pt_taste`, `9pt_texture`, `9pt_appearance`,
`Sample Name`) # or containing spaces
```
The backticks are only necessary when a column name breaks one of the variable-naming rules, and RStudio will usually fill them in for you if you use tab autocompletion when writing your `select()` calls and other `tidyverse` functions.
You can also use `select()` with a number of helper functions, which use logic to select columns that meet whatever conditions you set. For example, the `starts_with()` helper function lets us give a set of characters we want columns to start with:
```{r using select() with a character search}
select(berry_data, starts_with("lms_")) #the double-quotes are back because this isn't a full column name
```
There are equivalents for the end of column names (`ends_with()`) and text found anywhere in the name (`contains()`).
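For example, a quick sketch of each on its own, using columns we've already seen in `berry_data`:
```{r ends_with() and contains() are also select helpers}
select(berry_data, ends_with("_overall")) # every column ending in "_overall"
select(berry_data, contains("expectation")) # every column with "expectation" anywhere in its name
```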
You can combine these statements together to get subsets of columns however you want:
```{r combining character search and explicit select()}
select(berry_data, starts_with("lms_"), starts_with("9pt_"), starts_with("us_"),
berry, test_day)
select(berry_data, starts_with("jar_"), ends_with("_overall"),
berry, test_day)
```
If you're annoyed at the order that your columns are printing in and how far to the right you have to scroll in your previews, `select()` can also be used to rearrange your columns with a helper function called `everything()`:
```{r everything() is a select helper that selects everything (else)}
select(berry_data, everything()) #This selects everything--it doesn't change at all.
select(berry_data, berry, contains("_overall"),
everything()) #only type the columns you want to move to the front, and everything() will keep the rest
```
`everything()` can be very useful for programming, but if you just need to rearrange, an even easier function is `relocate()`, which moves whatever columns you specify to the "left" or "right" of the `data.frame`; you can even get more specific and move them to the left or right of specific columns:
```{r relocate() moves one or more columns}
# By default relocate() moves the selected column(s) to the left of the data.frame
relocate(berry_data, us_overall)
# ".before/.after = " specify a position relative to another column
relocate(berry_data, `Subject Code`, .before = Gender)
```
You may sometimes want to select columns where the data meets some criteria, rather than the column name or position. The `where()` helper function allows you to insert this kind of programmatic logic into `select()` statements, which gives you the ability to specify columns using any arbitrary function.
```{r using where() with select() is more advanced}
#a few functions return one logical value per vector
select(berry_data, where(is.numeric))
#otherwise we can use "lambda" functions
select(berry_data, where(~!any(is.na(.))))
```
The little squiggly symbol, called a tilde (`~`), is above the backtick on QWERTY keyboards, and it turns everything that comes after it into a "lambda function". We'll talk more about lambda functions later, when we talk about `across()`. You can read the above as "`select()` the columns from `berry_data` `where()` there are not (`!`) `any()` NA values (`is.na(.)`)".
Besides being easier to write conditions for than indexing with `[]`, `select()` is code that is much closer to how you or I think about what we're actually doing, making code that is more human readable.
### `filter()` for rows
So `select()` lets us pick which columns we want. Can we also use it to pick particular observations? No. For that, there's `filter()`.
We learned that, using `[]` indexing, we'd specify a set of rows we want. If we want the first 10 rows of `berry_data`, we'd write
```{r reminder of how to get rows from a data frame in base R}
berry_data[1:10, ]
```
Again, this is not very human readable, and if we reorganize our rows this won't be useful anymore. The `tidyverse` answer to this approach is the `filter()` function, which lets you filter your dataset into specific rows according to *data stored in the table itself*.
```{r first example of filter() with conditional logic}
# let's get survey responses that had a higher-than-average expectation
filter(berry_data, pre_expectation > 3)
```
When using `filter()`, we can specify multiple logical conditions. For example, let's get only raspberries with initially high expectations. If we want only exact matches, we can use the direct `==` operator:
```{r using the equality operator in filter()}
filter(berry_data, pre_expectation > 3, berry == "raspberry")
```
But this won't return, for example, any berries labeled as "Raspberry", with an uppercase R.
```{r case-sensitivity in filter()}
filter(berry_data, pre_expectation > 3, berry == "Raspberry")
```
Luckily, we don't have a mix of raspberries and Raspberries. Maybe we want both raspberries and strawberries that panelists had high expectations of:
```{r using the OR operator in filter()}
filter(berry_data, pre_expectation > 3, berry == "raspberry" | berry == "strawberry")
```
In `R`, the `|` means Boolean `OR`, and the `&` means Boolean `AND`. If you're trying to figure out which one to use, phrase what you want to do with your `filter()` statement by starting "keep only rows that have...". We may want to look at raspberries *and* strawberries, but we want to "keep only rows that have a `berry` type of raspberry *or* strawberry".
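As an aside, when you're matching one column against several exact values, the `%in%` operator is a tidy shorthand for chaining `==` comparisons with `|`. A sketch of the same filter as above:
```{r the in operator is shorthand for chained equality tests}
filter(berry_data, pre_expectation > 3, berry %in% c("raspberry", "strawberry"))
```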
We can combine any number of conditions with `&` and `|` to search within our data table. But this can be a bit tedious. The `stringr` package, part of `tidyverse`, gives a lot of utility functions that we can use to reduce the number of options we're individually writing out, similar to `starts_with()` or `contains()` for columns. Maybe we want the results from the first day of each berry type:
```{r using character searches in filter()}
filter(berry_data, str_detect(test_day, "Day 1"))
```
Here, the `str_detect()` function searched for any text that **contains** "Day 1" in the `test_day` column.
## Combining steps with the pipe: `%>%`
It isn't hard to imagine a situation in which we want to **both** `select()` some columns and `filter()` some rows. There are 3 ways we can do this, one of which is going to be the best for most situations.
Let's imagine we want to get only information about the berries, overall liking, and CATA attributes for blackberries.
First, we can nest functions:
```{r nesting functions}
select(filter(berry_data, berry == "blackberry"), `Sample Name`, contains("_overall"), contains("cata_"))
```
The problem with this approach is that we have to read it "inside out". First, `filter()` will happen and get us only berries whose `berry` is exactly "blackberry". Then `select()` will get `Sample Name` along with columns that match "_overall" or "cata_" in their names. Especially as your code gets complicated, this can be very hard to read (and in the bookdown, doesn't even fit on one line!).
So we might take the second approach: creating intermediates. We might first `filter()`, store that step somewhere, then `select()`:
```{r storing results as intermediates}
blackberries <- filter(berry_data, berry == "blackberry")
select(blackberries, `Sample Name`, contains("_overall"), contains("cata_"))
```
But now we have this intermediate we don't really need cluttering up our `Environment` tab. This is fine for a single step, but if you have a lot of steps in your analysis this is going to get old (and confusing) fast. You'll have to remove a lot of these intermediates using the `rm()` command to keep your `Environment` clean.
<font color="red">**warning**:</font> `rm()` will *permanently* delete whatever objects you run it on from your `Environment`, and you will only be able to restore them by rerunning the code that generated them in the first place.
```{r the rm() function deletes objects from the Environment}
rm(blackberries)
```
The final method, and what is becoming standard in modern `R` coding, is the **pipe**, which is written in `tidyverse` as `%>%`. This garbage-looking set of symbols is actually your best friend--you just don't know it yet. I use this tool constantly in my R programming, but I've been avoiding it up to this point because it wasn't a part of base R for the vast majority of its history.
OK, enough background, what the heck _is_ a pipe? The term "pipe" comes from what it does: like a pipe, `%>%` lets whatever is on its left side flow through to the right-hand side. It is easiest to read `%>%` as "**AND THEN**".
```{r the mighty pipe!}
berry_data %>% # Start with the berry_data
filter(berry == "blackberry") %>% # AND THEN filter to blackberries
select(`Sample Name`, # AND THEN select sample name, overall liking...
contains("_overall"),
contains("cata_"))
```
In this example, each place there is a `%>%` I've added a comment saying "AND THEN". This is because that's exactly what the pipe does: it passes whatever happened in the previous step to the next function. Specifically, `%>%` passes the **results** of the previous line to the **first argument** of the next line.
### Pipes require that the lefthand side be a single functional command
This means that we can't directly do something like rewrite `sqrt(1 + 2)` with `%>%`:
```{r order of operations with the pipe}
1 + 2 %>% sqrt # this is instead computing 1 + sqrt(2)
```
Instead, if we want to pipe the result of a binary operator like `+`, we need to enclose the whole expression in `()` on the line it appears in:
```{r parentheses will make the pipe work better}
(1 + 2) %>% sqrt() # Now this computes sqrt(1 + 2) = sqrt(3)
```
More complex piping is possible using the curly braces (`{}`), which create new R environments, but this is more advanced than you will generally need to be.
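If you're curious, here's a minimal sketch: inside `{}`, the lefthand result is only substituted where you explicitly write the `.` placeholder (explained in the next section).
```{r curly braces give finer control over the pipe}
2 %>% {sqrt(sum(1, .))} # computes sqrt(1 + 2), because . only goes where we put it
```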
### Pipes always pass the result of the lefthand side to the *first* argument of the righthand side
This sounds like a weird logic puzzle, but it's not, as we can see if we look at some simple math. Let's define a function for use in a pipe that computes the difference between two numbers:
```{r an example of a custom function}
subtract <- function(a, b) a - b
subtract(5, 4)
```
If we want to rewrite that as a pipe, we can write:
```{r argument order still matters with piped functions}
5 %>% subtract(4)
```
But we can't write
```{r the pipe will always send the previous step to the first argument of the next step}
4 %>% subtract(5) # this is actually making subtract(4, 5)
```
We can explicitly force the pipe to work the way we want it to by using `.` **as the placeholder for the result of the lefthand side**:
```{r using the "." pronoun lets you control order in pipes}
4 %>% subtract(5, .) # now this properly computes subtract(5, 4)
```
So, when you're using pipes, make sure that the output of the lefthand side *should* be going into the first argument of the righthand side--this is often but not always the case, especially with non-`tidyverse` functions.
### Pipes are a pain to type
Typing `%>%` is no fun. But, happily, RStudio builds in a shortcut for you: macOS is `cmd + shift + M`, Windows is `ctrl + shift + M`.
## Make new columns: `mutate()`
You hopefully are starting to be excited by the relative ease of doing some things in R with `tidyverse` that are otherwise a little bit abstruse. Here's where I think things get really, really cool. The `mutate()` function *creates a new column in the existing dataset*.
We can do this in base R by setting a new name for a column and using the assign (`<-`) operator, but this is clumsy and often requires care to maintain the proper ordering of rows. Often, we want to create a new column temporarily, or to combine several existing columns. We can do this using the `mutate()` function.
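For reference, the base `R` version looks something like this sketch (the column name `rating_change` is made up for illustration; not run here):
```{r a base R sketch of adding a column, eval = FALSE}
# base R: assigning to a column name that doesn't exist yet creates it
berry_data$rating_change <- berry_data$post_expectation - berry_data$pre_expectation
```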
Let's say that we want to create a quick categorical variable that tells us whether a berry was rated higher after the actual tasting event than the pre-tasting expectation.
We know that we can use `filter()` to get just the berries with `post_expectation > pre_expectation`:
```{r using pipes with filter()}
berry_data %>%
filter(post_expectation > pre_expectation)
```
But what if we want to see this comparison for every row, rather than only keeping the rows where it's true?
```{r first example of mutate()}
berry_data %>%
mutate(improved = post_expectation > pre_expectation) %>%
# We'll select just a few columns to help us see the result
select(`Participant Name`, `Sample Name`, pre_expectation, post_expectation, improved)
```
What does the above function do? It adds a new logical (`TRUE`/`FALSE`) column called `improved`, which is `TRUE` whenever the post-tasting expectation is higher than the pre-tasting one.
`mutate()` is a very easy way to edit your data mid-pipe. So we might want to do some calculations, create a temporary variable using `mutate()`, and then continue our pipe. **Unless we use `<-` to store our `mutate()`'d data, the results will be only temporary.**
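For example, a quick sketch of saving the result into a new object (the name `berries_improved` is ours):
```{r saving the results of mutate() with assignment}
berries_improved <- berry_data %>%
  mutate(improved = post_expectation > pre_expectation)
rm(berries_improved) # clean up, since we only made this for illustration
```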
We can use the same kind of functional logic we've been using in other `tidyverse` commands in `mutate()` to get real, powerful results. For example, it might be easier to interpret the individual ratings on the 9-point scale if we can compare them to the `mean()` of all berry ratings. We can do this easily using `mutate()`:
```{r mutate() makes new columns}
# Let's find out the average (mean) rating for the berries on the 9-point scale
berry_data$`9pt_overall` %>% #we need the backticks for $ subsetting whenever we'd need them for tidy-selecting
mean(na.rm = TRUE)
# Now, let's create a column that tells us how far each rating is from the average
berry_data %>%
mutate(difference_from_average = `9pt_overall` - mean(`9pt_overall`, na.rm = TRUE)) %>%
# Again, let's select just a few columns
select(`Sample Name`, `9pt_overall`, difference_from_average)
```
## Split-apply-combine analyses
Many basic data analyses can be described as *split-apply-combine*: *split* the data into groups, *apply* some analysis to each group, and then *combine* the results.
For example, in our `berry_data` we might want to split the data by each berry sample, calculate the average overall rating and standard deviation of the rating for each, and then generate a summary table telling us these results. Using the `filter()` and `select()` commands we've learned so far, you could probably cobble together this analysis without further tools.
However, `tidyverse` provides two powerful tools to do this kind of analysis:
1. The `group_by()` function takes a data table and groups it by **categorical** values of any column (generally don't try to use `group_by()` on a numeric variable)
2. The `summarize()` function is like `mutate()` for groups created with `group_by()`:
1. First, you specify 1 or more new columns you want to calculate for each group
2. Second, the function produces 1 value for each group for each new column
### Groups of data: Split-apply
The first of these tools is `group_by()`, which you *can* use with `mutate()` if you want to use both individual observed values (one per row) and groupwise summary statistics to calculate a new column.
If we wanted to make a column that shows how each individual person's rating differs from the mean of *that specific berry sample*:
```{r mutate() using group means}
berry_data %>%
group_by(`Sample Name`) %>%
mutate(difference_from_average = `9pt_overall` - mean(`9pt_overall`, na.rm = TRUE)) %>%
# Again, let's select just a few columns
select(`Sample Name`, `9pt_overall`, difference_from_average)
```
You might notice that the `mutate()` call hasn't been changed at all from when we did this in the last code chunk using the global average, but that the numbers are in fact different. The tibble also now tells us that it has `Groups: Sample Name [23]`, because of our `group_by()` call.
It's good practice to `ungroup()` as soon as you're done calculating groupwise statistics, so you don't cause unexpected problems later in the analysis.
```{r Grouped and ungrouped averages}
berry_data %>%
select(`Sample Name`, `9pt_overall`) %>%
group_by(`Sample Name`) %>%
mutate(group_average = mean(`9pt_overall`, na.rm = TRUE),
difference_from_average = `9pt_overall` - group_average) %>%
ungroup() %>%
mutate(grand_average = mean(`9pt_overall`, na.rm = TRUE))
```
You'll see that the `Groups` info at the top of the tibble is gone, and that thanks to `ungroup()` every single row has the same grand average.
### Split-apply-combine in fewer steps with `summarize()`
So why would we want to use `summarize()`? Well, it's not very convenient to have repeated data if we just want the average rating of each group (or any other groupwise **summary**).
Use `mutate()` when you want to add a **new column with one number for each current row**. Use `summarize()` when you need to use data from **multiple rows to create a summary**.
To accomplish the example above, we'd do the following:
```{r split-apply-combine pipeline example}
berry_summary <-
berry_data %>%
filter(!is.na(`9pt_overall`)) %>% # only rows with 9-point ratings
group_by(`Sample Name`) %>% # we will create a group for each berry sample
summarize(n_responses = n(), # n() counts number of ratings for each berry
mean_rating = mean(`9pt_overall`), # the mean rating for each species
sd_rating = sd(`9pt_overall`), # the standard deviation in rating
se_rating = sd(`9pt_overall`) / sqrt(n())) # multiple functions in 1 row
berry_summary
```
We can use this approach even to build a summary statistics table--for example, with confidence limits based on the normal distribution:
```{r simple stat summaries with split-apply-combine}
berry_summary %>%
mutate(lower_limit = mean_rating - 1.96 * se_rating,
upper_limit = mean_rating + 1.96 * se_rating)
```
Note that in the above example we use `mutate()`, *not* `summarize()`, because we had saved our summarized data. We could also have calculated `lower_limit` and `upper_limit` directly as part of the `summarize()` statement if we hadn't saved the intermediate.
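If we hadn't saved `berry_summary`, the one-step version would look something like this sketch (`summarize()` lets you use columns you've just created in later arguments):
```{r confidence limits inside a single summarize()}
berry_data %>%
  filter(!is.na(`9pt_overall`)) %>%
  group_by(`Sample Name`) %>%
  summarize(mean_rating = mean(`9pt_overall`),
            se_rating = sd(`9pt_overall`) / sqrt(n()),
            lower_limit = mean_rating - 1.96 * se_rating, # using mean_rating from above
            upper_limit = mean_rating + 1.96 * se_rating)
```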
## Groups of columns and across()
It's more common to have groups of **observations** in tidy data, reflected by categorical variables--each `Subject Code` is a group, each `berry` type is a group, each `testing_day` is a group, etc. But we can also have groups of **variables**, as we do in the `berry_data` we've been using!
We have a group of `cata_` variables, a group of liking data with subtypes `9pt_`, `lms_`, `us_`, `_overall`, `_appearance`, etc...
What if we want to count the total number of times each `cata_` attribute was used for one of the berries?
Well, we *can* do this with `summarize()`, but we'd have to type out the names of all 36 columns manually. This is what `select()` helpers are for, and we *can* use them in functions that operate on rows or groups like `filter()`, `mutate()`, and `summarize()` if we use the new `across()` function.
```{r apply the same tidy operation across() multiple columns}
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(.cols = starts_with("cata_"),
.fns = sum))
```
You might read this as: `summarize()` `across()` every `.col` that `starts_with("cata_")` by taking the `sum()`.
We can easily expand this to take multiple kinds of summaries for each column, in which case it helps to **name** the functions. `across()` uses **lists** to work with more than one function, so it will look at the **list names** (lefthand-side of the arguments in `list()`) to name the output columns:
```{r using multiple functions in across()}
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(.cols = starts_with("cata_"),
#the sum of binary cata data gives the citation frequency
.fns = list(frequency = sum,
#meanwhile, the mean gives the percentage.
percentage = mean)))
```
`across()` is capable of taking arbitrarily complicated functions, but you'll notice that we didn't include the parentheses we usually see after a function name for `sum()` and `mean()`. `across()` will just pipe in each column to the `.fns` as the first argument. That means, however, that there's nowhere for us to put additional arguments like `na.rm`.
We can use **lambda functions** to get around this. This basically just means starting each function off with a tilde (`~`) and telling `across()` where we want our `.cols` to go manually using `.x`.
Remember, the tilde is usually above the backtick on QWERTY keyboards. Try the instructions [here](https://apple.stackexchange.com/questions/286197/typing-the-tilde-character-on-a-pc-keyboard) and [here](https://www.liquisearch.com/tilde/keyboards) to type a tilde if you have a non-QWERTY keyboard. If those methods don't work, try [this guide for Italian keyboards](https://superuser.com/questions/667622/italian-keyboard-entering-tilde-and-backtick-characters-without-changin), [this guide for Spanish keyboards](https://apple.stackexchange.com/q/219603/5472), [this guide for German](https://apple.stackexchange.com/q/395677/5472), [this guide for Norwegian](https://apple.stackexchange.com/q/141066/5472), or [this guide for Swedish](https://apple.stackexchange.com/q/329085/5472) keyboards.
This will be necessary if we want to take the average of our various liking columns without those pesky `NA`s propagating.
```{r lambda functions for more complicated `across()`}
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(.cols = starts_with("9pt_"),
.fns = list(mean = mean,
sd = sd))) #All NA
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(.cols = starts_with("9pt_"),
.fns = list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE))))
```
The `across()` function is very powerful and also pretty new to the tidyverse. It's probably the least intuitive thing we're covering today other than graphs, in my opinion, but it's also leagues better than the `summarize_at()`, `summarize_if()`, and `summarize_all()` functions that came before.
You can also use `across()` to `filter()` rows based on multiple columns or `mutate()` multiple columns at once, but you don't need to worry about `across()` at all if you know exactly what columns you're working with and don't mind typing them all out!
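To close, one more sketch: `mutate()` plus `across()` converting every binary `cata_` column to `TRUE`/`FALSE` in a single step.
```{r mutating across() multiple columns at once}
berry_data %>%
  mutate(across(starts_with("cata_"), as.logical)) %>% # 0/1 becomes FALSE/TRUE
  select(`Sample Name`, starts_with("cata_"))
```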