forked from jlahne/eurosense-tutorial-2022
-
Notifications
You must be signed in to change notification settings - Fork 0
/
5-untidyanalysis.Rmd
327 lines (229 loc) · 18.2 KB
/
5-untidyanalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
# Untidy Data Analysis
```{r setup, include = FALSE, purl = FALSE}
library(tidyverse)
library(ca)
berry_data <- read_csv("data/clt-berry-data.csv")
```
## Correspondence Analysis Overview
Remember what we said about most of the work of analysis being data wrangling? Now that we're about 2/3 of the way through this workshop, it's finally time to talk **data analysis**. The shape you need to wrangle your data *into* is determined by the analysis you want to run. Today, we have a bunch of **categorical data** and we want to understand the overall patterns within it. Which berries are **similar** and which are **different** from each other? Which sensory attributes are driving those differences?
One analysis well-suited to answer these questions for CATA and other categorical data is **Correspondence Analysis**. It's a singular value decomposition-based dimensionality-reduction method, so it's similar to Principal Component Analysis, but can be used when individual observations don't have numerical responses, or when the distances between those numbers (e.g., rankings) aren't meaningful.
We're using Correspondence Analysis as an example, so you can see the tidyverse in action! This is not intended to be a statistics lesson. If you want to understand the theoretical and mathematical underpinnings of Correspondence Analysis and its variations, we recommend Michael Greenacre's book [*Correspondence Analysis in Practice*](https://doi.org/10.1201/9781315369983). It was written by the author of the R package we'll be using today (`ca`), and includes R code in the appendix.
### The CA package
Let's take a look at this `ca` package!
```{r loading the ca package and looking at help files, eval = FALSE}
library(ca)
?ca
```
The first thing you'll see is that it wants, specifically, a data frame or matrix as `obj`. We'll be ignoring the `formula` option, because a frequency table stored as a matrix is more flexible: you can use it in other functions like those in the `FactoMineR` package.
A **frequency table** or **contingency table** is one way of numerically representing multiple categorical variables measured on the same set of observations. Each row represents one group of observations, each column represents one level of a given categorical variable, and the cells are filled with the number of observations in that group that fall into that level of the categorical variable.
The help file shows us an example of a frequency table included in base R, so we can take a look at the shape we need to match:
```{r example contingency table}
data("author")
str(author)
head(author)
```
Now let's take a look at the **tidy** CATA data we need to convert into a frequency table:
```{r example of tidy CATA data}
berry_data %>%
select(`Sample Name`, `Subject Code`, starts_with("cata_"))
```
This is a pretty typical way for data collection software to save data from categorical questions like CATA and ordinal questions like JAR: one column per attribute. It's also, currently, a `tibble`, which is not in the list of objects that the `ca()` function will take.
We need to **untidy** our data to do this analysis, which is pretty common, but the `tidyverse` functions are still going to make it much easier than trying to reshape the data in base R. The makers of the packages have included some helpful functions for converting data *out of the tidyverse*, which we'll cover in a few minutes.
## Categorical, Character, Binomial, Binary, and Count data
Now, you might notice that, for categorical data, there are an awful lot of numbers in both of these tables. Without giving a whole statistics lesson, I want to take a minute to stress that the **data type** in R or another statistical software (`logical > integer > numeric > character`) is **not** necessarily the same as your statistical **level of measurement** (categorical/numerical or nominal/ordinal/interval/ratio).
It is your job as a sensory scientist and data analyst to understand what kind of data you have *based on how it was collected*, and select appropriate analyses accordingly.
In `berry_data$cata_*`, the data is represented as a 1 if that panelist checked that attribute for that sample, and a 0 otherwise. This could also be represented as a logical `TRUE` or `FALSE`, but the 0 and 1 convention makes binomial statistics most common, so you'll see it a lot.
You can also pretty easily convert between binomial/binary data stored as `numeric` 0s and 1s and `logical` data.
```{r converting binary data from numeric to logical}
testlogic <- c(TRUE,FALSE,FALSE)
testlogic
class(testlogic) #We start with logical data
testnums <- as.numeric(testlogic) #as.numeric() turns FALSE to 0 and TRUE to 1
testnums
class(testnums) #Now it's numeric
as.logical(testnums) #We can turn numeric to logical data the same way
as.logical(c(0,1,2,3)) #Be careful with non-binary data, though!
```
You may also have your categorical data stored as a `character` type, namely if the categories are **mutually exclusive** or if you have free response data where there were not a finite/fixed set of options. The latter can quickly become thorny to deal with, because every single respondent could have their own unique categories that don't align with any others. Julia Silge and David Robinson's book [*Text Mining with R*](https://www.tidytextmining.com/) is a good primer for tidying and doing basic analysis of text data.
The first type of `character` data (with a limited number of mutually exclusive categories), thankfully, is much easier to deal with. We could turn the `berry` type variable into four separate **indicator variables** with 0s and 1s, if the analysis called for it, using `pivot_wider()`. Because we're turning *one column* into *multiple columns*, we're making the data wider, even though we don't actually want to make it any shorter.
```{r }
berry_data %>%
mutate(presence = 1) %>%
pivot_wider(names_from = berry,
values_from = presence, values_fill = 0) %>%
#The rest is just for visibility:
select(ends_with("berry"), `Sample Name`, everything()) %>%
arrange(lms_overall)
```
And we can see here that `pivot_wider()` increased the number of columns without decreasing the number of rows. By default, it will only combine rows where every column other than the `names_from` and `values_from` columns are identical.
It's often possible to convert between data types by changing the scope of your focus. **Summary tables** of categorical data can include numerical statistics (and this might give you a clue as to which `tidyverse` verb we're going to be using the most in this chapter). The most common kind of summary statistic for categorical variables is the **count** or **frequency**, which is where frequency tables get their name.
```{r using summarize to count the CATA frequency}
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(starts_with("cata_"), sum))
```
Note that the there are some attributes with NA counts. If we reran the analysis with `na.rm = TRUE`, we'd see that these attributes have zero citations for the berries that were `NA` before. This is because some attributes were only relevant for some of the berries. You will have to think about whether and how to include any of these variables in your analysis.
For now, we'll just drop those terms.
```{r frequency table of terms shared by all berries}
berry_data %>%
group_by(`Sample Name`) %>%
summarize(across(starts_with("cata_"), sum)) %>%
select(where(~ none(.x, is.na))) -> berry_tidy_contingency
berry_tidy_contingency
```
## Untidy Analysis
We have our contingency table now, right? That wasn't so hard! Let's do CA!
```{r error with tibbles in base R functions, eval = FALSE}
berry_tidy_contingency %>%
ca()
```
To explain why this error happens, we're going to need to talk a bit more about base R, since `ca` and many other data analysis packages aren't part of the `tidyverse`. Specifically, we need to talk about matrices and row names.
### Untidying Data
Let's take another look at the ways the example `author` dataset are different from our data.
```{r tabular data with and without rownames}
class(author)
dimnames(author)
class(berry_tidy_contingency)
dimnames(berry_tidy_contingency)
```
The example data we're trying to replicate is a `matrix`. This is another kind of tabular data, similar to a `tibble` or `data.frame`. The thing that sets matrices apart is that *every single cell in a matrix has the same data type*. This is a property that a lot of matrix algebra relies upon, like the math underpinning Correspondence Analysis.
Because they're tabular, it's very easy to turn a `data.frame` *into* a `matrix`, like the `ca()` function alludes to in the help files.
```{r turning data frames into matrices (badly)}
as.matrix(berry_tidy_contingency)
```
This is what the `ca()` function does for us when we give it a `data.frame` or `tibble`. It follows the hierarchy of data types, so you'll see that now every single number is surrounded by quotation marks now ("). It's been converted into the least-restrictive data type in `berry_tidy_contingency`, which is `character`.
Unfortunately, you can't do math on `character` vectors.
```{r trying to do math with character data, eval = FALSE}
1 + 2
"1" + "2"
```
It's important to know which row corresponds to which berry, though, so we want to keep the `Sample Name` column *somehow*! This is where `rownames` come in handy, which the `author` data has but our `berry_tidy_contingency` doesn't.
The `tidyverse` doesn't really use row names (it is technically *possible* to have a tibble with `rownames`, but extremely error-prone). The theory is that whatever information you *could* use as `rownames` could be added as another column, and that you may have multiple variables whose combined levels define each row (say the sample and the participant IDs) rather than needing a single less-informative ID unique to each row.
Row names are important to numeric matrices, though, because we can't do math on a matrix of character variables!
The `tidyverse` provides a handy function for this, `column_to_rownames()`:
```{r Turning a Column to Rownames}
berry_tidy_contingency %>%
column_to_rownames("Sample Name") -> berry_contingency_df
class(berry_contingency_df)
dimnames(berry_contingency_df)
```
Note that you have to double-quote ("") column names for `column_to_rownames()`. No idea why. I just do what `?column_to_rownames` tells me.
`berry_contingency_df` is all set for the `ca()` function now, but if you run into any functions (like many of those in `FactoMineR` and other packages) that need matrices, you can always use `as.matrix()` on the results of `column_to_rownames()`.
`column_to_rownames()` will almost always be the cleanest way to untidy your data, but there are some other functions that may be handy if you need a different data format, like a vector. You already know about $-subsetting, but you can also use `pull()` to pull one column out of a `tibble` as a vector using tidyverse syntax, so it fits easily at the end or in the middle of a series of piped steps.
```{r }
berry_data %>%
pivot_longer(starts_with("cata_"),
names_to = "attribute",
values_to = "presence") %>%
filter(presence == 1) %>%
count(attribute) %>%
#Arranges the highest-cited CATA terms first
arrange(desc(n)) %>%
#Pulls the attribute names as a vector, in the order above
pull(attribute)
```
In summary:
- Reshape your data in the `tidyverse` and then change it as needed for analysis.
- If you need a `data.frame` or `matrix` with `rownames` set, use `column_to_rownames()`.
- Use `as.matrix()` carefully, only on tabular data with **all the same data type**.
- `as.matrix()` may not work on `tibble`s *at all* in older versions of the `tidyverse`, so it's always safest to go `tibble` -> `data.frame` -> `matrix`.
- If you need a vector, use `pull()`.
### Data Analysis
Okay, are you ready? Our data is *finally* in the shape and format we needed. You're ready to run multivariate statistics in R.
Ready?
Are you sure?
```{r doing Correspondence Analysis!!}
ca(berry_contingency_df) -> berry_ca_res
```
Yep, that's it.
There are other options you can read about in the help files, if you need a more sophisticated analysis, but most of the time, if I need to change something, it's with the way I'm arranging my data *before* analysis rather than fundamentally changing the `ca()` call.
In general, I find it easiest to do all of the filtering and selecting *on the tibble* so I can use the handy `tidyverse` functions, before I untidy the data, but you can also include extra rows or columns in your contingency table (as long as they're also numbers!!) and then tell the `ca()` function which columns are active and supplemental. This may be an easier way to compare a few different analyses with different variables or levels of summarization, rather than having to make a bunch of different contingency matrices for each.
```{r }
berry_data %>%
select(`Sample Name`, contains(c("cata_", "9pt_", "lms_", "us_"))) %>%
summarize(across(contains("cata_"), ~ sum(.x, na.rm = TRUE)),
across(contains(c("9pt_","lms_","us_")), ~ mean(.x, na.rm = TRUE)), .by = `Sample Name`) %>%
column_to_rownames("Sample Name") %>%
#You have to know the NUMERIC indices to do it this way.
ca(supcol = 37:51)
```
### Retidying Data
What does the `ca()` function actually give us?
```{r structure of ca() results}
berry_ca_res %>%
str()
```
It's a list with many useful things. You can think of a list as kinda like a data frame, because each item has a **name** (like columns in data frames), except they can be any length/size/shape. It's *not* tabular, so you can't use [1,2] for indexing rows and columns, but you *can* use $ indexing if you know the name of the data you're after.
You're unlikely to need to worry about the specifics. Just remember that lists can be $-indexed.
There are a few things we may want out of the list that `ca()` gives us, and we can see descriptions of them in plain English by typing `?ca`. These are the ones we'll be using:
```{r the most commonly-used parts of ca() results}
berry_ca_res$rowcoord #the standard coordinates of the row variable (berry)
berry_ca_res$colcoord #the standard coordinates of the column variable (attribute)
berry_ca_res$sv #the singular value for each dimension
berry_ca_res$sv %>% #which are useful in calculating the % inertia of each dimension
{.^2 / sum(.^2)}
#The column and row masses (in case you want to add your own supplementary variables
#after the fact):
#the row masses
berry_ca_res$rowmass
#the column masses
berry_ca_res$colmass
```
The *main* results of CA are the row and column coordinates, which are in two matrices with the same columns. We can tidy them with the reverse of `column_to_rownames()`, `rownames_to_column()`, and then we can use `bind_rows()` to combine them.
```{r tidying the row and column coordinates}
berry_row_coords <- berry_ca_res$rowcoord %>%
#rownames_to_column() works on data.frames, not matrices
as.data.frame() %>%
#This has to be the same for both to use bind_rows()!
rownames_to_column("Variable")
#Equivalent to the above, and works on matrices
berry_col_coords <- berry_ca_res$colcoord %>%
as_tibble(rownames = "Variable")
berry_ca_coords <- bind_rows(Berry = berry_row_coords,
Attribute = berry_col_coords,
.id = "Type")
berry_ca_coords
```
We could also add on any columns that have one value for each product *and* each attribute (or fill in the gaps with `NA`s). Maybe we want a column with the `rowmass`es and `colmass`es. These are vectors, so it would be handy if we could wrangle them into tibbles first.
You can use either `tibble()` or `data.frame()` to make vectors in the same order into a table. They have basically the same usage. Just make sure you name your columns!
```{r turning multiple vectors into one tibble}
berry_rowmass <- tibble(Variable = berry_ca_res$rownames,
Mass = berry_ca_res$rowmass)
berry_rowmass
```
If you have an already-named vector, `enframe()` is a handy shortcut to making a two-column tibble, but unfortunately this isn't how the `ca` package structures its output.
```{r turning named vectors into tibbles}
named_colmasses <- berry_ca_res$colmass
names(named_colmasses) <- berry_ca_res$colnames
berry_colmass <- named_colmasses %>%
enframe(name = "Variable",
value = "Mass")
berry_colmass
```
And now we can use `bind_rows()` and `left_join()` to jigsaw these together.
```{r remember: tidy-joining multiple tables}
bind_rows(berry_colmass, berry_rowmass) %>%
left_join(berry_ca_coords, by = "Variable")
```
In summary:
- Many analyses will give you **lists** full of every possible piece of data you could need, which aren't necessarily tabular.
- If you need to turn a **tabular data** with `rownames` into a tibble, use `rownames_to_column()` or `as_tibble()`.
- If you need to turn a **named vector** into a two-column table, use `enframe()`
- If you need to turn **multiple vectors** into a table, use `tibble()` or `data.frame()`.
- You can combine multiple tables together using `bind_rows()` and `left_join()`, if you manage your column names and the orders of your vectors carefully.
Like with our *untidying* process, the shape you need to get your data into during *retidying* is ultimately decided by what you want to do with it next. Correspondence Analysis is primarily a graphical method, so next we're going to talk about graphing functions in R in our last substantive chapter. By the end, you will be able to make the plots we showed in the beginning!
Let's take a quick moment to **save our data** before we move on, though, so we don't have to rerun our `ca()` whenever we restart `R` to make more changes to our graph.
As we've shown before, you can save tabular data easily:
```{r saving the tidy and tabular ca results}
berry_ca_coords %>%
write_csv("data/berry_ca_coords.csv")
berry_col_coords %>%
write_csv("data/berry_ca_col_coords.csv")
berry_row_coords %>%
write_csv("data/berry_ca_row_coords.csv")
```
But `.csv` is a **tabular format**, so it's a little harder to save the whole non-tabular list of `berry_ca_res` as a table. There's a lot of stuff we may need later, though, so just in case we can save it as an `.Rds` file:
```{r saving the jagged berry_ca_res as an rds}
berry_ca_res %>%
write_rds("data/berry_ca_results.rds")
```