
# Data visualization basics with `ggplot2`

```{r setup, include = FALSE, purl = FALSE}
library(tidyverse)
library(ca)
library(ggrepel)
berry_data <- read_csv("data/clt-berry-data.csv")
berry_ca_coords <- read_csv("data/berry_ca_coords.csv")
berry_col_coords <- read_csv("data/berry_ca_col_coords.csv")
berry_row_coords <- read_csv("data/berry_ca_row_coords.csv")
berry_ca_res <- read_rds("data/berry_ca_results.rds")
```

## Built-in plots with the ca package

Normally, we present CA results as a graph. You can use the base R `plot()` function *directly* on the output of `ca()` for quick results.

```{r base R plotting example}
plot(berry_ca_res)
```

You can learn some things from this already: Strawberries 1 & 2 had some appearance defects that made them noticeably different from the others. The blackberries were generally more earthy, while many of the raspberries and strawberries had more fruity/berry flavor. Multivariate data analysis packages often have pretty good default plots built in, and you should take advantage of them to quickly look at your data.

But it's hard to see what's going on with the blueberries. A lot of the text is impossible to read. And some additional color- or shape-coding with a key would be helpful.

## Basics of Tidy Graphics

If you want to make this look publication-friendly by getting rid of overlapping text, changing font size and color, color-coding your berries, etc., you *can* do this with base `R`'s plotting functions. The help files for `?plot.ca` and pages 262-268 in Greenacre's [*Correspondence Analysis in Practice*](https://doi.org/10.1201/9781315369983) demonstrate the wide variety of options available (although Greenacre also explains on pages 283-284 that he used a variety of non-base `R` tools to make figures for the book). In general, however, it's much easier to use the tidyverse package `ggplot2` to manipulate graphs.

`ggplot2` provides a standardized, programmatic interface for data visualization, in contrast to the piecemeal approach common to base `R` graphics plotting.  This means that, while the syntax itself can be challenging to learn, syntax for different tasks differs in logical and predictable ways and it works well together with other `tidyverse` functions and principles (like `select()` and `filter()`).

The schematic elements of a ggplot are as follows:

```{r non-working schematic of a ggplot, eval = FALSE}
# The ggplot() function creates your plotting environment.  We often save it to a variable in R so that we can use the plug-n-play functionality of ggplot without retyping
p <- ggplot(mapping = aes(x = <a variable>, y = <another variable>, ...),
            data = <your data>)

# Then, you can add various ways of plotting data to make different visualizations.
p + 
  geom_<your chosen way of plotting>(...) +
  theme_<your chosen theme>() +
  ...
```

In graphical form, the following diagram ([from VT Professor JP Gannon](https://vt-hydroinformatics.github.io/Plotting.html#our-first-ggplot)) gives an intuition of what is happening:

![Basic ggplot mappings.  Color boxes indicate where the elements go in the function and in the plot.](img/GGplot syntax.png)

Since ggplot2 is a tidyverse package, we do need to (re)tidy the data (and, often, keep reshaping it after that). In general, `ggplot2` works best with data in "long" or "tidy" format: one row for every observation or point.
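
As a minimal sketch of what "long" means here, the made-up table below (a hypothetical `wide_ratings`, not the berry data) has one column per attribute. `pivot_longer()` turns it into one row per sample-attribute observation, which is the shape `ggplot2` is happiest with.

```{r sketch of wide versus long data}
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per sample, one column per rated attribute
wide_ratings <- tibble(
  sample = c("A", "B"),
  sweet  = c(6, 4),
  sour   = c(3, 7)
)

# Long format: one row per sample-attribute combination
long_ratings <- wide_ratings %>%
  pivot_longer(cols = c(sweet, sour),
               names_to = "attribute",
               values_to = "rating")

long_ratings
```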

By default, the scatterplot function needs one row per point with one column of coordinates for each axis (normally x and y):

```{r a first ggplot}
berry_data %>%
  # Here we set up the base plot
  ggplot(mapping = aes(x = lms_appearance, y = lms_overall)) + 
   # Here we tell our base plot to add points
  geom_point()                          
```

This doesn't look all that impressive--partly because the data being plotted itself isn't that sensible, and partly because we haven't made many changes. If we want to plot some **summary** data, maybe one point per berry sample, we can use the familiar `tidyverse` functions to reshape our data and pipe it into `ggplot()`.

``` {r basic ggplot2 scatterplot}
berry_data %>%
  select(`Sample Name`, contains(c("9pt_","lms_","us_"))) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)), .by = `Sample Name`) -> 
  berry_average_likings

berry_average_likings %>%
  nrow()

berry_average_likings %>%
  ggplot(aes(x = `9pt_overall`, y = `lms_overall`)) +
   #23 points, one per row
  geom_point()
```

This plot has fewer overlapping points and less noise, so it's a lot more informative. But it still doesn't look that good, with the underscores in the axis labels, the printer-unfriendly grey background, etc. Let's start looking at the pieces that make up a `ggplot` so we can change them.

### The `aes()` function and `mapping = ` argument

The `ggplot()` function takes two arguments that are essential, as well as some others you'll only use in specific cases.  The first, `data = `, is straightforward, and you'll usually be passing data to the function at the end of some pipeline using `%>%`.

The second, `mapping = `, is less clear.  This argument requires the `aes()` function, which can be read as the "aesthetic" function.  The way that this function works is quite complex, and really not worth digging into here, but it's the place where you **tell `ggplot()` what part of your data is going to connect to what part of the plot**.  So, if we write `aes(x = rating)`, we can read this in our heads as "the values of x will be mapped from the 'rating' column".

This sentence tells us the other important thing about `ggplot()` and the `aes()` mappings: **mapped variables each have to be in their own column**.  This is another reason that `ggplot()` requires tidy data.
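
To make the idea of a mapping concrete, here is a small sketch on made-up data (the `toy_ratings` table and its column names are assumptions for illustration only). Printing the result of `aes()` on its own shows which column will feed which part of the plot.

```{r sketch of how aes maps columns}
library(ggplot2)

# Hypothetical toy data: each mapped variable lives in its own column
toy_ratings <- data.frame(
  rating = c(2, 5, 7, 8),
  liking = c(10, 35, 60, 80),
  berry  = c("raspberry", "raspberry", "blueberry", "blueberry")
)

# aes() only records the column-to-aesthetic pairing; nothing is drawn yet
toy_mapping <- aes(x = rating, y = liking, color = berry)
toy_mapping

# The same mapping object can then be handed to ggplot()
toy_mapping_plot <- ggplot(data = toy_ratings, mapping = toy_mapping) +
  geom_point()
toy_mapping_plot
```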

### Adding layers with `geom_*()` functions

In the above example, we added (literally, using `+`) a function called `geom_point()` to the base `ggplot()` call.  This is functionally a "layer" of our plot that tells `ggplot2` how to actually visualize the elements specified in the `aes()` function--in the case of `geom_point()`, we create a point for each row's combination of `x = 9pt_overall` and `y = lms_overall`.

```{r what we are plotting in this example}
berry_average_likings %>%
  select(lms_overall, `9pt_overall`)
```

There are many `geom_*()` functions in `ggplot2`, and many others defined in other accessory packages like `ggrepel`. These determine what symbols and spatial arrangements are used to represent each row, and are the heart of visualizations. Some `geom_*()` functions do some summarization of their own, making them more appropriate for raw data.

We can see this by swapping out the `geom_*()` in our initial scatterplot on the Labeled Magnitude Scale liking data:

```{r changing the geom changes the way the data map}
berry_data %>%
  ggplot(mapping = aes(x = lms_appearance, y = lms_overall)) + 
  geom_smooth()
```

`geom_smooth()` fits a smoothed line to our data. By default, it will use either Local Polynomial Regression or the Generalized Additive Model, depending on the size of your data (here, you can see that it chose `gam`, the Generalized Additive Model). You can specify models manually, using the `method` argument of `geom_smooth()`:

```{r linear regression with geom_smooth}
berry_data %>%
  ggplot(mapping = aes(x = lms_appearance, y = lms_overall)) + 
  geom_smooth(method = "lm")
```

We can also combine layers, as the term implies:

```{r geoms are layers in a plot}
berry_data %>%
  ggplot(mapping = aes(x = lms_appearance, y = lms_overall)) + 
  geom_point() +
  geom_smooth()
```

Note that we don't need to tell *either* `geom_smooth()` or `geom_point()` what `x` and `y` are--they "inherit" them from the `ggplot()` function to which they are added (`+`), which defines the plot itself.
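
Inheritance only flows from `ggplot()` down to the layers, not sideways between layers. The sketch below uses the built-in `mtcars` data as a stand-in for the berry data: `color` is mapped inside `geom_point()` only, so the smoother never sees it and fits a single line.

```{r sketch of plot-level versus layer-level aes}
library(ggplot2)

# mtcars ships with R; it stands in for the berry data here
inherit_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # color is mapped only in this layer, so only the points are colored
  geom_point(aes(color = factor(cyl))) +
  # this layer inherits just x and y, so it draws one line for all the data
  geom_smooth(method = "lm", se = FALSE)

inherit_plot
```

Moving `color = factor(cyl)` up into the `ggplot()` call would instead give every layer the grouping, and `geom_smooth()` would fit one line per cylinder count.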

What other arguments can be set to aesthetics?  Well, we can set other visual properties like **color**, **size**, **transparency** (called "alpha"), and so on.  For example, let's try to look at whether there is a relationship between berry condition (proxied here by `cata_appearance_bruised`) and overall liking.

```{r here are some other parts of the plot we can control with data}
berry_data %>%
  #ggplot will drop NA values for you, but it's good practice to
  #think about what you want to do with them:
  drop_na(lms_overall, cata_appearance_bruised) %>%
  #color, shape, linetype, and other aesthetics that would add a key
  #don't like numeric data types. The quick-and-dirty solution:
  mutate(across(starts_with("cata_"), as.factor)) %>%
  ggplot(mapping = aes(x = lms_appearance, y = lms_overall,
                       color = cata_appearance_bruised)) +
  geom_point(alpha = 1/4) + 
  geom_smooth(se = FALSE) +
  theme_bw()
```

We can see that more of the blue dots, which mark samples with a bruised appearance, sit in the lower left of the figure: bruising has a negative influence on the ratings of overall liking *and* appearance.
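
The `as.factor` conversion in that chunk matters because `ggplot2` picks a continuous color gradient for numeric columns and a discrete palette for factors. A rough sketch on made-up data (the `toy_cata` table is hypothetical) shows the difference:

```{r sketch of numeric versus factor color}
library(ggplot2)

# Hypothetical toy data with a 0/1 indicator, like a CATA column
toy_cata <- data.frame(
  appearance = c(10, 25, 40, 55, 70, 85),
  overall    = c(15, 20, 45, 50, 75, 80),
  bruised    = c(1, 1, 0, 1, 0, 0)
)

# Left as a number, the indicator gets a continuous color bar in the key
numeric_color_plot <- ggplot(toy_cata,
                             aes(x = appearance, y = overall, color = bruised)) +
  geom_point()
numeric_color_plot

# Converted to a factor, it gets two distinct colors and a discrete key
factor_color_plot <- ggplot(toy_cata,
                            aes(x = appearance, y = overall,
                                color = factor(bruised))) +
  geom_point()
factor_color_plot
```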

### Geoms for categorical, ordinal, or unevenly-distributed data

You may notice that we've been using the Labeled Magnitude Scale data so far, rather than the data from the other two scales. That's because adding `geom_point()` to the 9-point hedonic scale data looks like this:

```{r an unreadable scatterplot}
berry_data %>%
  ggplot(mapping = aes(x = `9pt_appearance`, y = `9pt_overall`)) +
  geom_point()
```

Each of those points actually represents many individual ratings of berries, possibly hundreds. There are almost certainly fewer people giving the berries a 1 for appearance and a 9 for overall liking than there are people rating each a 6. This also makes it a good demonstration of how `ggplot` handles the transparency of overlapping points:

```{r transparent points stack on top of each other to make less transparent points}
berry_data %>%
  ggplot(mapping = aes(x = `9pt_appearance`, y = `9pt_overall`)) +
  geom_point(alpha = 0.05) 
```

But the actual solution to this problem, instead of the hacky pseudo-heat map, is `geom_jitter()`, which applies a small random x and y offset to each point:

```{r using geom_jitter for overlapping points}
berry_data %>%
  ggplot(mapping = aes(x = `9pt_appearance`, y = `9pt_overall`)) +
  geom_jitter(alpha = 1/4) +
  geom_smooth(method = "lm", se = FALSE)
```

You can see there are some overlapping points left, but this gives us a much better idea of the shape, along with the summarizing `geom_smooth()`. Since there are only 9 possible values on the hedonic scale while the continuous Labeled Magnitude Scale allows people to select numbers in-between scale labels, `geom_jitter()` can be thought of as simulating this random human scale usage after the fact.
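
If the default amount of jitter is too much or too little, you can set it yourself. This sketch uses simulated 9-point ratings (the `toy_hedonic` table is made up) and `position_jitter()`, whose `width` and `height` cap the random offset in each direction and whose `seed` makes the jitter repeatable between renders.

```{r sketch of controlling jitter}
library(ggplot2)

# Hypothetical toy ratings on a 9-point scale, with many exact ties
set.seed(1)
toy_hedonic <- data.frame(
  appearance = sample(1:9, 200, replace = TRUE),
  overall    = sample(1:9, 200, replace = TRUE)
)

jitter_plot <- ggplot(toy_hedonic, aes(x = appearance, y = overall)) +
  # points move at most 0.2 units left/right and up/down from the true value
  geom_point(alpha = 1/4,
             position = position_jitter(width = 0.2, height = 0.2, seed = 123))

jitter_plot
```

Keeping the offset well under 0.5 means the clouds around neighboring scale points never touch, so you can still tell which rating each point came from.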

If you'd like to look at the variation of a single categorical or discrete variable, a bar plot is more appropriate. `geom_bar()` is another **summarizing** geom, similar to `geom_smooth()`, as it expects a discrete `x` variable and *one row per observation*. It will **count** the number of rows in each **group** and use those counts to plot the bar heights, one bar per group. (Note that you can override or tweak this behavior using additional arguments.)

`geom_histogram()` is the version for numeric data, which will also calculate bins for you.

```{r geom_bar and geom_histogram}
#geom_bar() is for when you already have discrete data, it just counts:
berry_data %>%
  ggplot(aes(x = cata_taste_berry)) +
  geom_bar()

berry_data %>%
  ggplot(aes(x = `9pt_overall`)) +
  geom_bar()

#and geom_histogram() is for continuous data, it counts and bins:
berry_data %>%
  ggplot(aes(x = `lms_overall`)) +
  geom_histogram()
```
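
When you don't specify bins, `geom_histogram()` falls back to 30 of them and prints a message suggesting you pick a better value. A minimal sketch of choosing your own, using the built-in `faithful` data as a stand-in for a continuous liking scale:

```{r sketch of setting histogram bins}
library(ggplot2)

# faithful ships with R; waiting is a continuous variable (minutes)
hist_plot <- ggplot(faithful, aes(x = waiting)) +
  # binwidth is in the units of x, so each bar spans 5 minutes;
  # boundary lines the bin edges up with multiples of 5
  geom_histogram(binwidth = 5, boundary = 40)

hist_plot
```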

### Arguments inside and outside of `aes()`

In some previous plots, we've seen some aesthetic elements specified directly inside of geom functions like `geom_point(alpha = 1/4)`, without using `aes()` to **map** a variable to this aesthetic. If we want every point or geom to have the same, fixed look (the same transparency, the same color, etc), we *don't* wrap it in the `aes()` function. `aes()` ties a visual element to a variable.

Note that we *can* map `alpha` to a variable, just like `color`:

```{r using the aes function}
berry_data %>%
  drop_na(lms_overall, cata_appearance_bruised) %>%
  ggplot(aes(x = lms_appearance, y = lms_overall)) + 
  # We can set new aes() mappings in individual layers, as well as the plot itself
  geom_point(aes(alpha = jar_size)) +
  #Unlike color, alpha will accept numeric variables for mapping
  theme_bw()
```

Color would be a better way to represent this relationship, however, as semitransparent points can overlap and appear indistinguishable from a single, darker point.
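
A common slip is putting a fixed value *inside* `aes()`. The sketch below, on the built-in `mtcars` data, contrasts the two: outside `aes()` the word `"blue"` is a setting, while inside `aes()` it is treated as a one-level categorical variable that happens to be called "blue".

```{r sketch of setting versus mapping color}
library(ggplot2)

# Outside aes(): a fixed setting, so every point is drawn in blue
set_color_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")
set_color_plot

# Inside aes(): "blue" is treated as data, so the points get the default
# discrete color and the plot gains a legend with a single entry, "blue"
mapped_color_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))
mapped_color_plot
```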

### Using `theme_*()` to change visual options quickly

In the last plot, notice that we have changed from the default (and to my mind unattractive) grey background of `ggplot2` to a black and white theme.  We did this by adding a `theme_bw()` call to the chain of layers.  `ggplot2` includes a number of default `theme_*()` functions, and you can get many more through other `R` packages.  They can have subtle to dramatic effects:

```{r using the theme functions}
berry_data %>%
  drop_na(lms_overall, cata_appearance_bruised) %>%
  ggplot(aes(x = lms_appearance, y = lms_overall)) + 
  geom_point() +
  theme_void()
```

You can also edit every last element of the plot's theme using the base `theme()` function, which is powerful but a little bit tricky to use.
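
As a small taste of `theme()`, here is a sketch on the built-in `mtcars` data that removes the minor gridlines and enlarges the axis titles. It also uses `labs()` to swap raw column names for readable axis titles, which is one way to deal with underscores in labels. The specific sizes and label text are arbitrary choices for illustration.

```{r sketch of theme and labs tweaks}
library(ggplot2)

tweak_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # labs() replaces the raw column names with readable axis titles
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_bw() +
  # theme() goes after the complete theme so these tweaks are not overwritten
  theme(panel.grid.minor = element_blank(),
        axis.title = element_text(size = 14))

tweak_plot
```

Order matters here: adding `theme_bw()` *after* `theme()` would reset the elements you just changed.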

### Changing aesthetic elements with `scale_*()` functions

But what about the color of the *points*? None of the themes change the colors used for drawing geoms, or the color scales used for showing categories or additional variables.

The symbols, colors, or other signifiers that represent the variables mapped in `aes()` are controlled by the `scale_*()` functions. In my experience, the most frequently encountered scales are those for color: either `scale_fill_*()` for solid objects (like the bars in a histogram) or `scale_color_*()` for lines and points (like the outlines of the histogram bars).

Scale functions work by telling `ggplot()` *how* to map aesthetic variables to visual elements. The viridis palettes are a good starting place for color and fill scales: the `scale_color_viridis_*()` and `scale_fill_viridis_*()` functions built into `ggplot2` provide color-blind-friendly and (theoretically) print-safe color palettes.

```{r ggplots are R objects}
# To effectively plot all of the cata attributes on a bar chart, the data
# needs to be longer (one geom_bar() per group, not per column!)
# and we'll remove columns with NAs for now.
berry_cata_long <- 
  berry_data %>%
  select(where(~none(.x, is.na))) %>%
  pivot_longer(starts_with("cata_"),
               names_to = c(NA, "Modality", "Attribute"), names_sep = "_",
               values_to = "Presence")

# And now we can use this for plotting
p <- 
  berry_cata_long %>%
  filter(Presence == 1) %>%
  ggplot(aes(x = Attribute, fill = berry, color = Modality)) + 
  geom_bar(position = position_dodge()) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

p
```

We can take a saved plot (like `p`) and use scales to change how it is visualized.

```{r we can modify stored plots after the fact}
p +
  scale_fill_viridis_d() +
  scale_color_grey(start = 0, end = 0.8) #For bar plots, color is the outline!
```

`ggplot2` has a broad range of built-in options for scales, but there are many others available in add-on packages that build on top of it.  You can also build your own scales using the `scale_*_manual()` functions, in which you supply a vector of `values` with at least one entry per level of your mapped variable in order to set up the visual assignment.  That sounds jargon-y, so here is an example:

```{r another example of posthoc plot modification}
# We'll pick 10 random colors from the colors R knows about
random_colors <- print(colors()[sample(x = 1:length(colors()), size = 10)])

p + 
  scale_fill_manual(values = random_colors) +
  scale_color_manual(breaks = c("taste", "appearance"),
                     values = c("lightgrey", "black"))
```
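
Random colors are fun, but for real figures you usually want each level pinned to a specific color. One way to do that, sketched here on made-up data (the `toy_berries` table and its level names are hypothetical), is to pass a *named* vector to `values`, so the assignment doesn't depend on the order of the levels.

```{r sketch of a named manual palette}
library(ggplot2)

# Hypothetical toy data: one row per observation, as geom_bar() expects
toy_berries <- data.frame(
  berry = rep(c("Blackberry", "Blueberry", "Raspberry", "Strawberry"),
              times = c(3, 5, 2, 4))
)

# Naming each color ties it to a level, whatever order the levels come in
berry_palette <- c("Strawberry" = "#f089cb",
                   "Blackberry" = "#2d0645",
                   "Raspberry"  = "#7a1414",
                   "Blueberry"  = "#3e17e8")

palette_plot <- ggplot(toy_berries, aes(x = berry, fill = berry)) +
  geom_bar() +
  scale_fill_manual(values = berry_palette) +
  theme_bw()

palette_plot
```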

### Finally, `facet_*()`

The last powerful tool I want to show off is the ability of `ggplot2` to make what [Edward Tufte called "small multiples"](https://socviz.co/groupfacettx.html#facet-to-make-small-multiples): breaking out the data into multiple, identical plots by some categorical classifier in order to show trends more effectively.

The bar plot we were just looking at is quite busy, even without displaying all 36 CATA questions. Instead, let's break the data out into "small multiple" facet plots, one per CATA attribute, to look at trends between berries one attribute at a time.

```{r splitting the plot into 12 small multiples}
berry_cata_long %>%
  filter(Presence == 1) %>%
  ggplot(aes(x = berry)) + 
  geom_bar() +
  theme_bw() +
  facet_wrap(~ Attribute) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

```

We can still compare the facets in this case, because they all share X and Y axes. The "none" attribute was checked much less often than the other attributes, for example. We can also see that uneven color was a more common problem among the raspberries and strawberries than the blueberries and blackberries, and that strawberries and blackberries more commonly had fermented flavor.
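
Shared axes are the default (`scales = "fixed"`). If a rarely-used category gets squashed flat, `facet_wrap()` can give each panel its own axis range instead. The sketch below uses the `mpg` data that ships with `ggplot2`, with `class` standing in for a CATA attribute.

```{r sketch of free facet scales}
library(ggplot2)

# mpg ships with ggplot2; class stands in for a CATA attribute here
free_facet_plot <- ggplot(mpg, aes(x = drv)) +
  geom_bar() +
  # scales = "free_y" lets each panel set its own y-axis range
  facet_wrap(~ class, scales = "free_y") +
  theme_bw()

free_facet_plot
```

The trade-off is that bar heights are no longer comparable across panels, so this is best saved for cases where the pattern *within* each panel is what you care about.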

It would be good to go a step further and plot the percentages, rather than the raw counts, since not every berry had the exact same number of participants. We can use `geom_col()` instead of `geom_bar()` to do our own summarizing:

```{r more control over bar plots with geom_col}
berry_cata_long %>%
  group_by(berry, Attribute, Modality) %>%
  summarize(proportion = mean(Presence)) %>%
  ggplot(aes(x = berry, y = proportion)) + 
  geom_col() +
  theme_bw() +
  facet_wrap(~ Attribute) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

```

Both plots show that blueberries and raspberries are more commonly described by "berry flavor", but looking at the proportions instead of the raw counts reveals that there aren't strong differences in floral flavor across the berry types.

### The ggplot rabbithole

Like many things we're introducing today, you can make infinitely-complicated graphs using these same basic semantics. `ggplot2` is a world of exceptions! You could eventually end up having to do something like this, where each `geom_*` has different `data`:

```{r a complicated ggplot that gives different data to each geom}
ggplot() +
  geom_segment(aes(xend = Dim1, yend = Dim2), x = 0, y = 0,
               arrow = arrow(length = unit(0.25, "cm")),
               data = berry_col_coords) +
  geom_text_repel(aes(x = Dim1, y = Dim2, label = Variable, color = Type,
                      fontface = ifelse(Type == "Attribute", "italic", "plain")),
                  data = berry_ca_coords) +
  scale_color_manual(breaks = c("Attribute","Berry"),
                     values = c("lightgrey","maroon")) +
  theme_bw()
```

It's the fact that the arrows need `xend` and `yend` instead of `x` and `y` like the text, as well as the fact that there are only arrows for half the data, that make it easier to give each geom its own `data`. There are simpler (and possibly better) ways to display the same information as this plot, which we'll cover next.

If you ever find yourself tearing your hair out over a complicated plot, remember this section. Some resources you may find helpful for further reading and troubleshooting include:

1.  Kieran Healy's "[Data Visualization: a Practical Introduction](https://socviz.co/index.html#preface)".
2.  The plotting section of [R for Data Science](https://r4ds.had.co.nz/data-visualisation.html).
3.  Hadley Wickham's core reference textbook on [ggplot2](https://ggplot2-book.org/).

## Better CA plots with ggplot2

We need to think about what we want to plot. The CA maps are **scatterplots** where each point is a **sample** (row variable) or **attribute** (column variable). These coordinates are in the list that the `ca()` function outputs, but in two separate matrices. We already learned how to combine them into *one* where the columns are *axes* and the rows are the *variables* from our initial dataset, with a column specifying whether each row is a sample or an attribute variable.

```{r remember what our tidy ca results look like}
berry_ca_coords
```

This will get us pretty far.

``` {r a basic ca map with ggplot2}
berry_ca_coords %>%
  mutate(Variable = str_remove(Variable, "cata_")) %>%
  ggplot(aes(x = Dim1, y = Dim2, color = Type, label = Variable)) +
  geom_hline(color="black", yintercept = 0) +
  geom_vline(color="black", xintercept = 0) +
  geom_text() +
  theme_bw() +
  xlab("Dimension 1") +
  ylab("Dimension 2")
```

`geom_text()` is similar to `geom_point()`, but instead of having a point with a given `shape`, it places **text** on the plot which you can pull directly from your data using the `label` aesthetic. We can make this even more readable using `geom_text_repel()`, a very similar geom out of the `ggrepel` package:

``` {r a basic ca map with geom_repel}
berry_ca_coords %>%
  mutate(Variable = str_remove(Variable, "cata_")) %>%
  ggplot(aes(x = Dim1, y = Dim2, color = Type, label = Variable)) +
  geom_hline(color="black", yintercept = 0) +
  geom_vline(color="black", xintercept = 0) +
  geom_text_repel() +
  theme_bw() +
  xlab("Dimension 1") +
  ylab("Dimension 2")
```
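
`geom_text_repel()` places labels with a randomized algorithm, so the layout can shift a little each time you render. A hedged sketch of a few arguments that help, on made-up coordinates (the `toy_coords` table is hypothetical and deliberately crowded):

```{r sketch of geom_text_repel options}
library(ggplot2)
library(ggrepel)

# Hypothetical toy coordinates with deliberately crowded labels
toy_coords <- data.frame(
  Dim1     = c(0.10, 0.12, 0.11, -0.40, 0.50),
  Dim2     = c(0.20, 0.21, 0.19, 0.30, -0.35),
  Variable = c("fruity", "berry", "floral", "earthy", "bruised")
)

repel_plot <- ggplot(toy_coords, aes(x = Dim1, y = Dim2, label = Variable)) +
  geom_point() +
  # seed fixes the random starting layout so labels land in the same place
  # on every render; max.overlaps = Inf never drops a crowded label;
  # box.padding adds breathing room around each label
  geom_text_repel(seed = 42, max.overlaps = Inf, box.padding = 0.5) +
  theme_bw()

repel_plot
```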

With a little bit of extra work, we can also add the % inertia to each dimension label, tweak the colors with `scale_color_manual()`, and make the text a bit bigger.

```{r a much more fine-tuned ca map with ggplot2}

berry_ca_res$sv %>%
  {str_c("Dimension ", 1:length(.), " (", round(100 * .^2 / sum(.^2), 1), "% Inertia)")} ->
  berry_cata_ca_dimnames

berry_ca_coords %>%
  mutate(Variable = str_remove(Variable, "cata_"),
         Variable = str_replace(Variable, "Strawberry", "Strawberry "),
         font = ifelse(Type == "Attribute", "italic", "plain")) %>%
  separate(Variable, c("Var_Major", "Var_Minor"), remove = FALSE) %>%
  ggplot(aes(x = Dim1, y = Dim2, fontface = font, color = Var_Major, label = Variable)) +
  geom_hline(color="black", yintercept = 0) +
  geom_vline(color="black", xintercept = 0) +
  geom_point() +
  geom_text_repel(size = 5) +
  theme_bw() +
  xlab(berry_cata_ca_dimnames[1]) +
  ylab(berry_cata_ca_dimnames[2]) +
  scale_color_manual(values = c("appearance" = "#bababa",
                                "taste" = "#7f7f7f",
                                "Blackberry" = "#2d0645",
                                "Blueberry" = "#3e17e8",
                                "raspberry" = "#7a1414",
                                "Strawberry" = "#f089cb"))

```
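
The axis labels in that chunk come from a short calculation that is easy to miss inside the pipe: each dimension's inertia is its squared singular value, and the percentage is that value's share of the total. Here is the same arithmetic spelled out step by step on two made-up singular values (`toy_sv` is hypothetical, standing in for `berry_ca_res$sv`).

```{r sketch of the percent inertia calculation}
# Hypothetical singular values, standing in for berry_ca_res$sv
toy_sv <- c(0.4, 0.3)

# Each dimension's inertia is its squared singular value
toy_inertia <- toy_sv^2

# Percent of the summed inertia accounted for by each dimension
toy_pct <- round(100 * toy_inertia / sum(toy_inertia), 1)

# Build one axis label per dimension
toy_dimnames <- paste0("Dimension ", seq_along(toy_sv),
                       " (", toy_pct, "% Inertia)")
toy_dimnames
```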

This plot still isn't really what I'd call "publication ready". A lot of the final tweaking will depend on the exact size you want, but regardless I'd probably continue adjusting the labels, zoom the plot out a bit, and consider only displaying CATA terms with a high enough `berry_ca_res$colinertia` so the plot is a bit less cluttered.

You can tweak forever. And I'd encourage you to go ahead and try to do whatever you can think of, right now, to make this graph more readable!

For now, this is the culmination of your new data-wrangling, analysis, and graphing skills. We can see which berries are more fresh-tasting and have no notable appearance attributes (blueberries and blackberries), which berries are the worst-looking (strawberries 2 and 6), and we could identify berries anywhere along our roughly earthy/fermented to fruity/berry Dimension 2 (from blackberry 2 to raspberries 3 and 6).

This is also the same set of skills that you'll need for PCA, MDS, DISTATIS, ANOVA, text analysis, and any other computational or statistical task in R.