04_DescriptiveStats.Rmd

# Descriptive statistics

In this Chapter we will go through the main elements of descriptive statistics.

In principle, descriptive statistics is the act of taking a bunch of data and represent them in few numbers, such as mean, median, standard deviation etc.

For a more thorough introduction you can check every introductory stats book: The first couple of chapters will cover this.

For examples in this chapter we will use two datasets: *chili*: where and green tea in combination is added to meals and the resulting ad-libitum consumption is recorded, and *pasta* which is iBuffet data with a survey of preferences for each Person.

```{r}
library(data4consumerscience)
data(chili)
data(pasta)
# we subset to only have the "Pasta with legumes" data. 
pasta <- pasta[pasta$StationName=='Pasta with legumes',]
```

If you do not have imported the *data4consumerscience* package see: [Import data from R-package] If you need to import data see: [How to import data]

## Descriptives for a continuous variable

Below we will first explain the different descriptive measures, and then describe calculations of them. Continuous variables could be sensory scores (e.g. 15 cm continuous line scale), consumption data (in gram) or liking scores (e.g. on a scale from 1 to 9).

### Mean / median

Here you get an introduction to statistics, descriptive statistics and the terms average/mean and median:

```{r, echo=FALSE}
vembedr::embed_youtube("https://www.youtube.com/watch?v=uhxtUt_-GyM&list=PL1328115D3D8A2566&index=1")
```

### Variance

Here you get an introduction to variance:

```{r,echo=FALSE}
vembedr::embed_youtube("https://www.youtube.com/watch?v=Qf3RMGXR-h8&list=PL1328115D3D8A2566&index=4")
```

### Standard deviation

Here you get an introduction to the standard deviation:

```{r, echo=FALSE}
vembedr::embed_youtube("https://www.youtube.com/watch?v=HvDqbzu0i0E&list=PL1328115D3D8A2566&index=5")
```

### Calculations

For the calculations, we use the *chili* dataset. To compute mean, median, variance, standard deviation, etc. there are functions working directly on vectors/columns in you data frame. An easy way to specify this in R is using the **\$**-sign after the name of your data frame, followed by the column name (e.g. `chili$Totalg`). `na.rm = TRUE` removes NA values in the vector. This can be done for all the below functions.

```{r}
mean(chili$Totalg, na.rm = TRUE)
median(chili$Totalg)
var(chili$Totalg)
sd(chili$Totalg)
IQR(chili$Totalg)
summary(chili$Totalg)
```

## Distributions of count data

If the response can take certain values or categories, then the table function is good in getting how many observations there are within a given vector, or combinations of several vectors.

```{r}
table(pasta$I_like_taste_of_pasta_with_legumes)
```

```{r}
table(pasta$I_like_taste_of_pasta_with_legumes, pasta$Did_you_consider_the_proteincontent_of_the_dishes_you_choose)
```

```{r, include=FALSE, eval=FALSE}
#[MANGLER: skal vi overveje at lave dette med CATA data i stedet for? jeg er ikke helt vild med legumes datasættet nemlig.... men kan også bare blive stående]
```

You see that most of the answers are in agreement with question, and that there are no observations in the `Strongly disagree` category.

This is a very high level representation, and we usually want to compare means (or other metrics) between different groups. That is to compute descriptive statistics for subsets of the data. There are two ways to do this. Either using the **aggregate()** function or use the **group_by()** and **summarize()** from the *tidyverse* framework.

Below both is shown to characterize `Totalg` on each of the products

## Aggregate

```{r}
aggregate(chili$Totalg,by = list(chili$Treatment),FUN = mean)
aggregate(chili$Totalg,by = list(chili$Treatment),FUN = sd)
```

**aggregate()** will apply a function to a column or data set, using the list provided to group the column or dataset.

In our case, the column we want to apply our function to is `chili$Totalg`, the functions are `FUN = mean` or `FUN = sd` (finding the mean and standard deviation, respectively) and the column we use for grouping the data is `chili$Treatment`.

This results in a data frame, where one column shows each unique group in `chili$Treatment`, and another column shows the mean or standard deviation of `chili$Totalg` corresponding to each group.

## Tidyverse

... or you can do it using *tidyverse*:

```{r}
library(tidyverse)
tb <- chili %>% 
  group_by(Treatment) %>% # specify which grouping vector to use
  dplyr::summarise(n = n(), # compute n
            mn = mean(Totalg), # compute mean
            s = sd(Totalg), # compute s
            q1 = quantile(Totalg,0.25), # compute lower 25% quartile
            q3 = quantile(Totalg,0.75)) # compute upper 75% quartile
tb  
```

**group_by** will use the column specified to group the data, and the groups are then used when further manipulation is applied to the data frame.

**summarise** will create a summary of the input data, with the columns specified in the code, calculation the statistics also written in the code (e.g. `mn = mean(Totalg)` will create a column with the mean of `Totalg`, for each group in `Treatment`, as specified by `group_by()`).

For more information about data manipulation with tidyverse, see [Edit using Tidyverse] or check out the [tidyverse-homepage](https://www.tidyverse.org/).

Further, lets print the results in a nice looking table using kable() from the knitr package.

```{r}
library(knitr)
kable(tb, caption = 'some caption', digits = 0, format = 'simple')
```

```{r, include=FALSE,eval=FALSE}
#[MORTEN: kan vi lave en anden form for tabel her? Meld gerne ind hvad type data det kunne være?? for den man får ud giver ikke så meget mening set fra vores side. og så skal vi overveje om "tidyverse" er nice to know viden?]

```

... and a plot of it:

```{r}
tb %>% 
  ggplot(data = ., aes(Treatment,mn, ymin = q1, ymax = q3)) + 
  geom_point() + 
  geom_errorbar()
```

`tb %>%`: This line is using the pipe operator (`%>%`), which is native to the *dplyr*-package and is used in R to chain together multiple operations. It takes the result of the expression on the left and pipes it as the first argument to the function on the right. In this case, `tb` is being piped into the subsequent operations.

`ggplot(data = ., aes(Treatment, mn, ymin = q1, ymax = q3))`: This line initiates the creation of a ggplot object using the data from the `tb` dataset. It specifies the aesthetics (visual mappings) for the plot. `Treatment` is mapped to the x-axis, `mn` is mapped to the y-axis, and `q1` and `q3` are representing the first and third quartile respectively, here graphically represented using errorbars.

`geom_point()`: This line adds a layer to the plot with points (or dots) representing the data points. Since no aesthetic mappings are provided in this specific line, it's assumed that the `Treatment` values are plotted against the `mn` values on the x- and y-axes.

`geom_errorbar()`: This line adds another layer to the plot with error bars. The `ymin` and `ymax` aesthetics defined in the `ggplot()` call earlier are used to show the first and third quartile respectively, creating a visual representation of the distribution of the data.

More information on plotting will come in a future chapter (see [Plotting data]). But try and see if you can make sense of the inputs, and what they correspond to in the plot.