Add files via upload
new scripts for don't use boxplot for binomial data and histogram for small n.
cxli233 authored Dec 10, 2024
1 parent 70eb815 commit dc346fb
Showing 2 changed files with 221 additions and 0 deletions.
97 changes: 97 additions & 0 deletions Scripts/BoxPlot_for_Binomial.Rmd
@@ -0,0 +1,97 @@
---
title: "BoxPlot_for_Binomial"
author: "Chenxin Li"
date: "2024-12-10"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Friends don't let friends use boxplots for bimodal data

# Packages
```{r}
library(tidyverse)
library(RColorBrewer)
library(ggbeeswarm)
library(patchwork)
```

## Data
```{r}
set.seed(666)
```

```{r}
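# group1: unimodal, a single normal distribution centered at 5
data1 <- data.frame(
  response = rnorm(n = 100, mean = 5, sd = 2)
) %>%
  mutate(group = "group1")

# group2: bimodal, an equal mixture of two normals centered at 2.5 and 7.5
data2 <- data.frame(
  response = c(
    rnorm(n = 50, mean = 2.5, sd = 1),
    rnorm(n = 50, mean = 7.5, sd = 1)
  )) %>%
  mutate(group = "group2")

# group3: trimodal, an equal mixture of three normals centered at 2, 5, and 8
data3 <- data.frame(
  response = c(
    rnorm(n = 33, mean = 2, sd = 0.5),
    rnorm(n = 33, mean = 5, sd = 0.5),
    rnorm(n = 33, mean = 8, sd = 0.5)
  )) %>%
  mutate(group = "group3")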
```

## Bad example
```{r}
Box <- rbind(
  data1,
  data2,
  data3
) %>%
  ggplot(aes(x = group, y = response)) +
  geom_boxplot(aes(fill = group), alpha = 0.8, width = 0.7) +
  scale_fill_manual(values = brewer.pal(8, "Set2")) +
  labs(title = "Very similar!") +
  theme_classic()

Box
```
## Good example
```{r}
Dots <- rbind(
  data1,
  data2,
  data3
) %>%
  ggplot(aes(x = group, y = response)) +
  geom_quasirandom(aes(color = group), alpha = 0.8) +
  scale_color_manual(values = brewer.pal(8, "Set2")) +
  labs(title = "I guess not!") +
  theme_classic()

Dots
```
# Wrap them together
```{r}
wrap_plots(
  Box, Dots,
  nrow = 1
) &
  theme(legend.position = "none")

ggsave("../Results/BoxPlots_for_binomial.svg", width = 5, height = 2.5)
ggsave("../Results/BoxPlots_for_binomial.png", width = 5, height = 2.5)
```
Before making a box plot, check the distribution of your data.
Because box plots show only the median and quartiles,
they cannot reveal bimodal data (or, by extension, data with more than two modes).
Plotting all the data points using `geom_quasirandom()` from the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) is the best practice for small to moderate sample sizes (fewer than tens of thousands of observations),
as dots are robust to small sample sizes,
whereas distribution-based graphics such as violin plots and histograms are not.
See [this section](https://github.com/cxli233/FriendsDontLetFriends#2-friends-dont-let-friends-make-violin-plots-for-small-sample-sizes) and [this section](https://github.com/cxli233/FriendsDontLetFriends/tree/main?tab=readme-ov-file#friends-dont-let-friends-use-histogram-for-small-sample-sizes) for details.
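
To see why the boxes hide the modes, it may help to look at the numbers a box plot is drawn from. The sketch below is not part of the original script: it regenerates three groups with the same parameters using base R only (the names `groups` and `five_num` are made up for illustration) and prints the minimum, quartiles, and maximum of each group. The random draws will not match the tidyverse chunk above exactly.

```r
set.seed(666)

# Same distribution parameters as data1, data2, and data3 in the script
groups <- list(
  group1 = rnorm(n = 100, mean = 5, sd = 2),
  group2 = c(rnorm(n = 50, mean = 2.5, sd = 1),
             rnorm(n = 50, mean = 7.5, sd = 1)),
  group3 = c(rnorm(n = 33, mean = 2, sd = 0.5),
             rnorm(n = 33, mean = 5, sd = 0.5),
             rnorm(n = 33, mean = 8, sd = 0.5))
)

# Minimum, quartiles, and maximum: roughly the only information a box plot encodes
five_num <- sapply(groups, quantile, probs = c(0, 0.25, 0.5, 0.75, 1))
round(five_num, 2)
```

The medians (the `50%` row) should land close to 5 for all three groups, and nothing in the table says whether a group has one, two, or three modes. That information only becomes visible when the individual points are drawn.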
124 changes: 124 additions & 0 deletions Scripts/Histogram_for_small_n.Rmd
@@ -0,0 +1,124 @@
---
title: "Histogram_for_small_n"
author: "Chenxin Li"
date: "2024-12-10"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Friends don't let friends use histograms for small sample sizes

# Packages
```{r}
library(tidyverse)
library(RColorBrewer)
library(ggbeeswarm)
library(viridis)
library(patchwork)
```

# Data
```{r}
set.seed(666)
```

```{r}
n10 <- data.frame(
  response = rnorm(n = 10)
) %>%
  mutate(group = "n = 10")

n100 <- data.frame(
  response = rnorm(n = 100)
) %>%
  mutate(group = "n = 100")

n1000 <- data.frame(
  response = rnorm(n = 1000)
) %>%
  mutate(group = "n = 1000")
```

# Graphs
```{r}
bins10 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 10, width = 0.7, color = "white", alpha = 0.8,
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[1]) +
  labs(title = "10 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins10
```

```{r}
bins30 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 30, width = 0.7, color = "white",
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[4]) +
  labs(title = "30 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins30
```

```{r}
bins50 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 50, width = 0.7, color = "white",
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[7]) +
  labs(title = "50 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins50
```

# Wrap them
```{r}
wrap_plots(
  bins10, bins30, bins50 +
    labs(caption = "\nWow, the appearance does change with different bin numbers."),
  ncol = 3
) &
  theme(plot.caption = element_text(size = 10))

ggsave("../Results/Histogram_for_small_n.svg", height = 6, width = 8)
ggsave("../Results/Histogram_for_small_n.png", height = 6, width = 8)
```
I've seen histograms proposed as a replacement for bar plots.
However, a serious caveat is that histograms are not robust to the number of bins for small (and even moderate) sample sizes.
What is a histogram anyway? To make one, we first divide the range of the data into a defined number of bins.
Then we count how many observations fall into each bin and graph the counts.
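
The two steps just described can be sketched by hand in base R. This is a simplified illustration, not what `geom_histogram()` does internally: ggplot2 places its bin boundaries differently, and the variable names (`x`, `n_bins`, `breaks`, `bin_index`, `counts`) and the choice of 5 bins are made up for this example.

```r
set.seed(666)
x <- rnorm(n = 10)   # a small sample, as in the n = 10 panel
n_bins <- 5          # arbitrary choice for illustration

# Step 1: divide the observed range into equal-width bins
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Step 2: assign each observation to a bin and count observations per bin.
# all.inside = TRUE keeps the largest value in the last bin.
bin_index <- findInterval(x, breaks, all.inside = TRUE)
counts <- tabulate(bin_index, nbins = n_bins)

data.frame(
  bin_left  = head(breaks, -1),
  bin_right = tail(breaks, -1),
  count     = counts
)
```

Changing `n_bins` moves every bin boundary, so the same 10 observations get redistributed into a different set of bars each time.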

In this example, I sampled _the same_ normal distribution 3 times with different sample sizes (n = 10, 100, and 1000).
Even though the samples came from _the same_ normal distribution, the histograms look quite different depending on the number of bins.
To showcase this, I plotted histograms with 10, 30, and 50 bins.

First, a histogram makes no sense for small sample sizes. With small sample sizes (n < 30), the better practice is to graph all data points.
Second, the shape of the histogram is robust to the number of bins only when the sample size is fairly large (like 1000).
Even at n = 100, the appearance of the histogram can change drastically as the number of bins changes.
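
One rough way to put a number on this is to count how many bins end up empty for each combination of sample size and bin number. The sketch below uses base R only. The helper `count_bins()` is hypothetical (equal-width bins over the observed range, not ggplot2's binning rule), and the fraction of empty bins is just one possible summary of how ragged a histogram looks.

```r
set.seed(666)

# Hypothetical helper: observations per equal-width bin over the observed range
count_bins <- function(x, n_bins) {
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  tabulate(findInterval(x, breaks, all.inside = TRUE), nbins = n_bins)
}

sample_sizes <- c(10, 100, 1000)
bin_numbers <- c(10, 30, 50)

# One sample per sample size, all drawn from the same standard normal
samples <- lapply(sample_sizes, rnorm)

# Fraction of empty bins: rows are sample sizes, columns are bin numbers
empty_fraction <- sapply(bin_numbers, function(b) {
  sapply(samples, function(x) mean(count_bins(x, b) == 0))
})
dimnames(empty_fraction) <- list(
  paste("n =", sample_sizes),
  paste(bin_numbers, "bins")
)

round(empty_fraction, 2)
```

With n = 10, at most 10 bins can contain an observation, so at least 20 of 30 bins and 40 of 50 bins must be empty no matter how the sample falls. The bars in those panels mostly mark where individual points happen to sit, which is what plotting the points directly would show more honestly.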
