- Produce scatterplots, boxplots, and histograms summarizing the Spotify data.
- Learn about universal and local plot settings.
- Use faceting effectively in ggplot2.
First, let’s load the packages and data we’ll need. If your R session is still open from the Data Manipulation session, you don’t need to do this.
ggplot2 is included in the tidyverse, so loading the tidyverse loads ggplot2 as well.
library(tidyverse)
spotify <- read_csv("data/spotify.csv")
## Parsed with column specification:
## cols(
## genre = col_character(),
## energy = col_double(),
## loudness = col_double(),
## tempo = col_double(),
## danceability = col_double(),
## lyrics = col_character()
## )
ggplot2
is a plotting package that makes it simple to create complex
plots from data in a data frame. It provides a more programmatic
interface for specifying what variables to plot, how they are displayed,
and general visual properties. Therefore, we only need minimal changes
if the underlying data change or if we decide to change from a bar plot
to a scatter plot. This helps in creating publication quality plots with
minimal amounts of adjustments and tweaking.
ggplot2 functions work best with tidy data. Well-structured data will
save you lots of time when making figures with ggplot2.
ggplot2
graphics are built step by step by adding new elements. Adding
layers in this fashion allows for extensive flexibility and
customization of plots.
To build a ggplot, we will use the following basic template that can
be used for different types of plots:
ggplot(data = < DATA > , aes( < MAPPINGS > )) + geom_FUNCTION()
data
Bind the plot to a specific data frame using the data
argument
aes()
Use the aes()
(aesthetics) function to select the variables to be
plotted and specify how to present them in the graph, e.g. as x/y
positions or characteristics such as size, shape, color, etc.
geom
Add ‘geoms’ – graphical representations of the data in the plot (points,
lines, bars). Some examples that you will use today:
- geom_point() for scatter plots, dot plots, etc.
- geom_smooth() for trendlines
- geom_boxplot() for boxplots
- geom_histogram() for histograms
To add a geom to the plot, use the + operator. This is somewhat similar
to %>%. It allows you to be modular with your plots.
Let’s get our hands dirty and make a quick scatterplot:
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point()
It looks like song energy and loudness have a positive relationship!
The + operator can really come in handy. It allows you to modify
existing ggplot objects. That way you can set up plot templates and
explore different types of plots without having to reinvent the wheel.
Here’s an example:
# Assign a plot to an object. This is your template
energy_loudness <- ggplot(data = spotify, aes(x = energy, y = loudness))
# Draw the plot. This will return the same plot as above
energy_loudness +
geom_point()
Now, if you want to just draw a trendline without points, you can!
energy_loudness +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Anything you put in the ggplot() function can be seen by any geom
layers that you add (i.e., these are universal plot settings). This
includes the x- and y-axis mapping you set up in aes(). You can also
specify mappings for a given geom independently of the mappings defined
globally in the ggplot() function.
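To make the difference between universal and local settings concrete, here is a minimal sketch. It uses a small made-up data frame (toy, not the real spotify data) so that it runs on its own; the values and the choice of method = "lm" are assumptions for illustration only.

```r
library(ggplot2)

# Toy stand-in for the spotify data (made-up values, for illustration only)
set.seed(1)
toy <- data.frame(
  energy = runif(60),
  genre  = rep(c("Folk", "Rock", "Pop"), each = 20)
)
toy$loudness <- -20 + 15 * toy$energy + rnorm(60)

# x and y are mapped in ggplot(), so every layer inherits them (universal)
p <- ggplot(data = toy, aes(x = energy, y = loudness)) +
  # color is mapped inside geom_point(), so it applies to the points only (local)
  geom_point(aes(color = genre)) +
  # geom_smooth() does not see the color mapping and draws one overall trendline
  geom_smooth(method = "lm")

p
```

Moving color = genre into the ggplot() call would make it universal instead, and geom_smooth() would then draw one trendline per genre.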
Exercise 1
Create the same scatterplot as above (geom_point()), but define the x
variable as energy and the y variable as tempo. What kind of trend do
you see?
Usually it takes some experimentation to get a plot looking just right.
ggplot2 makes it easy to iteratively build your plot.
Let’s return to the scatterplot of energy and loudness:
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point()
There are a ton of points on the plot. You can reduce overplotting by
adjusting the transparency of the points with the alpha parameter,
which you supply to the geom_point() function. Feel free to play
around with the alpha parameter. It ranges from 0 (fully transparent)
to 1 (fully opaque).
ggplot(data = spotify, aes(x = energy, y = loudness)) +
geom_point(alpha = 0.2)
Interesting. The points get denser as the two variables increase.
Let’s see if genre has some effect on this relationship. We do this by
mapping genre to the color aesthetic.
ggplot(data = spotify, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.2)
Woah buddy, that’s a lot of colors! Let’s use our data filtering skills to reduce the data set to a more manageable five genres. Let’s go with Folk, Rock, Soul, Rap, and Pop.
spotify_filtered <- spotify %>%
  filter(genre %in% c("Folk", "Rock", "Soul", "Rap", "Pop"))
Let’s make the same scatterplot as before, but use this reduced data set. There are going to be fewer data points on the graph, so we can raise the value of alpha.
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8)
It looks like there aren’t differences in the relationship between energy and loudness across genres, at least for the genres we’re working with.
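The filtering and plotting steps above can also be chained into a single expression, which shows how %>% and + divide the work. This is only a sketch on a small made-up data frame (toy is a stand-in for the spotify data, with invented values and an invented "Jazz" genre to filter out).

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the spotify data (made-up values, for illustration only)
set.seed(2)
toy <- data.frame(
  genre  = rep(c("Folk", "Rock", "Jazz", "Pop"), each = 10),
  energy = runif(40)
)
toy$loudness <- -20 + 15 * toy$energy + rnorm(40)

p <- toy %>%
  # Data steps are chained with %>%; %in% keeps rows matching any listed genre
  filter(genre %in% c("Folk", "Rock", "Pop")) %>%
  # The filtered data is piped in as the data argument of ggplot()
  ggplot(aes(x = energy, y = loudness, color = genre)) +
  # From here on, plot layers are added with +, not %>%
  geom_point(alpha = 0.8)

p
```

Mixing the two up is a common slip: once ggplot() has been called, every following piece is joined with +.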
Exercise 2
Let’s take one last effort to see if there are differences in the
relationship between energy and loudness across groups. Rather than
plotting the full scatterplot, plot the trendlines with the
geom_smooth()
function.
What if you’re interested in the differences in the distribution of a single variable across genres? Boxplots help you do that.
Here is a quick look at how the distribution of danceability differs across genres.
bp <- ggplot(data = spotify_filtered, aes(x = genre, y = danceability))
bp +
geom_boxplot()
It looks like Pop, Rap, and Soul are the most danceable genres, but Folk and Rock aren’t so much.
Boxplots are great for general trends, but they don’t represent the
distribution of the data super well. Violin plots do a better job of
this. Since you saved the last ggplot object as bp, you don’t have to
rewrite the long ggplot call again!
bp +
geom_violin()
The distributions look unimodal, with some skew. This looks fine to me!
The data visualizations so far convey patterns in the data well enough, but they lack refinement. Let’s spruce a plot up to make it more professional.
First, I’m going to introduce the geom_histogram() function, which
will allow you to visualize the distribution of a single variable.
Let’s take a look at danceability. Since you’re only visualizing one
variable, you only need to map it to the x aesthetic.
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
First, let’s address that message. Picking bin widths is a tricky endeavor, and a poor choice can bias the representation of your distribution. For simplicity, let’s just set the number of bins to 40.
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40)
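As the message suggests, a histogram can also be specified by bin width rather than bin count. Here is a minimal sketch comparing the two on made-up data (toy is an invented stand-in for the danceability column, and 0.05 is an arbitrary width chosen for illustration).

```r
library(ggplot2)

# Made-up values between 0 and 1, standing in for danceability
set.seed(3)
toy <- data.frame(danceability = rbeta(500, 5, 3))

# Option 1: fix the number of bins; ggplot2 works out the width
p_bins <- ggplot(data = toy, aes(x = danceability)) +
  geom_histogram(bins = 40)

# Option 2: fix the width of each bin in data units; the bin count follows
p_width <- ggplot(data = toy, aes(x = danceability)) +
  geom_histogram(binwidth = 0.05)

p_bins
p_width
```

If both arguments are supplied, binwidth overrides bins. Trying a few values of either one is a quick way to check whether the shape you see depends on the binning.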
Now, let’s fix the axis labels. Capitalizing the first letter should do
the trick! While we’re at it, let’s add an informative title. We can
do all of this with the labs() function.
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genres",
x = "Danceability",
y = "Count")
This is already looking better! The gray background isn’t very
pleasing, though. To remove this, the theme() function will work, but
it can be a headache to tweak everything just the way you want it. See
the theme() documentation for the insane list of options. Fortunately,
ggplot2 comes prepackaged with a set of themes where most of the
decisions are made for you. There are a few to choose from, but I’m
going to make an executive decision and pick one of my favorites,
theme_minimal().
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genres",
x = "Danceability",
y = "Count") +
theme_minimal()
This looks almost publishable! The font sizes could use a little tweaking and I’m sure there are more interesting colors you can think of, but this is pretty good for a few lines of code!
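If you want to try the font size and color tweaks, here is one possible sketch. The data frame is made up (toy stands in for spotify_filtered), and the specific choices (base_size = 14, "steelblue" bars with white outlines) are just assumptions to show where such settings go.

```r
library(ggplot2)

# Made-up values between 0 and 1, standing in for danceability
set.seed(4)
toy <- data.frame(danceability = rbeta(300, 5, 3))

p <- ggplot(data = toy, aes(x = danceability)) +
  # fill sets the bar color and color sets the bar outline; both are
  # constants here, so they go outside aes()
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(title = "Danceability (toy data)",
       x = "Danceability",
       y = "Count") +
  # base_size scales all text in the theme at once
  theme_minimal(base_size = 14)

p
```

Because fill and color are set outside aes(), they apply to every bar instead of being mapped to a variable.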
Exercise 3
Try making a histogram like this yourself! Pick your favorite numeric
variable from the spotify_filtered data set and either replicate the
above plot or tweak it and have some fun.
ggplot2
has a special technique called faceting that allows the user
to split one plot into multiple plots based on a factor included in the
dataset.
There are two types of facet functions:
- facet_wrap() arranges a one-dimensional sequence of panels to allow them to cleanly fit on one page.
- facet_grid() allows you to form a matrix of rows and columns of panels.
The spotify data set doesn’t lend itself well to facet_grid, so we’re
going to stick with facet_wrap.
To use facet_wrap, you need to supply it with the variable that the
plots should be split by. The variable needs to be wrapped in the
vars() function, which tells facet_wrap that this is a variable from
the data set being used to plot. A complete call looks like this:
facet_wrap(facets = vars(genre)).
You’re now going to take the pretty histogram you made earlier and split it up according to the genres.
ggplot(data = spotify_filtered, aes(x = danceability)) +
geom_histogram(bins = 40) +
labs(title = "Danceability of my favorite genres",
x = "Danceability",
y = "Count") +
theme_minimal() +
facet_wrap(facets = vars(genre))
Nice! We can see that Rap, Pop, and Soul are a little skewed towards higher danceability values.
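For comparison, here is roughly what a facet_grid() call could look like. Since facet_grid() needs two grouping variables, this sketch uses made-up data and invents a second one (energy_level is hypothetical and is not a column in the spotify data).

```r
library(ggplot2)

# Toy stand-in for the spotify data (made-up values, for illustration only)
set.seed(5)
toy <- data.frame(
  genre        = rep(c("Folk", "Rock"), each = 50),
  danceability = runif(100),
  energy       = runif(100)
)

# Hypothetical second grouping variable, derived only for this example
toy$energy_level <- ifelse(toy$energy > 0.5, "High energy", "Low energy")

p <- ggplot(data = toy, aes(x = danceability)) +
  geom_histogram(bins = 20) +
  # rows and cols each take variables wrapped in vars(), giving a matrix of panels
  facet_grid(rows = vars(energy_level), cols = vars(genre))

p
```

Each panel shows one combination of row and column variable, which is why facet_grid() pays off mainly when the data have two meaningful categorical variables.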
Exercise 4
Can you try this with the filtered scatter plot?
Here’s the original code:
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8)
See if you can spruce the plot up a bit first, like we did with the histogram!
Extension: If we have time, make this same faceted plot, but with
geom_smooth rather than geom_point.
Exercise 1
Doesn’t look like there is a strong trend! Maybe slightly positive.
ggplot(data = spotify, aes(x = energy, y = tempo)) +
geom_point()
Exercise 2
Looks like there is still no difference in the relationship across genres!
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Exercise 3
There’s really no right answer to this. Have some fun!
Exercise 4
Here is one example of a faceted scatterplot:
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_point(alpha = 0.8) +
facet_wrap(facets = vars(genre))
Here are faceted trendlines:
ggplot(data = spotify_filtered, aes(x = energy, y = loudness, color = genre)) +
geom_smooth() +
facet_wrap(facets = vars(genre))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There are many good data visualization resources to check out! Here are some of my favorites.