Cheese has been produced with, and without maturation culture to accelerate/differentiate maturation. Aroma profiling is obtained from cheeses of age 4, 7 and 10 weeks.
It is a good idea to start the script with setting the libraries you are going to use and then import data
library(rio)
library(ggplot2)
X <- import('data/cheese_aromas.xlsx')
Use table() to figure out how many samples there are in the combination of time and maturation culture
Produce two sets of summary stats: First, descriptive stats for one variable for each design cell and Second, the mean value for all aroma compounds across all design cells.
Use aggregate() for both. Here you get some inspiration
aggregate(X$`1Butanol`, list(X$ti...,X$...),function(x) c(length(x), mean(x),...))
Xm <- aggregate(X, list(X$...),...)
aggregate(X$`1Butanol`,
list(X$time_weeks,X$maturation_culture),
function(x) c(length(x), mean(x), sd(x), min(x), max(x)))
## Group.1 Group.2 x.1 x.2 x.3 x.4 x.5
## 1 4 culture 4.00000000 7.55126674 0.04391576 7.49283452 7.58820478
## 2 7 culture 5.00000000 7.39114520 0.04358967 7.34402937 7.44951422
## 3 10 culture 6.00000000 7.31741925 0.06783257 7.21151920 7.40316816
## 4 4 none 6.00000000 7.55678481 0.03004764 7.51631387 7.60017655
## 5 7 none 6.00000000 7.73671457 0.05523241 7.67189254 7.80647240
## 6 10 none 5.00000000 7.89461207 0.07581167 7.77225843 7.96464356
Xm <- aggregate(X, list(X$time_weeks,X$maturation_culture),mean)
We wish to make some plots of the raw data emphasizing the experimental design. So make a plot using ggplot() (from the ggplot2 package) to plot the design on the x-axis and some aroma compound on the y-axis, with a boxplot in the background and the individual observations as points.
Here you have some code for inspiration
ggplot(data = X, aes(time_culture,...)) +
geom_boxplot() + ...
ggplot(data = X, aes(time_culture,`1Butanol`, fill = maturation_culture)) +
geom_boxplot() +
geom_point()
- Try to change the x and y labels to make your plot look nicer (hint: add ylab("Your label name here")). To change the size of your axis titles, add: theme(axis.title=element_text(size=14))
ggplot(data = X, aes(time_culture,`1Butanol`, fill = maturation_culture)) +
geom_boxplot() +
geom_point() +
ylab('Response - 1-Butanol') +
theme(axis.title=element_text(size=14))
- Extra: What do you think the following code does? Try to add it to your plot:
theme(axis.text.x = element_text(angle = 45, hjust = 1))
- Extra: If there is time, play around with the geoms, and try change your previous plots to these (or other variations, if you prefer):
library(gridExtra)
p1 <- ggplot(data = X, aes(time_culture,`1Butanol`, fill = maturation_culture)) +
geom_boxplot() +
geom_jitter(color="blue") +
theme(legend.position="top") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("1-Propanol") + xlab("1-Butanol") +
labs(fill='Maturation culture')
p2 <- ggplot(data = X, aes(time_culture,`1Butanol`, color = maturation_culture)) +
geom_violin() +
theme(legend.position="top") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("1-Propanol") + xlab("1-Butanol") +
labs(color='Maturation culture')
grid.arrange(p1,p2,ncol=2)
Try to make the following, and see what is going on.
ggplot(data = X, aes(`1Butanol`, `1Propanol`,
color = factor(time_weeks),
shape = maturation_culture)) +
geom_point() +
stat_ellipse()
-
Try to change the size of the points. Note the difference if you put the size argument inside our outside of the aes() environment. Why do you think that is?
-
Try to make a scatter plot as above, but now add a (straight) line through the points by adding stat_smooth(). If you only want one line, then the coloring and shapes should be removed.
Could look something like this:
ggplot(data = X, aes(`1Butanol`, `1Propanol`)) +
geom_point() +
stat_smooth(method='lm') +
ylab('1-Propanol') +
xlab('1-Butanol')
- Difficult: can you figure out if it is possible to have a plot with colors, shapes and ellipses BUT only one straight line?
In order to avoid the grouping to also go onto stat_smooth(), the groups are overrulled in the stat_smooth() by setting it to something with Only one level.
ggplot(data = X, aes(`1Butanol`, `1Propanol`, color = time_culture, shape = maturation_culture)) +
geom_point() +
stat_ellipse() +
stat_smooth(method='lm', aes(group = 1)) +
ylab('1-Propanol') +
xlab('1-Butanol')
- Extra: Sometimes it is not a good idea to overload plots with information. In this case using both a color and a shape to convey information can be confusing for the reader. Try to remove the shapes and use facet_wrap(~maturation_culture) to separate the two maturation types into two plots.
We remove the shape aestetics, but still forces there only to be one line per plot. This however, is tied to the individal plots and are not the same between the two plots.
ggplot(data = X, aes(`1Butanol`, `1Propanol`, color = time_culture)) +
geom_point() +
stat_ellipse() +
facet_wrap(~maturation_culture) +
stat_smooth(method='lm', aes(group = maturation_culture)) +
ylab('1-Propanol') +
xlab('1-Butanol')
Instead of plotting the raw data, lets plot the mean value and put on some error-bars. Still with the x-axis and y-axis being the same.
We need a bit tricky version of aggregate() to make it work. And further some renaming.
Xag <- do.call(data.frame,
aggregate(X$`1Butanol`,
list(X$time_weeks,X$maturation_culture),
function(x) c(mean(x), sd(x)))
)
colnames(Xag) <- c('time','culture','mn','sd')
- Try to make a plot that looks like the one below.
Some code to get you started (hint: layer your plot one geom at a time, and use geom_errorbar() to add the errorbars):
ggplot(data = Xag, aes(time,mn, color = culture, ymin = mn - sd, ymax = mn+sd)) +
...
ggplot(data = Xag, aes(time,mn, color = culture, ymin = mn - sd, ymax = mn+sd)) +
geom_point() +
geom_errorbar() +
geom_line()
In the figure above it is not clear if week 7 and 10 for samples with added culture are significantly different. To find out we can perform a perform a T-test. T-tests can be done by using the function t.test - try to use the commands ?t.test or help(t.test) for help. Try to make a t-test on the difference in 1-butanol between week 7 and 10 for samples with culture.
You could consider to subset the original data.
Xtemp <- subset(X,...)
w7 <- subset(Xtemp,...)
w10 <- ...
t.test(w7$`1Butanol`...)
Or in a more comprehensive way, simply only include the data which you want to use.
t.test(data = X[X$time_weeks %in% c(7,10) & X$maturation_culture=='culture',],
`1Butanol` ~ time_weeks)
##
## Welch Two Sample t-test
##
## data: 1Butanol by time_weeks
## t = 2.177, df = 8.5567, p-value = 0.05898
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.003492978 0.150944876
## sample estimates:
## mean in group 7 mean in group 10
## 7.391145 7.317419
In any case you should get exactly the same answer.
Xtemp <- subset(X,maturation_culture=="culture")
w7 <- subset(Xtemp,time_weeks == c("7"))
w10 <- subset(Xtemp,time_weeks == 10)
t.test(w7$`1Butanol`,w10$`1Butanol`)
##
## Welch Two Sample t-test
##
## data: w7$`1Butanol` and w10$`1Butanol`
## t = 2.177, df = 8.5567, p-value = 0.05898
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.003492978 0.150944876
## sample estimates:
## mean of x mean of y
## 7.391145 7.317419
In the above example there were three time points and not only the two used for the t.test. If we want to analyze the effect of time we should to an analysis of variance ANOVA. This can be done by making a linear model with lm(y~x). This model can yield the ANOVA table by being passed to the anova() function.
-
Try to do a ANOVA on the culture samples with all time points (should time points be continuous or discrete?)
-
subset(X...)
-
lm(1-butanol~time)
-
anova(mdl)
-
Do the model assumptions hold? (Normally distributed residuals, try plot(mdl))
-
Extract the coefficients and compare them to the line plot above.
Xc <- subset(X,maturation_culture=="culture")
mdl <- lm(data = Xc, `1Butanol`~factor(time_weeks)-1)
anova(mdl)
## Analysis of Variance Table
##
## Response: 1Butanol
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(time_weeks) 3 822.50 274.166 90404 < 2.2e-16 ***
## Residuals 12 0.04 0.003
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mdl)
##
## Call:
## lm(formula = `1Butanol` ~ factor(time_weeks) - 1, data = Xc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.105900 -0.031603 -0.007959 0.033692 0.085749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(time_weeks)4 7.55127 0.02753 274.2 <2e-16 ***
## factor(time_weeks)7 7.39115 0.02463 300.1 <2e-16 ***
## factor(time_weeks)10 7.31742 0.02248 325.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05507 on 12 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 0.9999
## F-statistic: 9.04e+04 on 3 and 12 DF, p-value: < 2.2e-16
coefficients(mdl)
## factor(time_weeks)4 factor(time_weeks)7 factor(time_weeks)10
## 7.551267 7.391145 7.317419
par(mfrow=c(2,2)); plot(mdl)
This could also be analyzed as a two-way anova as there are in fact two design (experimental) factors!
mdl2 <- lm(data = X, `1Butanol`~maturation_culture*time_weeks)
plot(mdl2)
anova(mdl2)
## Analysis of Variance Table
##
## Response: 1Butanol
## Df Sum Sq Mean Sq F value Pr(>F)
## maturation_culture 1 0.79219 0.79219 258.6615 1.128e-15 ***
## time_weeks 1 0.02668 0.02668 8.7125 0.006328 **
## maturation_culture:time_weeks 1 0.41322 0.41322 134.9225 3.199e-12 ***
## Residuals 28 0.08575 0.00306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we want to make a PCA model on the multivariate data ( 22 GC features ). That is a low-rank, or simple representation of what goes on in all 22 directions in (here) two dimensions.
The PCA model it self can be calculated using build-in functionality in R, but the plotting requires some additional packages.
There are several places, where packages are made available for download and use. One of the is github. The package that we are going to use is appearing here. One really nice thing in R, is that the download and installation of packages is done directly from R. So no need to use a browser, point-and-click. However, you will need another package for getting the package. Namely the devtools package.
install.packages('devtools')
Then using this library we can get ggbiplot
library(devtools)
install_github('vqv/ggbiplot')
Use the prcomp() function to make a PCA model on the response variables DO NOT include design! Further, remember to pre-treat the input data using centering and scaling.
PCAmdl <- prcomp(X[,4:...], scale. = T, center = T)
A vanilla version plot of the PCA model looks as follows:
library(ggbiplot)
ggbiplot(PCAmdl)
As this function is engined by ggplot() most of the modification that can be used by ggplot() can also be used here. Specifically, you might want to extend the axes in order to be able to see the labels on the loadings. That is done by adding xlim(c(low_limit,high_limit)) to the plot (similarly for ylim()). Try to do this.
The ggbiplot() function can included classes for the scores (that is if the samples are from different groups). In this case we a design, so try to incorporate this via the groups = argument in the function. Further, set the argument ellipse = TRUE to get some even nicer representation.
ggbiplot(PCAmdl, groups = X$time_culture, ellipse = TRUE)
Although ggbiplot() fulfills most needs in terms of PCA visualization, you might want to make the plot even specific. In that case, the way to go, is to 1) export the scores or loadings and put them into a data.frame and 2) use ggplot() to make exactly the version you want.
As an example of something which might could be useful, is three individual score plots; one for each time point.
In order to make that we do:
scores <- data.frame(X, PCAmdl$x) # extract scores and put into data.frame combined with design (and original data)
head(scores,2)
## maturation_culture time_weeks time_culture X1Butanol X1Propanol
## 1 culture 4 04uger_culture 7.542316 4.108593
## 2 culture 4 04uger_culture 7.492835 4.142102
## X2Butanol Isopropyl_Alcohol X1Pentanol X2Pentanol Acetone Butanal
## 1 7.025338 6.479134 6.602876 4.872258 6.689937 4.867165
## 2 6.951893 6.625942 6.578138 4.792159 6.994308 4.957335
## Butanal_2methyl Butanal_3methyl Propanal_2methyl X2Heptanone
## 1 6.572955 7.125086 6.394694 7.078801
## 2 6.870942 7.375913 6.678770 7.091987
## Butanoic_acid_3methyl Butanoic_acid Acetic_acid Pentanoic_acid
## 1 3.490007 2.826384 5.223277 1.725756
## 2 3.169435 2.123829 4.905353 1.546119
## Formic_acid_butyl_ester Propanoic_acid_2hydroxy_ethyl_ester
## 1 5.922506 1.886282
## 2 5.915372 1.762319
## Acetic_acid_ethenyl_ester Pentanal X2Butanone Nonane PC1
## 1 7.008109 6.839849 8.404732 5.446106 -3.199387
## 2 6.839216 6.920548 8.410900 5.771083 -2.209928
## PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 1 -0.1222409 2.193389 -1.689694 -0.2408070 0.6711684 -0.01894841 0.7477768
## 2 0.2324992 3.124559 -2.220143 -0.5838099 0.7608341 0.61886464 0.5762888
## PC9 PC10 PC11 PC12 PC13 PC14
## 1 -0.01172872 -0.005058009 -0.08848177 0.06731906 0.04453961 0.07121666
## 2 0.72343630 -0.742602661 0.19935638 0.05851943 0.38476215 -0.37115335
## PC15 PC16 PC17 PC18 PC19
## 1 -0.0464973288 0.09346433 -0.02116017 -0.04722489 -0.008143801
## 2 0.0006669017 0.25174015 -0.07465636 0.06286470 -0.029978391
## PC20 PC21 PC22
## 1 -0.03668081 -0.07363705 0.04123002
## 2 -0.03161958 0.07170486 -0.02150698
You see that in addition to the original data, columns with PC1, PC2,..., PC22 is added.
Lets plot them:
ggplot(data = scores,aes(PC1,PC2, color = maturation_culture)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0)
Now try to make modifications to the code to get something like this.
HINT: you will need to play around with facet_wrap() or facet_grid(), stat_ellipse(), theme_bw() and some general settings in theme().
Here, we only have 22 variables, but we might have situations where there are way more. In such situations it would be nice to make the plot interpretable without having to scrutinize to much. One such way, is to add information on the points. Here in the form of classes:
These variables are aroma compounds and can be subdivided into the chemical categories:
- Alcohols
- Aldehydes
- Ketones
- Acids
We would like to infer this information in different ways, but first of all we need to set up the loadings in a data.frame() and define class from the name of the variable.
loadings <- data.frame(PCAmdl$rotation)
loadings$name <- rownames(loadings)
loadings$class <- 'No Class'
# Alcohols
loadings$class[grepl('ol',loadings$name)] <- 'Alcohols'
# Aldehydes
loadings$class[grepl('al',loadings$name)] <- 'Aldehydes'
# Ketones
loadings$class[grepl('one',loadings$name)] <- 'Ketones'
# Acids
loadings$class[grepl('acid',loadings$name)] <- 'Acids'
table(loadings$class)
##
## Acids Alcohols Aldehydes Ketones No Class
## 7 6 5 3 1
loadings$name[loadings$class=='No Class']
## [1] "Nonane"
Lets plot the loadings, color- and shape according to molecular class and put on labels.
ggplot(data = loadings, aes(PC1, PC2, color = class, label = name, shape = class)) +
geom_point(size = 10) +
geom_text() +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0)
You are pretty happy with this plot, but not totally satisfied!
Try to play around with label size, coloring, theme and label position to get things more clear.
Something like below is doable but how?
HINT: For the displacement of the labels you'll need a package called ggrepel.
A HARD one: The de-selection of some labels is a pretty hard task, but try to google and see if you can figure this one out!
Import the ggrepel package from where it is located (google it!), then the code below does the job, but please understand the individual parts of it. I.e. what does each line do to the plot?