-
Notifications
You must be signed in to change notification settings - Fork 81
Using R to analyze your SQM results
You have just run SqueezeMeta and now you are ready to analyze the results. But... there are too many useful tables! How can you face them? Try the R SQMtools package to explore with ease your results with R!
Basic workflow of SQMtools package.
All the functions and options are fully explained in the manual of this package, but for simplicity we provide some examples here.
We have used 16 samples from Hadza hunter-gatherer (8) and italians (8) gut microbiomes (Rampelli, S., Schnorr, S. L., Consolandi, C., Turroni, S., Severgnini, M., Peano, C., ... & Candela, M. (2015). Metagenome sequencing of the Hadza hunter-gatherer gut microbiota. Current Biology, 25(13), 1682-1693).
NOTE:
Don't forget to change the paths depending on your own directory structure!
First of all, load the SQMtools R package and your data:
library('SQMtools')
Hadza = loadSQM('/media/disk6/natalia/Hadza16')
# Now, there is a faster option to load your data if your project is large:
Hadza = loadSQM('/media/disk6/natalia/Hadza16', engine = 'data.table')
This function creates a SQM object that gathers all the most valuable information of the SQM project. The structure of this object is shown down below. To access to all this data you would have need to open and parse at least three different tables!
SQM Object structure
Given that this object is a list, use the next syntax to access to any vector, dataframe or matrix:
data_analyzed = $lvl1$lvl2$lvl3
Some examples to help you:
# Matrix genera vs. abundances
genus_tax=Hadza$taxa$genus$abund
genus_tax[1:5, 1:5]
H1 H10 H11 H12 H13
Unclassified Candidatus Aenigmarchaeota 46 13 5 1 0
Unclassified Candidatus Altiarchaeota 0 0 0 0 2
Unclassified Candidatus Bathyarchaeota 10 238 74 34 17
Unclassified Candidatus Korarchaeota 8 0 0 0 0
Unclassified Candidatus Lokiarchaeota 4 30 8 5 24
# Matrix with COGs vs. abundances
COG_table=Hadza$functions$COG$abund
COG_table[1:5, 1:5]
H1 H10 H11 H12 H13
COG0001 2560 9376 2613 1382 2306
COG0002 7475 14024 3281 2022 6014
COG0003 48 95 22 3 4
COG0004 1446 4754 1111 652 1670
COG0005 3724 10696 2625 1592 5021
# Raw table from SQM: 13.<SQM project name>.orftable
orf_table=Hadza$orfs$table
orf_table[1:5, 1:5]
Contig ID Molecule Method Length NT Length AA
megahit_1_116-250 megahit_1 CDS Prodigal 136 46
megahit_2_3-284 megahit_2 CDS Prodigal 283 95
megahit_3_2-268 megahit_3 CDS Prodigal 268 90
megahit_4_1-261 megahit_4 CDS Prodigal 262 88
megahit_5_3-620 megahit_5 CDS Prodigal 619 207
Then, you can always make a summary of the project to discard occasional problems.
sum_Hadza = summary(Hadza)
sum_Hadza$project_name
[1] "Hadza16"
sum_Hadza$samples
[1] "H1" "H10" "H11" "H12" "H13" "H15" "H14" "H16" "IT1" "IT11"
[11] "IT13" "IT14" "IT2" "IT3" "IT4" "IT5"
sum_Hadza$reads_in_orfs
H1 H10 H11 H12 H13 H15 H14 H16 IT1 IT11 IT13 IT14 IT2 IT3 IT4 IT5
23847271 52692441 13718056 7833832 23807939 16412877 61547031 25641150 3191223 7090403 3415890 2472831 15259976 12563001 4362936 29855475
NOTE:
If you feel comfortable using R, please feel free to load your project and explore it on your way.
If not, we provide you with some functions that could facilitate the analysis:
Now, we are ready to explore our SQM project!
Plot taxonomy at the phylum and family levels:
plotTaxonomy(Hadza, rank='phylum', count='percent')
Taxonomy barplot at the phylum level. The abundance of each taxonomic group in the whole metagenome is shown in percentages.
plotTaxonomy(Hadza, rank='family', count='percent')
Taxonomy barplot at the family level.
Note: save your plots using R functions, please check the documentation!
# Option A
#Save a svg/png file
svg('myfigure.svg') # To save a png, it would be png('myfigure.png')
plotTaxonomy(Hadza, rank='family', count='percent') #Plot
dev.off()
# Option B
#Use ggplot2
library(ggplot2)
image = plotTaxonomy(Hadza, rank='family', count='percent')
ggsave(file='myfigure.svg', plot = image, width=10, height=8) #svg
ggsave(file='myfigure.png', plot = image, width=10, height=8) #png
But don't settle for this! We provide you with some extra features to customize the colours, scale percentages, choose the most abundant, or your favourite, taxa, plot or not the 'Unclassified' (contigs or genes that have not been taxonomically annotated) or the 'Other' taxon (an artificial taxon that groups the less abundant taxa).
# Picking up taxa without rescaling and without 'Other'
plotTaxonomy(Hadza, rank='phylum', count='percent', tax = c('Proteobacteria','Actinobacteria','Spirochaetes'), rescale = F, color = c('yellowgreen','tan3','cornflowerblue'), others = F)
Taxonomy barplot of the chosen taxa without rescaling percentages and without the 'Other' taxon.
# Choosing taxa rescaling and without 'Other'
plotTaxonomy(Hadza, rank='phylum', count='percent', tax=c('Proteobacteria','Actinobacteria','Spirochaetes'), rescale = T, color = c('yellowgreen','tan3','cornflowerblue'), others = F)
Taxonomy barplot of the chosen taxa rescaling percentages but without the 'Other' taxon.
Now, we know something about the microbial diversity in the samples, but what are they doing? We should check the functional profile(KEGG, COG and PFAM annotations) of the samples:
plotFunctions(Hadza, fun_level = 'KEGG', count = 'tpm', N = 15)
Most abundant KEGGs. The abundance of each function (KEGG) is counted in TPM (‘transcripts per million’), genes from each function per million genes in the whole metagenome.
You may find something useful analyzing some KEGGs with more detail. We observe that KEGG 'K07133', which is an 'Uncharacterised protein', has a higher abundance in Hadza samples. Which is the protein sequence annotated with this KEGG?
# Extract the orf annotated with that KEGG
orf = Hadza$orfs$table[Hadza$orfs$table$`KEGG ID` == 'K07133', 1:15] # see COGs, PFAM...
# Extract the sequence
sequence=Hadza$orfs$seqs[rownames(orf)]
megahit_1163508_371-1111
"MVEKSNIDMLNRKIYSYLRDFFETEKKALLVSGARQVGKTFAIRKVGAECFADVVEFNFLNNPKYREAFKSPSDAKEILLRLSALAEKKLIPGTTLVFFDEVQECPEMVTAIKFLVEEGSYRYVMSGSLLGVELKDIRSVPVGYMAERDENYKVEGYEKGMQRFYAYLSLFTMSMLGLVVATNIFQMYLFWELVGVCSYLLIGFYYPKHAAVHASKKAFKKLELDPPSHLATRCAMGTSHSYRHHR*"
So you can run a BLAST or try to find out more about its domains, make a multiple alignment with other proteins ... It's up to you!
At this point, is important to mention that it is possible to subset all the results related to a specific taxon, function or bin. The result is again a SQM object with the same structure described previously, but only with the information related to the selected feature. The best of it: most of the SQMtools functions will perfectly work on it and with the information that you really want!
We decided to pick up all contigs from 'Proteobacteria' and plot their taxonomy at the genus level and their most abundant functions:
proteo=subsetTax(Hadza, 'phylum',tax='Proteobacteria', rescale_copy_number = F)
plotTaxonomy(proteo, 'genus','percent', N=10, rescale = T, others = T)
Taxonomy barplot at the genus level of the 'Proteobacteria' subset. The abundance percentage of Proteobacteria is relative to the whole metagenome
plotFunctions(proteo, fun_level = 'KEGG',count = 'copy_number')
KEGG functional profile using copy number of the 'Proteobacteria' subset. The abundance of each function (KEGG) is counted in copy numbers (the average copies of each function per genome in the whole metagenome).
In this case, we subset all the genes whose functional annotations contain the word 'antibiotic':
antibiotic = subsetFun(Hadza, fun = 'antibiotic', rescale_copy_number = F)
plotTaxonomy(antibiotic, 'genus','percent', N = 10, rescale = T, others = T)
Taxonomy barplot based on the 'antibiotic' subset. The barplot shows the taxonomic distribution and percentage in the different samples of the ORFs containing functions whose annotations include the word 'antibiotic'.
plotFunctions(antibiotic, fun_level = 'KEGG',count = 'tpm')
KEGG functional profile using TPM of the 'antibiotic' subset.
Also, you can subset all the genes related to a particular KEGG pathway. Ex:
aromatic_aa = subsetFun(Hadza, fun = 'Phenylalanine, tyrosine and tryptophan biosynthesis', rescale_tpm = F,rescale_copy_number = F)
This is useful to analyze which taxa are involved in a specific KEGG pathway:
# Plot taxonomy
plotTaxonomy(aromatic_aa, 'genus', 'percent', N = 10, rescale = F, others = T)
Taxonomy barplot at the genus level of the ‘aromatic_aa’ subset. The barplot shows the taxonomic distribution and percentage in the different samples of the ORFs containing functions from the ‘Phenylalanine,tyrosine and tryptophan biosynthesis’ KEGG pathway.
plotFunctions(aromatic_aa, fun_level = 'KEGG', count = 'tpm')
Most abundant KEGGs in the ‘aromatic_aa’ subset. The abundance of each function (KEGG) is counted in TPM (‘transcripts per million’), genes from each function per million genes in the whole metagenome.
The function subsetFun also accepts regular expressions if the parameter fixed = F:
# Select genes annotated with K06001 OR K01696
trp_synthesis = subsetFun(Hadza, fun = 'K06001|K01696', fixed = F)
plotTaxonomy(trp_synthesis, rank = 'genus')
Taxonomy barplot at the genus rank of the genes annotated with the K06001 or the K01696 KEGGs.
And if you are interested in bins, you can subset them from the SQM object:
maxbin005 = subsetBins(Hadza, 'maxbin.005.fasta.contigs' )
# Plot functions
plotFunctions(maxbin005, fun_level = 'KEGG', count = 'tpm')
Most abundant KEGGs in the ‘maxbin005’ subset. The abundance of each KEGG is counted in TPM.
Eg: Subset genes from carbohydrates metabolism of Bacteroidetes and from Amino acid metabolism of Proteobacteria.
bacteroidetes = subsetTax(Hadza, 'phylum','Bacteroidetes')
bacteroidetes_carbohydrates = subsetFun(bacteroidetes, 'Carbohydrate metabolism')
proteobacterias = subsetTax(Hadza, 'phylum','Proteobacteria')
proteobacterias_aminoacids = subsetFun(proteobacterias, 'Amino acid metabolism')
bact_carbo_proteo_amino = combineSQM(bacteroidetes_carbohydrates, proteobacterias_aminoacids, rescale_copy_number = F, tax_source = "contigs")
bact_carbo_proteo_amino = combineSQM(bacteroidetes_carbohydrates, proteobacterias_aminoacids, rescale_copy_number = F)
plotTaxonomy(bact_carbo_proteo_amino, 'phylum','percent', N = 10, rescale = T, others = T)
Taxonomy barplot at the phylum level of the 'Carbohydrates metabolism of Bacteroidetes and Amino acid metabolism of Proteobacteria' subset.
plotFunctions(bact_carbo_proteo_amino, fun_level = 'KEGG', count = 'tpm')
TPM of most abundant KEGGs of the 'Carbohydrates metabolism of Bacteroidetes and Amino acid metabolism of Proteobacteria' subset.
If you don't feel comfortable using R, you have the option to export your tables in tsv format, which can be opened by different software such as Excel or LibreOffice Calc:
exportTable(bact_carbo_proteo_amino$functions$COG$tpm, 'Hadza_COG.tsv')
If you prefer exploring your results through Krona charts:
exportKrona(Hadza)
Snapshot of a Krona chart.
Other convenient function is the option of analyzing KEGG pathways. Clear up any doubt by checking if a KEGG pathway is complete or not in one or all the samples. Eg: KEGG PATHWAY: ko00400 'Phenylalanine, tyrosine and tryptophan biosynthesis' .
# To know if a KEGG is present or not in one sample,
# you can provide a vector of colors.
# Colors must be in the same order that samples
# To distinguish between Hadza and Italians:
# Repeat one color 8 times because the first 8 samples are from Hadza.
# The next 8 samples are from Italians, repeat their color.
colors = c(rep('#006682', 8), rep('#c26e00', 8))
exportPathway(Hadza, '00400', output_suffix = 'aromatic_aa', sample_colors = colors, max_scale_value = 3, count= 'copy_number')
Phenylalanine, tyrosine and tryptophan biosynthesis. Each square is an enzyme and it is divided in so many parts as samples are in the metagenome. Each enzyme is linked to one or more KEGGs. If the enzyme is present in one sample, its division is colored with the color of that sample. Hadza samples are shown in blue and Italian ones in orange. The hue of the color depends on the value: the darkest colors imply a copy number ≥ 3. White is for not present enzymes.
Also, it is possible to visualize the log2 fold-change of the KEGG abundances among two categories. This will help you to know if an enzyme is more abundant in one category (Hadza) or in the other (Italy):
# To calculate the log-fold change of a KEGG, you can provide a list of vectors.
# One vector with the samples names by condition:
#Hadza samples
H.samples = Hadza$misc$samples[grep('H', Hadza$misc$samples)]
IT.samples = Hadza$misc$samples[grep('IT', Hadza$misc$samples)]
condition = list(H.samples, IT.samples)
# Choose one color per condition
colors = c('#006682', '#c26e00')
# Plot log2 fold changes using copy number abundances in the selected KEGG pathway:
exportPathway(Hadza, '00400', output_suffix = 'aromatic_aa.log2FC', fold_change_colors = colors, fold_change_groups = condition, count ='copy_number', max_scale_value = 1.5)
Phenylalanine, tyrosine and tryptophan biosynthesis. Each square is an enzyme. The hue of the color depends on the value of the log-fold-change calculated using copy numbers: the darkest orange implies a log-fold-change value ≥ 1.5 (more abundant in Italian samples) and the darkest blue a log-fold-change ≤ -1.5 (more abundant in Hadza samples). White is for not present enzymes.
Are you wondering how to use the SQM Object with other packages? Keep on reading!
Which functions are more abundant in the project? We use the DESeq2 package to analyze which functions are significantly more abundant in Hadza or Italians samples. All you have to do is select the matrix with your data from the SQM object and make a metadata dataframe:
# Tutorial based on http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
# Load the package
library('DESeq2')
# Required: matrix with the raw abundances, colData (metadata file)
metadata=as.data.frame(c(rep('H',8), rep('I',8)))
rownames(metadata)=colnames(Hadza$functions$KEGG$abund)
colnames(metadata)='condition'
# Verify sample order
all(rownames(metadata)==colnames(Hadza$functions$KEGG$abund))
# Convert your data to the format required by DESeq2
dds = DESeqDataSetFromMatrix(countData = Hadza$functions$KEGG$abund, colData = metadata, design = ~ condition)
# Remove low abundant KEGGs:
keep = rowSums(counts(dds)) >= 10
dds = dds[keep,]
# Choose factor levels
dds$condition=factor(dds$condition, levels=c('H','I'))
# Run DESeq2
dds2=DESeq(dds)
results=results(dds2, name='condition_I_vs_H')
plotMA(results, ylim=c(-2,2))
Gene differential abundance using DESeq2. Red coloured points are those with an adjusted p value lower than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.
Sometimes, it is useful to make an ordination plot or a cluster of your samples according to their similarity. If you use the 'vegan' package, you only have to take into account that you need to transpose the data.
library('vegan')
metadata = as.data.frame(c(rep('Hadza', 8), rep('Italy', 8)))
rownames(metadata) = colnames(Hadza$functions$KEGG$tpm)
colnames(metadata) = 'condition'
# Tranpose the matrix to have samples in rows.
kegg_tpm = t(Hadza$functions$KEGG$tpm)
MDS = metaMDS(kegg_tpm)
colvec = c('#006682','#c26e00') # colors
plot(MDS, display = 'sites')
with(metadata, points(MDS, display = 'sites', col = colvec[condition], pch = 19))
with(metadata, legend('topright', legend = levels(condition), col = colvec, pch = 21, pt.bg = colvec))
Non-metric multidimensional scaling (NMDS) ordination using KEGGs.
Additionally, you can run different statistical tests (eg: PERMANOVA test) to know if some variable (like the location) is affecting the clustering of samples in the ordination:
library('vegan')
metadata = as.data.frame(c(rep('Hadza', 8), rep('Italy', 8)))
rownames(metadata) = colnames(Hadza$functions$KEGG$tpm)
colnames(metadata) = 'condition'
# Tranpose the matrix to have samples in rows.
kegg_tpm = t(Hadza$functions$KEGG$tpm)
country = c(rep('Hadza', 8), rep('Italy', 8))
adonis(kegg_tpm~country)
Call :
adonis (formula = kegg_tpm ~ country)
Permutation : free
Number of permutation: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
country 1 0.034942 0.034942 7.9894 0.36333 0.001 ***
Residuals 14 0.061231 0.004374 0.63667
Total 15 0.096173 1.00000
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
According to this, samples from the same location are more similar in their functional annotations (KEGGs).
This tutorial provides you with some examples. Now you have a tool to explore your SQM project even though you don't know so much about R. Good luck!!
Note: Save your R workspace
save.image('mySession.RData') #Save R workspace
load('mySession.RData') #Load a previous R workspace
The SqueezeMeta paper at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.03349/full
The SQMtools paper at: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03703-2