written by: Elli Hung and Chloe Nichole Calica
[10 minutes] MGnify is a hub for analyzing and exploring microbiome-related nucleic acid sequences drawn from user submissions and the European Nucleotide Archive (ENA). Microbiome research studies the genetic material of micro-organisms within specific environments, allowing researchers to investigate microbial communities, the processes those communities undergo, and their complex interactions.
MGnify supports two types of searches:
- Text Search using names, biomes, projects, samples, or keywords
- Sequence Search, which runs a user-supplied protein query sequence in FASTA format against the MGnify database of predicted proteins obtained from assembly analysis.
MGnify can aid virus discovery by analyzing metagenomic data to identify viral sequences within microbial communities. This is crucial for uncovering novel viruses, understanding their interactions with hosts and the environment, exploring their role in microbiomes, and providing ecological context for each novel virus.
Tutorial Objective: This tutorial covers both types of searches using two different inputs. The Text Search uses a study investigating the impact of diet on gene expression in bovine rumen microbiomes, specifically a run in which an obelisk was found. The Sequence Search uses a full-length sequence of Arginine deiminase from Streptococcus sanguinis, taken from the MGnify example data.
- Access to MGnify Text Search
- Study Details:
  - bioProject: PRJEB7104
  - SRA run: ERR747931
Given a search term, MGnify returns the projects or samples whose metadata contain that term. Each result is indexed by a MGnify identifier made up of the prefix MGY, a letter corresponding to the data type (S for study, A for analysis), and a unique eight-digit number.
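The identifier layout described above can be checked with a short sketch. This is only an illustration of the stated pattern (MGY, a type letter, eight digits); the function name `parse_mgnify_id` and the example accession are hypothetical, and only the two type letters named in the text are mapped.

```python
import re

# Pattern taken from the description: "MGY", one type letter, eight digits.
MGNIFY_ID = re.compile(r"MGY([A-Z])(\d{8})")

# Only the type letters mentioned in this tutorial are mapped.
TYPE_NAMES = {"S": "study", "A": "analysis"}


def parse_mgnify_id(identifier):
    """Return (data_type, number) for a MGnify ID, or None if it does not match."""
    match = MGNIFY_ID.fullmatch(identifier.strip())
    if match is None:
        return None
    letter, digits = match.groups()
    return TYPE_NAMES.get(letter, "unknown"), int(digits)


# "MGYS00001234" is a made-up accession used only to show the format.
print(parse_mgnify_id("MGYS00001234"))  # ('study', 1234)
print(parse_mgnify_id("ERR747931"))     # None: an ENA run accession, not a MGnify ID
```

A check like this can help tell MGnify identifiers apart from the ENA accessions that appear next to them in the results table.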
In the text box, enter the bioProject accession PRJEB7104, then click Search.
In the results table, you should see the bioProject accession under ENA accession, alongside its corresponding MGnify ID. Click this ID to be redirected to the overview page for the study.
On the overview page of the study, you will find the following sections:
- External Links - a link to the project submission on ENA
- Classification - how the study is classified within MGnify. In this case, it is classified as root:Host-associated:Mammals:Digestive system:Large intestine:Fecal.
- Description - title of the study and where it was conducted
- Programmatic Access - links on analyzing the data via R or Python
- Analyses - a table of all the samples/runs associated with this project.
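The classification string is a colon-separated lineage that runs from the most general level (root) to the most specific. A minimal sketch of splitting it into levels, using the lineage quoted above (the helper name `split_lineage` is hypothetical):

```python
def split_lineage(lineage):
    """Split a colon-separated biome lineage into its levels, general to specific."""
    return [level.strip() for level in lineage.split(":") if level.strip()]


lineage = "root:Host-associated:Mammals:Digestive system:Large intestine:Fecal"
levels = split_lineage(lineage)

print(levels[-1])   # most specific level: Fecal
print(len(levels))  # number of levels: 6
```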
5. In the Analyses table, locate the sample with the SRA accession ERR747931 under the Run/Assembly accession column. Click its Analysis accession code, which is prefixed with MGYA.
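The same study and analysis records can also be reached programmatically, as the Programmatic Access section of the study page suggests. The sketch below only builds resource URLs and makes no network request. It assumes that the public MGnify REST API is served under `https://www.ebi.ac.uk/metagenomics/api/v1` with `studies/<accession>` and `analyses/<accession>` resources. Neither the base URL nor the helper `resource_url` comes from this tutorial, so confirm both against the links on the study page before relying on them.

```python
from urllib.parse import quote

# Assumption: base URL of the public MGnify REST API (not stated in this tutorial).
API_BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"


def resource_url(kind, accession):
    """Build the URL for a 'studies' or 'analyses' resource without fetching it."""
    if kind not in ("studies", "analyses"):
        raise ValueError(f"unsupported resource kind: {kind}")
    return f"{API_BASE}/{kind}/{quote(accession, safe='')}"


# "MGYA00000000" is a placeholder; substitute the Analysis accession
# you found in the Analyses table.
print(resource_url("analyses", "MGYA00000000"))
```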
In the analysis page of the run, there are six tabs containing the following:
- Overview - description of the study, sample, run, and pipeline used, as well as experimental details.
- Quality control - processing steps performed on the sequencing data, with graphs of the length and GC distribution of the sequences and the nucleotide abundance at each position.
- Taxonomic analysis - charts displaying the taxonomic assignments for the run based on the small or large subunit rRNA. For our run, only one chart is generated, for the small subunit rRNA. The Krona chart helps us identify which taxa are most prevalent at various taxonomic levels and gives us insight into the microbial community's diversity.
- Functional analysis - summaries of the functional content of the sequences in the sample, with a focus on InterPro and GO term annotations.
- Abundance and Comparison - information on metagenomic community diversity estimation, allowing for comparisons across study runs.
- Download - download links for the datasets used in the analysis, grouped into the following sections: Sequence data, Functional analysis, Taxonomic analysis, Statistics, and non-coding RNAs.
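The length and GC distributions shown on the Quality control tab summarize simple per-sequence values. As a rough illustration of what such values are (not MGnify's pipeline code), the sketch below computes the length and GC fraction of individual nucleotide sequences. The function name and the example reads are made up for this purpose.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


# Made-up reads, used only to show the calculation.
reads = ["ATGC", "GGCC", "ATAT"]
for read in reads:
    print(len(read), gc_fraction(read))
```

Plotting these values across all reads in a downloaded sequence file would give distributions comparable in spirit to the graphs on the tab.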
That's it! You've used the MGnify Text Search to obtain more information on a microbiome sequence.
From this tutorial and the analysis provided by MGnify Text Search, you should have gathered insights into the taxonomic diversity (e.g., species richness and evenness), functional potential (e.g., metabolic pathways), and ecological dynamics of the microbiome under investigation. This platform provides a comprehensive view of the microbial ecosystem and its potential roles in the environment or host system.
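Richness and evenness can be made concrete with two standard ecological measures: the Shannon index H' = -Σ p_i ln p_i and Pielou's evenness J = H' / ln S, where S is the number of observed taxa. The sketch below applies them to a list of per-taxon counts. It is a generic illustration that assumes you have such counts (for example, from a downloaded taxonomic table); it is not necessarily the estimator MGnify reports, and the counts are invented.

```python
import math


def shannon_diversity(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def pielou_evenness(counts):
    """Pielou's J = H' / ln(S), where S is the number of taxa observed."""
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0  # evenness is not informative for a single taxon
    return shannon_diversity(counts) / math.log(richness)


# Invented counts: an even community and one dominated by a single taxon.
even_counts = [10, 10, 10, 10]
skewed_counts = [37, 1, 1, 1]
print(pielou_evenness(even_counts))
print(pielou_evenness(skewed_counts))
```

Both communities have the same richness (four taxa), but the even community has an evenness of about 1, while the skewed one scores much lower.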
- Access to MGnify Sequence Search
- Amino acid sequence in FASTA format (link to example data)
The search returns information about the host, environment, biome, and associated studies that match the user's search query within the MGnify hub of microbiome data. In our case, we will determine which host and/or environment this Arginine deiminase amino acid sequence is found in, along with the studies associated with the matching sample.
1. Navigate to MGnify Sequence Search
We will input a FASTA-formatted amino acid sequence into the query box based on the example data.
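Before pasting a query, it can be useful to confirm that it is well-formed protein FASTA: a header line starting with `>` followed by one or more sequence lines. The sketch below is a simple local check, not part of MGnify. The accepted residue alphabet and the one-record restriction are assumptions of this example, and the peptide shown is a short placeholder rather than the Arginine deiminase sequence.

```python
# Assumption: standard amino acid letters plus common ambiguity codes and "*".
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def read_single_fasta(text):
    """Return (header, sequence) from a one-record protein FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected exactly one record")
    sequence = "".join(lines[1:]).upper()
    unexpected = set(sequence) - VALID_RESIDUES
    if not sequence or unexpected:
        raise ValueError(f"empty sequence or unexpected characters: {sorted(unexpected)}")
    return lines[0][1:], sequence


# Placeholder record used only to show the expected layout.
header, sequence = read_single_fasta(">demo placeholder\nMKT\nAYI\n")
print(header, len(sequence))
```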
There are different databases that a user can choose from to search their query against:
- Sequence type
  - All sequences - all sequences in the database
  - Partial sequences - only partial sequences
  - Full length sequences - only full length peptides
  - read more about partial and full length peptides
- Environments
  - Aquatic
    - Marine
    - Freshwater
  - Soil
- Host-associated biome
  - Human
    - Human - digestive system
    - Human - non-digestive
  - Animal
- Other (sequences not found in other environment or biome categories)
  - Engineered
  - Other
In this example, we will search only for full length sequences.
A list of matching sequences is presented, ordered by E-value (a lower E-value indicates a more significant match). The user can customize this table to display different headings. Clicking the Target link shows the amino acid FASTA record that the query sequence matched. The Run & Sample IDs link brings up the associated MGnify Sample Overview.
In our example, 2368 significant query matches were found. The top match has a bit score of 890.8 and an E-value of 3.5e-265.
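The ordering of the results table can be reproduced on any list of hits: keep those at or below an E-value cutoff and sort in ascending order, so the most significant match comes first. In the sketch below, the `Hit` record, the cutoff of 1e-5, and the two extra rows are hypothetical. Only the first hit's bit score and E-value come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    target: str
    bit_score: float
    e_value: float


def rank_hits(hits, max_e_value=1e-5):
    """Keep hits at or below the E-value cutoff, most significant (lowest) first."""
    kept = [hit for hit in hits if hit.e_value <= max_e_value]
    return sorted(kept, key=lambda hit: hit.e_value)


hits = [
    Hit("placeholder_hit_c", 20.0, 0.5),              # made-up weak hit
    Hit("top_match_from_tutorial", 890.8, 3.5e-265),  # values quoted above
    Hit("placeholder_hit_b", 50.0, 1e-10),            # made-up moderate hit
]

for hit in rank_hits(hits):
    print(hit.target, hit.bit_score, hit.e_value)
```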
Clicking the MGnify Sample Overview link for a query match displays a description of the sample, links to external websites (such as ENA or EBI BioSamples), the classification, associated studies, and other data.
In our example, the description shows that this is a human metagenome sample from G_DNA_Supragingival plaque of a male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)". The report shows that this sample is classified as human host-associated, and a Google Maps view is available if the sample information includes coordinates. Additional information about the metadata and associated studies is also available.
That's it! You've used MGnify Search to obtain information about the associated host and environment of an amino acid sequence!
Here we have run through a tutorial on how to use the MGnify Search online resource to search the MGnify hub of microbiome data for contextual information about the host, biome, and environment of a specific amino acid query.
- MGnify: the microbiome sequence data analysis resource in 2023 publication for the latest update on MGnify
- MGnify Online Tutorial for a more detailed tutorial of the entire MGnify hub
- MGnify Sequence Search Documentation for more information on the sequence search parameters and customization options