"Nextflow workflows for the analysis and discovery of viral sequences from metagenomes and viromes"
ViromeXplore contains a set of bioinformatic workflows that can be run reproducibly, sequentially, and interoperably, both with one another and with other workflows. It accepts multiple input types and starting points, including raw reads, contigs, genomes, abundance tables, phylogenies, and taxonomies. ViromeXplore contains the most varied and comprehensive workflows to explore viral genes and genomes from viromes and metagenomes to date. The future aim for ViromeXplore is to continue expanding the workflows and to include the most recent and efficient tools.
The workflows are implemented in Nextflow; besides Nextflow itself, only Docker or Singularity is needed to run ViromeXplore. Installation and usage details are provided below.
ViromeXplore contains the following workflows that can be used in different combinations according to the user’s needs:
Contamination estimation and read classification.
Virus assembly.
Viral sequence identification.
Viral genome abundance estimation and elongation.
Taxonomic assignment and functional annotation.
Network and phylogeny-based host prediction.
A detailed representation of the workflows is shown in the diagram below.
The workflows require Nextflow and either Singularity (default) or Docker. To install these tools, use the commands below. Alternatively, they can be installed using Conda, although this method is not recommended.
# Install Nextflow
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
nextflow -version
# Install Docker (Ubuntu)
sudo apt-get update
sudo apt-get install -y \
ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
docker --version
# Install Singularity from source
export VERSION=3.10.3
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
cd singularity
./mconfig && make -C builddir && sudo make -C builddir install
singularity --version
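After installation, a quick check can confirm that the required tools are reachable. The following is a minimal sketch, not part of ViromeXplore: the helper names `have_tool` and `pick_engine` are hypothetical, and the script only looks for the executables on the PATH.

```shell
# Return success if the named command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print the first container engine found, trying Singularity first
# because it is the default; return 1 if neither is installed.
pick_engine() {
    local engine
    for engine in singularity docker; do
        if have_tool "$engine"; then
            echo "$engine"
            return 0
        fi
    done
    return 1
}

if have_tool nextflow; then
    echo "nextflow: found"
else
    echo "nextflow: missing"
fi
echo "container engine: $(pick_engine || echo none)"
```

If the last line reports `none`, neither Singularity nor Docker was found and the workflows cannot start their containers.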
Run the workflows as follows:
nextflow ViromeXplore.nf --pipeline qc_classify --reads 'basename_{1,2}.fastq'
nextflow ViromeXplore.nf --pipeline viral_assembly --reads 'basename_{1,2}.fastq'
nextflow ViromeXplore.nf --pipeline find_viruses --contigs contigs.fasta
nextflow ViromeXplore.nf --pipeline high_quality_genomes --reads 'basename_{1,2}.fastq' --contigs contigs.fasta --viral_contigs viral_contigs.fasta
nextflow ViromeXplore.nf --pipeline taxonomy_annotation --viral_contigs viral_contigs_or_genomes.fasta
nextflow ViromeXplore.nf --pipeline host_prediction --phylogeny viral_phylogeny.nwk --taxonomy host_taxonomy.tsv --matrix virus_host_abundances.tsv
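When several samples are processed, it can help to generate one command per sample. The sketch below is a dry run that only prints commands. The function `print_qc_commands`, the sample names, and the `results_<sample>` directory naming are hypothetical. The read pattern is wrapped in single quotes so that the shell passes it through as one argument instead of expanding the braces itself; whether the workflow then resolves the pattern is assumed from the documented `'basename_{1,2}.fastq'` form.

```shell
# Print (dry run) one qc_classify command per sample prefix.
# Sample names and the results_<sample> naming are hypothetical.
print_qc_commands() {
    local sample
    for sample in "$@"; do
        printf "nextflow ViromeXplore.nf --pipeline qc_classify --reads '%s_{1,2}.fastq' --result_dir results_%s\n" \
            "$sample" "$sample"
    done
}

print_qc_commands sampleA sampleB
```

Reviewing the printed commands before running them makes it easier to spot a wrong prefix or a clash between result directories.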
Containers are available for all processes. To launch a run, use Singularity (default) or Docker by specifying the appropriate profile:
- For Docker: use -profile docker
- For Singularity: use -profile singularity
Depending on the cluster system, slurm can also be added to the profile, which lets the user configure the resources for each step. To do so, modify the config/local.config file and run with one of the following profiles:
-profile singularity,slurm
-profile docker,slurm
Include the selected profile option when running the workflow script.
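As an illustration of how the profile option combines with a workflow command, here is a minimal sketch. The helper `profile_option` is hypothetical; it only assembles the profile names listed above and prints the resulting command instead of running it.

```shell
# Build the -profile option from a container engine and an optional
# "slurm" scheduler, using only the profile names documented above.
profile_option() {
    local engine="$1"
    local scheduler="${2:-}"
    case "$engine" in
        docker|singularity) ;;
        *) echo "unknown engine: $engine" >&2; return 1 ;;
    esac
    if [ "$scheduler" = "slurm" ]; then
        echo "-profile ${engine},slurm"
    else
        echo "-profile ${engine}"
    fi
}

# Dry run: print the full command rather than executing it.
echo "nextflow ViromeXplore.nf --pipeline find_viruses --contigs contigs.fasta $(profile_option singularity slurm)"
```

Note that `-profile` takes a single dash because it is a Nextflow option, whereas workflow parameters such as `--pipeline` take two.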
--pipeline
Valid pipeline name:
[qc_classify / viral_assembly / find_viruses / high_quality_genomes / taxonomy_annotation / host_prediction]
For qc_classify and viral_assembly:
--reads
Reads in FASTQ format:
['basename_{1,2}.fastq']
For find_viruses:
--contigs
Contigs file in FASTA format:
['file.fasta']
For high_quality_genomes:
--reads
Reads in FASTQ format:
['basename_{1,2}.fastq']
--contigs
Contigs file obtained from assembly:
['file.fasta']
--viral_contigs
Viral classified contigs or genomes to extend:
['file.fasta']
For taxonomy_annotation:
--viral_contigs
Viral classified contigs or genomes:
['file.fasta']
For host_prediction:
--phylogeny
Phylogenetic tree for the viruses being analyzed (NEWICK format):
['virus_phylogeny.nwk']
--taxonomy
Lineage of host in NCBI terms (TAB DELIMITED file with ID and lineage):
['taxonomy_file.tsv']
--matrix
Matrix containing the abundances of viruses and hosts (TAB DELIMITED):
['matrix_abundances.tsv']
(Columns correspond to taxa and rows to samples).
--result_dir
Name of directory where the results from all analyses will be written.
[default: results]
--cpus
Number of CPUs to use during the job.
[default: all available]
--memory
Memory in GB to be assigned for the job.
[default: 12 GB]
--help
Help statement.
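The optional parameters can be appended to any workflow command. The sketch below assembles them with the documented defaults. The helper `resource_args` is hypothetical, and passing `--memory` as a bare number of GB is an assumption based on the description above; check the `--help` output for the exact format expected.

```shell
# Assemble the optional arguments. Defaults mirror the documented ones:
# result_dir=results, memory=12 (GB). --cpus is omitted when not given,
# so that the workflow falls back to all available CPUs.
# Assumption: --memory takes a bare number of GB.
resource_args() {
    local result_dir="${1:-results}"
    local memory_gb="${2:-12}"
    local cpus="${3:-}"
    local args="--result_dir $result_dir --memory $memory_gb"
    if [ -n "$cpus" ]; then
        args="$args --cpus $cpus"
    fi
    echo "$args"
}

# Dry run: print the full command rather than executing it.
echo "nextflow ViromeXplore.nf --pipeline viral_assembly --reads 'basename_{1,2}.fastq' $(resource_args my_results 32 8)"
```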
- qc_classify
Pipeline to detect non-viral contamination and classify viral reads.
(ILLUMINA files required)
- viral_assembly
Pipeline for virome read QC and assembly.
(ILLUMINA files required)
- find_viruses
Pipeline for viral sequence identification and annotation.
(FASTA contig file required)
- high_quality_genomes
Pipeline for obtaining viral contig abundance and improving genomes.
(FASTA contig file required, FASTA viral contig file required, ILLUMINA files required)
- taxonomy_annotation
Pipeline for assigning taxonomy and gene annotations to viral contigs/genomes.
(FASTA viral contig/genome file required)
- host_prediction
Pipeline to determine virus-host pairs using co-occurrence and phylogeny.
(ABUNDANCE tsv matrix, PHYLOGENY newick tree, HOST TAXONOMY NCBI terms and host ID required)
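The input requirements above can be summarised as a lookup from pipeline name to required parameters. This is a minimal sketch for use in wrapper scripts: the function `required_inputs` is hypothetical and simply restates the requirements listed in this document.

```shell
# Print the parameters each pipeline requires, as listed above.
# Returns 1 for a name that is not one of the six valid pipelines.
required_inputs() {
    case "$1" in
        qc_classify|viral_assembly) echo "--reads" ;;
        find_viruses)               echo "--contigs" ;;
        high_quality_genomes)       echo "--reads --contigs --viral_contigs" ;;
        taxonomy_annotation)        echo "--viral_contigs" ;;
        host_prediction)            echo "--phylogeny --taxonomy --matrix" ;;
        *) echo "unknown pipeline: $1" >&2; return 1 ;;
    esac
}

for pipeline in qc_classify find_viruses high_quality_genomes host_prediction; do
    echo "$pipeline: $(required_inputs "$pipeline")"
done
```

A wrapper script could call this before launching Nextflow to fail early when an input is missing.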
If you have used ViromeXplore, please consider citing: