Spriggan

Spriggan is a Nextflow pipeline for the assembly of bacterial whole genome sequence data and identification of antibiotic resistance genes.

Usage

The pipeline is designed to start from raw, paired-end Illumina reads. Start the pipeline using:

nextflow run spriggan/main.nf --input [path-to-samplesheet] --outdir [path-to-outdir] -profile [docker,singularity,aws]

You can specify a version of the pipeline and run it directly from the GitHub repository by using:

nextflow run wslh-bio/spriggan -r [version] --input [path-to-samplesheet] --outdir [path-to-outdir] -profile [docker,singularity,aws]

You can also test the pipeline with example data using -profile test or -profile test_full:

nextflow run spriggan/main.nf --outdir [path-to-outdir] -profile test[_full],[docker/singularity]

Input

Spriggan's inputs are paired Illumina FASTQ files for each sample and a comma-separated sample sheet containing the sample name, the path to the forward reads file, and the path to the reverse reads file for each sample. An example of the sample sheet's format is shown in the table below.

sample	fastq_1	fastq_2
sample_name	/path/to/sample_name_R1.fastq.gz	/path/to/sample_name_R2.fastq.gz
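
For a directory with many samples, writing the sample sheet by hand is error-prone. The following is a minimal sketch of generating one, not part of Spriggan itself. It assumes the reads follow the `_R1.fastq.gz` / `_R2.fastq.gz` naming shown in the example row; the function name `build_samplesheet` and the output file name are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def build_samplesheet(fastq_dir, out_csv):
    """Write a sample,fastq_1,fastq_2 sheet for every complete R1/R2 pair."""
    fastq_dir = Path(fastq_dir)
    rows = []
    for r1 in sorted(fastq_dir.glob("*_R1.fastq.gz")):
        # Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
        sample = r1.name[: -len("_R1.fastq.gz")]
        r2 = fastq_dir / f"{sample}_R2.fastq.gz"
        if r2.exists():  # skip samples that lack a reverse-read file
            rows.append([sample, str(r1.resolve()), str(r2.resolve())])
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        writer.writerows(rows)
    return rows


# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz"):
        (Path(tmp) / name).touch()
    sheet = Path(tmp) / "samplesheet.csv"
    demo_rows = build_samplesheet(tmp, sheet)
    demo_header = sheet.read_text().splitlines()[0]
```

Absolute paths are written so that the sheet does not depend on the directory the pipeline is launched from.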

Parameters

Spriggan's main parameters and their defaults are shown in the table below:

Parameter	Parameter description and default
input	Path to comma-separated file containing information about the samples in the experiment.
outdir	Output directory where the results will be saved. Absolute path must be used for storage on cloud infrastructure.
qualitytrimscore	Minimum read quality for trimming (default: 10)
trimdirection	Read trimming direction (default: "lr")
minlength	Minimum read length for trimming (default: 10)
contaminants	Path to FASTA file of contaminating sequences for trimming
mincoverage	Minimum coverage threshold to pass a sample (default: 40)
kraken_db	Path to Kraken database for classification
plus	Use AMRFinderPlus' --plus option (default: false)
selected_genes	Genes of interest to pull from AMRFinderPlus output (default: 'NDM|OXA|KPC|IMP|VIM')
ncbi_assembly_stats	Path to NCBI database (default: NCBI_Assembly_stats_20240124.txt)

Workflow outline

Read trimming and quality assessment

Read trimming and cleaning are performed using BBtools v38.76, which trims low-quality bases from the reads and removes PhiX contamination. FastQC v0.11.8 is then used to assess the quality of the raw and cleaned reads.

Genome assembly

Assembly of the cleaned and trimmed reads is performed using Shovill v1.1.0.

Assembly quality assessment

Quality assessment of the assemblies is performed using QUAST v5.0.2.

Genome coverage

Mean and median genome coverage are determined by mapping the cleaned reads back to their assembly using BWA v0.7.17-r1188 and calculating depth using samtools v1.10.
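
To illustrate the depth calculation, here is a minimal sketch that summarizes per-position depth in the three-column, tab-separated format written by `samtools depth` (reference name, position, depth). It is not Spriggan's own code. The function name `coverage_stats` is hypothetical, and comparing the mean (rather than the median) against the mincoverage threshold is an assumption.

```python
import statistics


def coverage_stats(depth_lines, mincoverage=40):
    """Return (mean, median, passed) from `samtools depth`-style lines."""
    depths = []
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        # Columns: reference name, 1-based position, depth
        depths.append(int(line.split("\t")[2]))
    if not depths:
        return 0.0, 0.0, False
    mean = statistics.mean(depths)
    median = statistics.median(depths)
    # Assumption: the pass/fail check uses the mean depth
    return mean, median, mean >= mincoverage


# Toy input: three positions on one contig
demo_depth = ["contig1\t1\t30", "contig1\t2\t50", "contig1\t3\t70"]
demo_mean, demo_median, demo_passed = coverage_stats(demo_depth)
```

Note that `samtools depth` omits zero-depth positions unless it is run with `-a`, so a summary computed this way can overstate coverage for assemblies with unmapped regions.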

Antimicrobial resistance gene detection

Antimicrobial resistance genes, as well as point mutations, are identified using AMRFinderPlus v3.10.30. Setting the plus parameter enables the AMRFinderPlus "--plus" option, which adds results for other genes, such as virulence factors and stress-response genes.

Spriggan can also generate a table of results for genes of interest with the selected_genes parameter. Spriggan will search for matches to the gene(s) of interest in the AMRFinderPlus results and make a separate table called selected_ar_genes.tsv. The list of genes must be separated by | and enclosed in single quotes in the config file.
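
The effect of a |-separated gene list can be sketched as a regular-expression filter over the AMRFinderPlus table. This is an illustration only: how Spriggan performs the match internally is not stated here, and the function name `select_genes`, the use of the "Gene symbol" column, and the toy rows are assumptions.

```python
import csv
import io
import re


def select_genes(amr_tsv_text, pattern, column="Gene symbol"):
    """Return rows whose gene symbol matches any alternative in `pattern`."""
    regex = re.compile(pattern)  # 'A|B|C' matches A or B or C
    reader = csv.DictReader(io.StringIO(amr_tsv_text), delimiter="\t")
    return [row for row in reader if regex.search(row[column])]


# Toy table with only two columns; real AMRFinderPlus output has many more
demo_tsv = (
    "Gene symbol\tClass\n"
    "blaKPC-2\tBETA-LACTAM\n"
    "tet(A)\tTETRACYCLINE\n"
    "blaNDM-1\tBETA-LACTAM\n"
)
demo_hits = select_genes(demo_tsv, "NDM|OXA|KPC|IMP|VIM")
demo_symbols = [row["Gene symbol"] for row in demo_hits]
```

Because the match is a substring search, a short pattern such as IMP can also match unrelated gene symbols that happen to contain those letters, so narrower patterns are safer when the gene list is customized.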

MLST scheme

MLST scheme is classified using MLST v2.17.6. Multiple schemes are available for specific organisms, and STs from all available schemes are reported for those organisms.

Contamination detection

Contamination is detected by classifying reads using Kraken2 v2.0.8 with the Minikraken2_v1_8GB database. A custom Kraken database can be used with the kraken_db parameter.

Assembly calculations

Calculations are performed with Pandas v1.3.2 on Kraken2 v2.0.8 and QUAST v5.0.2 data to determine the expected : actual assembly length ratio, the actual : expected assembly length ratio, and GC content statistics. The NCBI Assembly statistics database is referenced during these calculations. The expected : actual length ratio is included in results/spriggan_report.csv.
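
The two ratios are reciprocals of each other, as the short sketch below shows. It assumes the expected length is the reference assembly length for the classified organism taken from the NCBI Assembly statistics file and the actual length is the total assembly length reported by QUAST. The function name `assembly_ratios` and the example lengths are hypothetical.

```python
def assembly_ratios(actual_length, expected_length):
    """Return both length ratios for one assembly."""
    if actual_length <= 0 or expected_length <= 0:
        raise ValueError("lengths must be positive")
    return {
        "expected_to_actual": expected_length / actual_length,
        "actual_to_expected": actual_length / expected_length,
    }


# Hypothetical example: a 5.5 Mb assembly against a 5.0 Mb expected length
demo_ratios = assembly_ratios(actual_length=5_500_000, expected_length=5_000_000)
```

An actual : expected ratio well above 1 can point to contamination or duplicated contigs, while a ratio well below 1 suggests an incomplete assembly.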

Summary

Results are summarized using MultiQC v1.11 and Pandas v1.3.2. The main outputs of Spriggan are a CSV file named spriggan_report.csv and an HTML report file named spriggan_multiqc_report.html. The spriggan_report.csv file summarizes the results of the QC, classification, and MLST steps of the pipeline. The spriggan_multiqc_report.html file contains tables and figures of quality metrics from the FastQC, BBDuk, Samtools, Kraken, and QUAST steps of the pipeline.

Output

An example of Spriggan's output directory structure and its output files can be seen below:

spriggan_results
├── amrfinder
│   ├── *.amr.tsv
│   ├── *.fa
│   ├── amrfinder_predictions.tsv
│   ├── amrfinder_summary.tsv
│   └── selected_ar_genes.tsv
├── bbduk
│   ├── *.fastq.gz
│   ├── *.adapter.stats.txt
│   ├── *.bbduk.log
│   ├── *.trim.txt
│   └── bbduk_results.tsv
├── calculate
│   ├── *_Assembly_ratio_*.tsv
│   └── *_GC_content_*.tsv
├── coverage
│   └── coverage_stats.tsv
├── fastqc
│   ├── *.html
│   ├── *.zip
│   └── fastqc_summary.tsv
├── kraken
│   ├── *.kraken2.txt
│   ├── kraken_results.tsv
│   └── kraken2.log
├── mlst
│   ├── *.alleles.tsv
│   ├── *.mlst.tsv
│   └── mlst_results.tsv
├── multiqc
│   ├── multiqc_data
│   │   ├── *.json
│   │   ├── *.txt
│   │   └── multiqc.log
│   ├── multiqc_plots
│   │   ├── pdf
│   │   │   └── *.pdf
│   │   ├── png
│   │   │   └── *.png
│   │   └── svg
│   │       └── *.svg
│   └── spriggan_multiqc_report.html
├── pipeline_info
│   ├── *.html
│   ├── *.txt
│   ├── samplesheet.valid.csv
│   └── software_versions.yml
├── quast
│   ├── *.quast.report.tsv
│   ├── *.transposed.quast.report.tsv
│   └── quast_results.tsv
├── results
│   └── spriggan_report.csv
├── samtools
│   ├── *.bam
│   ├── *.depth.tsv
│   └── *.stats.txt
└── shovill
    ├── *.contigs.fa
    ├── *.sam
    └── shovill_output
        ├── contigs.gfa
        ├── shovill.corrections
        ├── shovill.log
        └── spades.fasta

Notable output files:
spriggan_report.csv - Summary table of each step in Spriggan
spriggan_multiqc_report.html - HTML report generated by MultiQC
*.contigs.fa - Shovill assembly for each sample
*.amr.tsv - AMR genes identified in each sample by AMRFinderPlus
*.mlst.tsv - MLST scheme identified for each sample

Authors

Kelsey Florek, WSLH Senior Genomics and Data Scientist
Abigail Shockey, WSLH Bioinformatician and Data Scientist
