LolliPop - a tool for Deconvolution for Wastewater Genomics
The LolliPop tool is part of the V-pipe workflow for analysing NGS data of short viral genomes.
Wastewater-based monitoring has become an increasingly important source of information on the spread of SARS-CoV-2 variants since clinical tests are declining and may eventually disappear.
LolliPop was developed to improve wastewater-based genomic surveillance as the number of variants of concern increased, and to account for mutations shared among variants. It relies on kernel-based deconvolution and leverages the time-series nature of the samples. This approach makes it possible to generate higher-confidence relative abundance curves despite the very high noise and overdispersion present in wastewater samples.
It has been integrated, together with COJAC, into V-pipe, a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. These tools now form the basis of the SARS-CoV-2 wastewater genomic surveillance commissioned by the Swiss Federal Office of Public Health, a cornerstone of the COVID-19 pandemic surveillance in Switzerland. This surveillance covers daily samples at ten wastewater treatment plants across Switzerland from February 2021 onward, and delivers weekly updates of the variants' relative abundance curves.
LolliPop provides several classes that can be imported in Jupyter notebooks:
from lollipop import *
See notebook WwSmoothingKernel.ipynb in directory preprint/
Here are the available command-line tools:
command | purpose |
---|---|
lollipop generate-mutlist | Generate the mutlist used when looking for variant using variant signatures |
lollipop getmutations from-basecount | Search a single sample for mutations and retrieve frequency from a TSV table of per-position base counts produced by V-pipe |
lollipop deconvolute | Run the deconvolution on a timeline of mutations |
Use option -h / --help to see available command-line options:
$ lollipop generate-mutlist --help
Usage: lollipop generate-mutlist [OPTIONS] VOC_YAML
Generate the mutlist used when looking for variant using variant signatures
Options:
-o, --output, --out TSV Write results to this output TSV instead of
'mutlist.tsv'
-p, --out-pangovars, --output-variants-pangolin YAML
Write a YAML mapping shortnames/columnnames
to the Pangolineages (useful to make the
'variants_pangolin' section of deconvolute's
input configuration)
-g, --genes GFF Add 'gene' column to table
-d, --voc-dir PATH Scan directory for additional voc YAML files
-v, --verbose / -V, --no-verbose
Verbose (dumps table on terminal)
-h, --help Show this message and exit.
$ lollipop getmutations from-basecount --help
Usage: lollipop getmutations from-basecount [OPTIONS] BASECOUNT
Search mutations and retrieve frequency from a TSV table produced by V-pipe
Options:
-o, --outname, --output PATH Filename of the final output table. If not
provided, it defaults to
<samplename>_mutations.txt
-m, --muttable, --mutationtable PATH
Mutations helper table
-a, --based INTEGER Are the positions in the tsv 0-based or
1-based?
Argument used for simple concatenation:
These options allows subsequently building
simply by concatenation (using `xsv`, or
even `tail` & `head`)
-l, --location TEXT Location of this sample
-d, --date TEXT Date of this sample
Argument use for V-pipe integration:
These options help tracking output to the
2-level samples structure used by V-pipe
-s, --sample, --samplename TEXT
'sample_name' as found in the first column
of the V-pipe samples.tsv
-b, --batch TEXT 'batch'/'date' as in the second column of
the V-pipe samples.tsv
-h, --help Show this message and exit.
$ lollipop deconvolute --help
Usage: lollipop deconvolute [OPTIONS] TALLY_TSV
Deconvolution for Wastewater Genomics
Options:
-o, --output CSV Write results to this output CSV instead of
'deconvolved.csv'
-C, --fmt-columns Change output CSV format to one column per
variant (normally, variants are each on a
separate line)
--out-json, --oj JSON Also write a JSON results for upload to Cov-
spectrum, etc.
-c, --variants-config, --var YAML
Variants configuration used during
deconvolution [required]
--variants-dates, --vd YAML Variants to scan per periods (as determined
with cojac)
-k, --deconv-config, --dec YAML
Configuration of parameters for kernel
deconvolution [required]
-l, --loc, --location, --wwtp, --catchment NAME
Name(s) of location/wastewater treatment
plant/catchment area to process
-fl, --filters YAML List of filters for removing problematic
mutations from tally
-s, --seed SEED Seed the random generator
-n, --n-cores N Cores for parallel processing for multiple locations,
defaults to 1 for sequential processing
-nf, --namefield COLUMN column to use as 'names' for the entries in
tally table. By default, if 'pos' and 'base'
exist a column 'mutations' will be created
and used as name.
-h, --help Show this message and exit.
Analysis can be performed on virus samples sequenced with most tiled multiplexed PCR amplification protocols. Coverage across the whole genome of the virus increases the chance that some variant-specific mutations are picked up, and thus the confidence, even if dropouts occur in other regions of the genome (e.g., dropouts on the fragment carrying the binding domain).
Sampling dates are important information to keep track of because LolliPop leverages time series.
The analysis uses variant description YAMLs that list the mutations to be searched for, the same YAMLs as used by COJAC. You can refer to COJAC's commands cojac sig-generate, to help generate exhaustive lists from requests on Cov-Spectrum or TSV files of Covariants.org, or cojac phe2cojac, to import ready-made manually-curated lists from YMLs available at PHE Genomic's Standardised Variant Definitions.
Generate a list of mutations to be searched:
lollipop generate-mutlist --output mutlist.tsv --out-pangovars variants_pangolin.yaml --genes Genes_NC_045512.2.GFF3 -- vocs/delta_mutations_full.yaml vocs/omicron_ba1_mutations_full.yaml vocs/omicron_ba2_mutations_full.yaml
- Annotating the list with a GFF file is optional: LolliPop's deconvolution does not use gene information, but it can be useful for downstream visualizations.
- --out-pangovars writes a table mapping short names back to full Pangolineages. It can help you write a variants config, or be used in lieu of one.
By default, LolliPop searches for the mutations in a basecount TSV, a table that gives the per-position coverage of each base (A, T, C, G) and of deletions. V-pipe generates such a TSV using smallgenomeutilities's command aln2basecnt; you can use it in your workflow when starting from alignments:
aln2basecnt --first 1 --basecnt sample1.basecnt.tsv.gz --coverage sample1.coverage.tsv.gz --name "sample1" sample1.bam
--first specifies whether the positions in the TSV are 1-based (like samtools) or 0-based (like pysam).
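To illustrate what looking up a mutation in a per-position base count table involves, here is a minimal Python sketch. The column layout (pos, A, C, G, T, -) and the helper name mutation_frequency are assumptions made for illustration; they are not taken from V-pipe or LolliPop, and the real file layout may differ.

```python
import csv
import io

# Hypothetical basecount table: one row per position, one column per base.
# This layout is an assumption, not the actual V-pipe format.
BASECOUNT_TSV = (
    "pos\tA\tC\tG\tT\t-\n"
    "100\t90\t0\t10\t0\t0\n"
    "101\t0\t50\t0\t0\t50\n"
)


def mutation_frequency(rows, pos, base, based=1):
    """Return (frequency, coverage) of `base` at the 1-based position `pos`.

    `based` states whether positions stored in the table are 0-based or 1-based.
    """
    for row in rows:
        # Convert the table coordinate to a 1-based position before comparing
        table_pos = int(row["pos"]) + (1 - based)
        if table_pos == pos:
            coverage = sum(int(row[b]) for b in ("A", "C", "G", "T", "-"))
            if coverage == 0:
                return None, 0
            return int(row[base]) / coverage, coverage
    return None, 0  # position not present in the table


rows = list(csv.DictReader(io.StringIO(BASECOUNT_TSV), delimiter="\t"))
freq, cov = mutation_frequency(rows, pos=100, base="G", based=1)
print(f"G at position 100: {freq:.2f} (coverage {cov})")
```

Reading the same table as 0-based shifts every coordinate by one, which is why the indexing convention has to match the tool that produced the table.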
Then, search this TSV file for the mutations from the list generated above:
lollipop getmutations from-basecount --based 1 --output sample1.mut.tsv --location "main plant" --date "2023-02-27" -m mutlist.tsv -- sample1.basecnt.tsv.gz
- Options --location and --date are a straightforward way to add the time series information to each sample.
(A future version of LolliPop will be extended to support VCFs and coverage TSVs as a more standard input.)
Once the above step has been run on every sample of the cohort, combine all individual samples into a single heatmap-like table tracking the mutations over time across all samples. This can be done by concatenating all the per-sample mutation TSVs with a tool such as xsv:
xsv cat rows --output tallymut.tsv sample*.mut.tsv
- If you have not tagged each individual sample with --location and --date, now would be a good time to add the extra columns to tallymut.tsv, e.g., with a join operation.
- Note that this file can get quite large. It is possible to compress it on the fly:
… | xsv fmt --out-delimiter '\t' | gzip > tallymut.tsv.gz
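If xsv is not at hand, the same concatenation can be sketched with the Python standard library. The helper name concat_tsv and the column names in the toy data are hypothetical; the sketch only assumes that every per-sample table shares the same header.

```python
import csv
import gzip
import io


def concat_tsv(sources, out):
    """Concatenate TSV streams into `out`, writing the header only once."""
    writer = None
    for src in sources:
        reader = csv.DictReader(src, delimiter="\t")
        if writer is None:
            writer = csv.DictWriter(
                out, fieldnames=reader.fieldnames,
                delimiter="\t", lineterminator="\n",
            )
            writer.writeheader()
        elif reader.fieldnames != writer.fieldnames:
            raise ValueError("all per-sample tables must share the same columns")
        for row in reader:
            writer.writerow(row)


# Toy per-sample tables (hypothetical columns), kept in memory for the demo
sample1 = io.StringIO("pos\tbase\tfrac\tdate\n100\tG\t0.1\t2023-02-27\n")
sample2 = io.StringIO("pos\tbase\tfrac\tdate\n100\tG\t0.3\t2023-02-28\n")

buffer = io.BytesIO()
# Compress on the fly; pass a filename instead of `buffer` to write to disk
with gzip.open(buffer, "wt", newline="") as out:
    concat_tsv([sample1, sample2], out)

tally = gzip.decompress(buffer.getvalue()).decode()
print(tally)
```

With real files, the sources would be opened per-sample TSVs and the output would be something like gzip.open("tallymut.tsv.gz", "wt", newline="").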
The deconvolution can now be run on this table.
Various aspects of the kernel-based deconvolution can be set with a YAML file (passed with --deconv-config): the type of kernel (box vs. Gaussian) and its parameters (such as bandwidth), the regressor used, whether to use bootstrapping to generate confidence values, whether to estimate confidence intervals with Wald, whether to compute the estimates on a logit scale, etc.
Various presets are available in the presets/ subdirectory.
For example:
kernel: 'gaussian'
kernel_params:
  bandwidth: 10
regressor: 'robust'
deconv_params:
  min_tol: 1e-3
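To give an intuition for what a Gaussian kernel with a bandwidth does in a time series, here is a minimal Python sketch of kernel-weighted smoothing. It is not LolliPop's implementation: it only shows how samples close in time to a target day receive more weight, and it assumes (as an illustration) that the bandwidth plays the role of the Gaussian's standard deviation in days. The function names and the toy data are hypothetical.

```python
import math


def gaussian_weights(days, target_day, bandwidth):
    """Normalised Gaussian kernel weights of each sampling day around target_day."""
    raw = [math.exp(-0.5 * ((d - target_day) / bandwidth) ** 2) for d in days]
    total = sum(raw)
    return [w / total for w in raw]


def kernel_smooth(days, values, target_day, bandwidth=10):
    """Kernel-weighted average of noisy values observed on `days`."""
    weights = gaussian_weights(days, target_day, bandwidth)
    return sum(w * v for w, v in zip(weights, values))


# Toy data: noisy frequencies of one mutation, sampled every 5 days
days = [0, 5, 10, 15, 20]
freqs = [0.10, 0.30, 0.15, 0.40, 0.35]

smoothed = [kernel_smooth(days, freqs, t, bandwidth=10) for t in days]
print([round(s, 3) for s in smoothed])
```

A larger bandwidth averages over more neighbouring days and gives smoother curves; a smaller one follows individual samples more closely.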
The variants configuration file (passed with --variants-config) controls the data set that the deconvolution runs on. At minimum, it should have a section (variants_pangolin) mapping the short names back to full Pangolineages. This can be copied from the file generated with --out-pangovars in the first step (or that file can be reused as-is).
It can also optionally specify time limits (start_date and/or end_date), the subset of variants (variants_list) or locations (locations_list) to run the deconvolution on, variant columns to delete before any further processing (variants_not_reported), whether to ignore deletions (remove_deletions), etc.
See an example in config_preprint.yaml.
The deconvolution performs much better if only the variants known to be present in the mixture are considered. For longer-running experiments, it is therefore possible to specify (with --variants-dates), for different time periods, the list of variants to consider for deconvolution, based on their prior detection with a sensitive tool, e.g., by running COJAC and looking for amplicons carrying combinations of mutations that are exclusive to certain variants.
For example:
var_dates:
  '2022-06-15':
    - BA.1
    - BA.2
    - BA.4
    - BA.5
    - BA.2.75
  '2022-08-15':
    - BA.4
    - BA.5
    - BA.2.75
    - BQ.1.1
  '2022-11-01':
    - BA.4
    - BA.5
    - BA.2.75
    - BQ.1.1
    - XBB
See variants_dates_example.yaml.
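The lookup implied by such a file can be sketched in Python as follows. This assumes that each key marks the start of a period that lasts until the next key; that reading, and the helper name variants_for, are assumptions for illustration rather than a description of LolliPop's internals.

```python
from bisect import bisect_right
from datetime import date

# Same content as the example above, written as a Python dict
VAR_DATES = {
    "2022-06-15": ["BA.1", "BA.2", "BA.4", "BA.5", "BA.2.75"],
    "2022-08-15": ["BA.4", "BA.5", "BA.2.75", "BQ.1.1"],
    "2022-11-01": ["BA.4", "BA.5", "BA.2.75", "BQ.1.1", "XBB"],
}


def variants_for(sample_date, var_dates):
    """Return the variant list of the latest period starting on or before sample_date."""
    starts = sorted(date.fromisoformat(d) for d in var_dates)
    idx = bisect_right(starts, sample_date)
    if idx == 0:
        return []  # the sample predates the first period
    return var_dates[starts[idx - 1].isoformat()]


print(variants_for(date(2022, 9, 1), VAR_DATES))
```

A sample from 1 September 2022 falls into the period starting on 15 August, so only the four variants listed there would be considered.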
Some mutations might be problematic and need to be taken out: e.g., due to drop-outs in the multiplex PCR amplification, they do not show up in the data, and this could be misinterpreted by LolliPop as proof of absence of a variant. The optional filters file (passed with --filters) contains a collection of filters. Each filter has a list of statements with the following syntax:
- <column> <op> <value>
Valid op are:
- ==: on that line, the value in column is exactly value
  - for simple strings this can be omitted: - proto v3 is synonymous with - proto == v3
- <=: the value is less than or equal to value
- >=: the value is greater than or equal to value
- <: the value is less than value
- >: the value is greater than value
- !=: the value is not value
- in: the value is found in the list specified in value
- ~: the value matches the regular expression in value
  - regex can be quoted using / or @
- !~: the value does not match the regular expression in value
Any arbitrary column found in the input file can be used.
All statements of a filter are combined with a logical AND, and matching lines are removed from the tally table.
Filters are processed in the order found in the YAML file.
For example:
# filter to remove test samples
remove_test:
  - sample ~ /^Test/
# filter to remove an amplicon that has drop-outs
amplicon75:
  - proto v3
  - date > 2021-11-20
  - pos >= 22428
  - pos <= 22785
See an example in filters_preprint.yaml.
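The filter semantics described above can be sketched in a few lines of Python. This is an illustration of the rules as stated (statements combined with a logical AND, matching rows removed, filters applied in order), not LolliPop's own parser. All function names are hypothetical, the in operator is left out because its list syntax is not described here, and comparing numerically when both sides are numbers is an assumption.

```python
import operator
import re

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<=": operator.le, ">=": operator.ge,
    "<": operator.lt, ">": operator.gt,
}
KNOWN_OPS = set(OPS) | {"~", "!~"}


def parse(statement):
    """Split '<column> <op> <value>'; a missing operator means '=='."""
    column, _, rest = statement.partition(" ")
    op, _, value = rest.strip().partition(" ")
    if op not in KNOWN_OPS:
        return column, "==", rest.strip()
    return column, op, value.strip()


def coerce(a, b):
    """Compare as numbers when both sides are numeric, else as strings."""
    try:
        return float(a), float(b)
    except ValueError:
        return str(a), str(b)  # ISO dates compare correctly as strings


def statement_matches(row, statement):
    column, op, value = parse(statement)
    cell = row[column]
    if op in ("~", "!~"):
        # Strip the optional /.../ or @...@ quoting around the regex
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "/@":
            value = value[1:-1]
        found = re.search(value, cell) is not None
        return found if op == "~" else not found
    left, right = coerce(cell, value)
    return OPS[op](left, right)


def apply_filters(rows, filters):
    """Apply filters in order; drop rows matching ALL statements of a filter."""
    for statements in filters.values():
        rows = [r for r in rows
                if not all(statement_matches(r, s) for s in statements)]
    return rows


FILTERS = {
    "remove_test": ["sample ~ /^Test/"],
    "amplicon75": ["proto v3", "date > 2021-11-20",
                   "pos >= 22428", "pos <= 22785"],
}
# Toy tally rows; values are strings, as they would be when read from a TSV
ROWS = [
    {"sample": "Test01", "proto": "v3", "date": "2021-10-01", "pos": "100"},
    {"sample": "A1", "proto": "v3", "date": "2021-12-01", "pos": "22500"},
    {"sample": "A2", "proto": "v3", "date": "2021-12-01", "pos": "23000"},
    {"sample": "A3", "proto": "v4", "date": "2021-12-01", "pos": "22500"},
]

kept = apply_filters(ROWS, FILTERS)
print([r["sample"] for r in kept])
```

In this toy data, Test01 is removed by the first filter and A1 by the second; A2 stays because its position is outside the amplicon, and A3 because it was sequenced with a different protocol.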
lollipop deconvolute --output=deconvoluted.tsv --out-json=deconvoluted_upload.json --var=variants_conf.yaml --vd=variants_dates.yaml --dec=deconv_linear.yaml --seed=42 --n-cores=8 -- tallymut.tsv
The output is tabular:
location | date | variant | proportion |
---|---|---|---|
main plant | 2023-02-27 | BA.4 | 0.000 |
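For downstream work it is sometimes handier to have one column per variant, which is what the --fmt-columns option produces directly. The reshaping itself can be sketched in Python as follows; the column names are taken from the table above, while the exact header of the real output file, the helper name long_to_wide, and the toy values are assumptions.

```python
import csv
import io
from collections import defaultdict

# Toy long-format output (made-up values, for illustration only)
LONG_CSV = (
    "location,date,variant,proportion\n"
    "main plant,2023-02-27,BA.4,0.000\n"
    "main plant,2023-02-27,BA.5,0.250\n"
    "main plant,2023-02-28,BA.5,0.300\n"
)


def long_to_wide(rows):
    """Group proportions by (location, date), keyed by variant name."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["location"], row["date"])
        wide[key][row["variant"]] = float(row["proportion"])
    return dict(wide)


wide = long_to_wide(csv.DictReader(io.StringIO(LONG_CSV)))
for (location, day), proportions in sorted(wide.items()):
    print(location, day, proportions)
```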
Optionally, LolliPop can also package the results in a JSON structure, e.g., to be sent to online dashboards:
{
  "mainplant": {
    "BA.4": {
      "timeseriesSummary": [
        {
          "date": "2023-02-27",
          "proportion": 0.000
        },
        {
          "date": "… etc …"
        }
      ]
    }
  }
}
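A consumer of this JSON can flatten it back into records with a few lines of Python. The nesting (location, then variant, then timeseriesSummary) follows the example above; the helper name flatten is hypothetical, and the sketch assumes every entry carries both a date and a proportion.

```python
import json

# A complete toy document following the structure shown above
RESULTS_JSON = """
{
  "mainplant": {
    "BA.4": {
      "timeseriesSummary": [
        {"date": "2023-02-27", "proportion": 0.0}
      ]
    }
  }
}
"""


def flatten(results):
    """Yield (location, variant, date, proportion) tuples from the nested structure."""
    for location, variants in results.items():
        for variant, payload in variants.items():
            for point in payload["timeseriesSummary"]:
                yield location, variant, point["date"], point["proportion"]


records = list(flatten(json.loads(RESULTS_JSON)))
print(records)
```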
The repository cowwid contains real-world examples of downstream analysis of the output of LolliPop.
We recommend using bioconda software repositories for easy installation. You can find instructions to set up your bioconda environment at the following address:
LolliPop and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.
You can install lollipop in its own environment and activate it:
conda create -n lollipop lollipop
conda activate lollipop
# test it
lollipop --help
And to update it to the latest version, run:
# activate the environment if not already active:
conda activate lollipop
conda update lollipop
Or you can add it to the current environment (e.g.: in environment base):
conda install lollipop
If you want to install the software yourself, you can see the list of dependencies in conda_lollipop_env.yaml.
We recommend using conda to install them:
conda env create -f conda_lollipop_env.yaml
conda activate lollipop
Install lollipop using pip:
# install both the python module and the cli
pip install '.[cli]'
# (this will autodetect dependencies already installed by conda)
The command lollipop should now be accessible from your PATH:
# activate the environment if not already active:
conda activate lollipop
lollipop --help
You can remove the conda environment if you do not need it any more:
# exit the lollipop environment first:
conda deactivate
conda env remove -n lollipop
LolliPop has its dependencies in a pyproject.toml managed with poetry and can be installed with it.
# If not installed system-wide: manually run poetry-dynamic-versioning
poetry-dynamic-versioning
# (this sets the version string from the git currently cloned and checked out)
poetry install --extras "cli"
- Support VCFs and coverage TSVs as an alternative to the basecount TSV
Long term goal:
- [x] Inputs other than SNVs: can deconvolute COJAC's output tables
If you use this software in your research, please cite:
- David Dreifuss, Ivan Topolsky, Pelin Icer Baykal & Niko Beerenwinkel. "Tracking SARS-CoV-2 genomic variants in wastewater sequencing data with LolliPop." medRxiv; doi:10.1101/2022.11.02.22281825
If you experience problems running the software:
- We encourage using the issue tracker on GitHub
- For further enquiries, you can also contact the V-pipe Dev Team
- You can contact the publication’s corresponding author