GitHub - tinavisnovska/gvanno: Generic germline variant annotation pipeline

gvanno - workflow for functional and clinical annotation of germline nucleotide variants

Overview

The germline variant annotator (gvanno) is a software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). The workflow is packaged as a Docker image, and it can also be run through the Singularity framework.

gvanno accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon Ensembl’s Variant Effect Predictor (VEP) and vcfanno. It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record per sample variant.

News

April 22nd 2021 - dev update
- Data updates (ClinVar, UniProt, GWAS Catalog, dbNSFP, Pfam, Open Targets Platform)
- Software update (VEP 103)
- Two new options added:
  - --vep_regulatory - annotates variants for overlap with regulatory regions
  - --docker-uid - set Docker user id
December 7th 2020 - 1.4.1 release
- Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
- Software update (VEP 102)
- Removed DisGeNET annotations (Open Targets serves a similar purpose)

Annotation resources

VEP - Variant Effect Predictor v103 (GENCODE v37/v19 as the gene reference dataset)
dbNSFP - Database of non-synonymous functional predictions (v4.2, March 2021)
gnomAD - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
dbSNP - Database of short genetic variants (build 153) - from VEP
1000 Genomes Project - phase3 - Germline variant frequencies genome-wide (May 2013) - from VEP
ClinVar - Database of variants related to human health/disease phenotypes (April 2021)
CancerMine - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 34)
Open Targets Platform - Target-disease and target-drug associations (2021_02, February 2021)
UniProt/SwissProt KnowledgeBase - Resource on protein sequence and functional information (2021_02, April 2021)
Pfam - Database of protein families and domains (v34.0, March 2021)
NHGRI-EBI GWAS Catalog - Catalog of published genome-wide association studies (April 12th 2021)

Getting started

STEP 0: Python

An installation of Python (version >=3.6) is required to run gvanno. Check that Python is installed by typing python --version in your terminal window.
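
The same check can be made from inside Python. The snippet below is a minimal sketch; the function name `python_is_supported` is made up for this illustration and is not part of gvanno.

```python
import sys

MIN_VERSION = (3, 6)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = sys.version.split()[0]
    if python_is_supported():
        print("Python version OK:", found)
    else:
        print("Python >= 3.6 is required, found:", found)
```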

STEP 1: Installation of Docker

Install the Docker engine on your preferred platform
- installing Docker on Linux
- installing Docker on Mac OS
- NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with PCGR (an example being mounting of data volumes)
Test that Docker is running, e.g. by typing docker ps or docker images in the terminal window
Adjust the computing resources dedicated to Docker, i.e.:
- Memory: minimum 5GB
- CPUs: minimum 4
- How to - Mac OS X
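
The Docker test described above can also be scripted. This is a rough sketch, assuming the Docker command-line client is called `docker` and is on your PATH; the helper name `docker_is_running` is hypothetical. It only checks that `docker ps` exits successfully and does not inspect the memory or CPU settings.

```python
import shutil
import subprocess


def docker_is_running(docker_cmd="docker"):
    """Return True if the Docker CLI is found and `docker ps` exits with status 0."""
    if shutil.which(docker_cmd) is None:
        return False  # client not installed or not on PATH
    result = subprocess.run(
        [docker_cmd, "ps"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker is running" if docker_is_running() else "Docker is not available")
```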

1.1: Installation of Singularity (optional)

Note: this works for Singularity version 3.0 and higher.
Install Singularity
Test that singularity works by running singularity --version
If you are in the gvanno directory, build the singularity image like so:

cd src

sudo ./buildSingularity.sh

STEP 2: Download gvanno and data bundle

Clone the latest version in development
Download and unpack the latest assembly-specific data bundle in the gvanno directory
- grch37 data bundle (approx 18Gb)
- grch38 data bundle (approx 20Gb)
- Unpacking: gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -
A data/ folder within the gvanno-X.X software folder should now have been produced
Pull the gvanno Docker image (dev) from DockerHub (approx 2.4Gb):
- docker pull sigven/gvanno:dev (gvanno annotation engine)
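
If you prefer to unpack the data bundle from Python rather than with gzip and tar, something along these lines should work. It is a sketch only: `unpack_bundle` is a hypothetical helper, and it assumes the bundle is a gzip-compressed tar archive that unpacks into a data/ folder, as described above.

```python
import tarfile
from pathlib import Path


def unpack_bundle(bundle_path, gvanno_dir):
    """Extract a gzip-compressed tar bundle into gvanno_dir and return the data/ path."""
    gvanno_dir = Path(gvanno_dir)
    with tarfile.open(str(bundle_path), mode="r:gz") as archive:
        # extractall() writes whatever paths the archive contains,
        # so only use it on bundles from a source you trust.
        archive.extractall(path=str(gvanno_dir))
    data_dir = gvanno_dir / "data"
    if not data_dir.is_dir():
        raise FileNotFoundError("expected {} after unpacking".format(data_dir))
    return data_dir
```

A call would look like `unpack_bundle("gvanno.databundle.grch37.YYYYMMDD.tgz", ".")`, with YYYYMMDD replaced by the date stamp of the bundle you downloaded.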

STEP 3: Input preprocessing

The gvanno workflow accepts a single input file:

An unannotated, single-sample VCF file (>= v4.2) with germline variants (SNVs/InDels)

We strongly recommend that the input VCF is compressed and indexed using bgzip and tabix. NOTE: If the input VCF contains multi-allelic sites, these will be subject to decomposition.
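
To illustrate what decomposition of a multi-allelic site means, the sketch below splits a record with several ALT alleles into one record per allele. It is a simplified illustration and not the procedure gvanno itself uses: the function `decompose_site` is hypothetical, it keeps only the eight fixed VCF columns, and it does not rewrite per-allele INFO values or genotypes as a real decomposition tool would.

```python
def decompose_site(vcf_line):
    """Split one site-level VCF record into one record per ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least the 8 fixed VCF columns")
    # CHROM POS ID REF ALT QUAL FILTER INFO; any sample columns are dropped
    fixed = fields[:8]
    records = []
    for alt in fixed[4].split(","):
        record = list(fixed)
        record[4] = alt  # one ALT allele per output record
        records.append("\t".join(record))
    return records


if __name__ == "__main__":
    for rec in decompose_site("1\t1000\t.\tA\tC,T\t50\tPASS\t."):
        print(rec)
```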

STEP 5: Run example

Run the workflow with gvanno.py, which takes the following arguments and options:

usage:
gvanno.py -h [options]
--query_vcf <QUERY_VCF>
--gvanno_dir <GVANNO_DIR>
--output_dir <OUTPUT_DIR>
--genome_assembly <grch37|grch38>
--sample_id <SAMPLE_ID>
--container <docker|singularity>

gvanno - workflow for functional and clinical annotation of germline nucleotide variants

Required arguments:
--query_vcf QUERY_VCF
			    VCF input file with germline query variants (SNVs/InDels).
--gvanno_dir GVANNO_DIR
			    Directory that contains the gvanno data bundle, e.g. ~/gvanno-dev
--output_dir OUTPUT_DIR
			    Output directory
--genome_assembly {grch37,grch38}
			    Genome assembly build: grch37 or grch38
--container {docker,singularity}
			    Run gvanno with docker or singularity
--sample_id SAMPLE_ID
			    Sample identifier - prefix for output files

VEP optional arguments:
--vep_regulatory      Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
--vep_lof_prediction  Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
--vep_n_forks VEP_N_FORKS
			    Number of forks for Variant Effect Predictor (VEP) processing, default: 4
--vep_buffer_size VEP_BUFFER_SIZE
			    Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
			    - set lower to reduce memory usage, default: 5000
--vep_pick_order VEP_PICK_ORDER
			    Comma-separated string of ordered transcript properties for primary variant pick in
			    Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
--vep_skip_intergenic
			    Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False

Other optional arguments:
--force_overwrite     By default, the script will fail with an error if any output file already exists.
			    You can force the overwrite of existing result files by using this flag, default: False
--version             show program's version number and exit
--no_vcf_validate     Skip validation of input VCF with Ensembl's vcf-validator, default: False
--docker-uid DOCKER_USER_ID
			    Docker user ID. Default is the host system user ID. If you are experiencing permission errors, try setting this to root (`--docker-uid root`)
--vcfanno_n_processes VCFANNO_N_PROCESSES
			    Number of processes for vcfanno processing (see https://github.com/brentp/vcfanno#-p), default: 4
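
When gvanno is called from another script, the required arguments listed above can be assembled programmatically. The following is a sketch under a few assumptions: `build_gvanno_command` is a hypothetical helper, it assumes gvanno.py sits directly inside the gvanno directory, and it only uses options documented in the usage text above.

```python
import sys
from pathlib import Path


def build_gvanno_command(gvanno_dir, query_vcf, output_dir, sample_id,
                         genome_assembly="grch37", container="docker",
                         force_overwrite=False):
    """Return the gvanno.py command line as a list of strings."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError("genome_assembly must be grch37 or grch38")
    if container not in ("docker", "singularity"):
        raise ValueError("container must be docker or singularity")
    gvanno_dir = Path(gvanno_dir).expanduser()
    command = [
        sys.executable, str(gvanno_dir / "gvanno.py"),  # assumed script location
        "--query_vcf", str(Path(query_vcf).expanduser()),
        "--gvanno_dir", str(gvanno_dir),
        "--output_dir", str(Path(output_dir).expanduser()),
        "--genome_assembly", genome_assembly,
        "--container", container,
        "--sample_id", sample_id,
    ]
    if force_overwrite:
        command.append("--force_overwrite")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.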

The examples folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

python ~/gvanno-dev/gvanno.py \
  --query_vcf ~/gvanno-dev/examples/example.grch37.vcf.gz \
  --gvanno_dir ~/gvanno-dev \
  --output_dir ~/gvanno-dev \
  --sample_id example \
  --genome_assembly grch37 \
  --container docker \
  --force_overwrite

This command will run the Docker-based gvanno workflow and produce the following output files in the output directory (~/gvanno-dev in this example):

example_gvanno_pass_grch37.vcf.gz (.tbi) - Bgzipped VCF file with rich set of functional/clinical annotations
example_gvanno_pass_grch37.tsv.gz - Compressed TSV file with rich set of functional/clinical annotations

Similar files are produced for all variants, not only variants with a PASS designation in the VCF FILTER column.
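
The compressed TSV output can be read with the Python standard library. This is a minimal sketch: `read_gvanno_tsv` is a hypothetical helper, and it assumes that any lines starting with `#` ahead of the column header are comments that can be skipped.

```python
import csv
import gzip


def read_gvanno_tsv(tsv_gz_path):
    """Yield one dict per row of a gzip-compressed, tab-separated file."""
    with gzip.open(tsv_gz_path, mode="rt", newline="") as handle:
        # Assumption: lines starting with '#' are comments, not the header row.
        lines = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


if __name__ == "__main__":
    import sys
    for i, row in enumerate(read_gvanno_tsv(sys.argv[1])):
        print(row)
        if i == 4:  # show only the first five records
            break
```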

Documentation

Documentation of the various variant and gene annotations can be found in the header of the annotated VCF file. The column names of the tab-separated values (TSV) file are identical to the INFO tags that are documented in the VCF file.
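
One way to list the INFO tags and their descriptions is to parse the `##INFO` lines of the annotated VCF header. The sketch below uses only the standard library; `info_descriptions` is a hypothetical helper, and it relies on the header following the usual `##INFO=<ID=...,Description="...">` layout of the VCF format.

```python
import gzip
import re

# Matches header lines such as ##INFO=<ID=TAG,Number=.,Type=String,Description="...">
INFO_LINE = re.compile(r'^##INFO=<ID=([^,>]+),.*?Description="((?:[^"\\]|\\.)*)"')


def info_descriptions(vcf_gz_path):
    """Return {INFO tag: description} from the header of a gzip/bgzip-compressed VCF."""
    descriptions = {}
    with gzip.open(vcf_gz_path, mode="rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines are over
            match = INFO_LINE.match(line)
            if match:
                descriptions[match.group(1)] = match.group(2)
    return descriptions


if __name__ == "__main__":
    import sys
    for tag, text in sorted(info_descriptions(sys.argv[1]).items()):
        print(tag, text, sep="\t")
```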

Contact

sigven AT ifi.uio.no
