1.4.2 release
sigven committed May 24, 2021
1 parent 083abbf commit ec027f5
Showing 10 changed files with 202 additions and 109 deletions.
45 changes: 22 additions & 23 deletions README.md
Expand Up @@ -15,30 +15,29 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.

### News
* April 22nd 2021 - **dev update**
* Data updates (ClinVar, UniProt, GWAS Catalog, dbNSFP, Pfam, Open Targets Platform)
* Software update (VEP 103)
* May 24th 2021 - **1.4.2 release**
* Software update (VEP 104)
* Data updates: ClinVar, GWAS catalog, CancerMine, Pfam, dbNSFP, UniProt
* Two new options added:
* `--vep_regulatory` - annotates variants for overlap with regulatory regions
* `--vep_regulatory` - annotates variants for overlap with regulatory regions (details below)
* `--docker-uid` - set Docker user id
* December 7th 2020 - **1.4.1 release**
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
* Software update (VEP 102)
* Skipped DisGenet annotations (Open Targets serve similar purpose)
* New variant annotations for enhanced non-coding interpretation:
* _REGULATORY_ANNOTATION_ : A comma-separated list of regulatory annotations from VEP's `--regulatory` option, i.e. __TF_binding_site__, __enhancer/promoter/open_chromatin__, __CTCF_binding_site__ etc. Included when the `--vep_regulatory` option is turned on in gvanno.
* _NCER_PERCENTILE__: A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/).

### Annotation resources

* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v103 (GENCODE v37/v19 as the gene reference dataset)
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v104 (GENCODE v38/v19 as the gene reference dataset)
* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2021)
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 34)
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (May 2021)
* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 35)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_02, February 2021)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_02, April 2021)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v34.0, March 2021)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (April 12th 2021)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 19th 2021)


### Getting started
Expand Down Expand Up @@ -72,15 +71,15 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t

#### STEP 2: Download *gvanno* and data bundle

1. Clone the latest version in development
1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.2) (gvanno run script, v1.4.2)
2. Download and unpack the latest assembly-specific data bundle in the gvanno directory
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210422.tgz) (approx 18Gb)
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210422.tgz) (approx 20Gb)
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210523.tgz) (approx 19Gb)
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210523.tgz) (approx 20Gb)
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
3. Pull the [gvanno Docker image (dev)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
* `docker pull sigven/gvanno:dev` (gvanno annotation engine)
3. Pull the [gvanno Docker image (1.4.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
* `docker pull sigven/gvanno:1.4.2` (gvanno annotation engine)

#### STEP 3: Input preprocessing

Expand Down Expand Up @@ -126,7 +125,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
Number of forks for Variant Effect Predictor (VEP) processing, default: 4
--vep_buffer_size VEP_BUFFER_SIZE
Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
- set lower to reduce memory usage, default: 5000
- set lower to reduce memory usage, higher to increase speed, default: 500
--vep_pick_order VEP_PICK_ORDER
Comma-separated string of ordered transcript properties for primary variant pick in
Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
Expand All @@ -145,10 +144,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

python ~/gvanno-dev/gvanno.py
--query_vcf ~/gvanno-dev/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-dev
--output_dir ~/gvanno-dev
python ~/gvanno-1.4.2/gvanno.py
--query_vcf ~/gvanno-1.4.2/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.4.2
--output_dir ~/gvanno-1.4.2
--sample_id example
--genome_assembly grch37
--container docker
Expand Down
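The example invocation in the README hunk above can also be assembled programmatically. The following is a minimal sketch, not part of the commit: `build_example_command` is a hypothetical helper, and it uses only the flags visible in this hunk (the hunk is truncated, so the real example may carry further options).

```python
import os
import shlex


def build_example_command(gvanno_dir="~/gvanno-1.4.2"):
    """Assemble the README's example invocation as an argument list.

    Hypothetical helper for illustration; only flags shown in the
    README hunk are included.
    """
    root = os.path.expanduser(gvanno_dir)
    return [
        "python", os.path.join(root, "gvanno.py"),
        "--query_vcf", os.path.join(root, "examples", "example.grch37.vcf.gz"),
        "--gvanno_dir", root,
        "--output_dir", root,
        "--sample_id", "example",
        "--genome_assembly", "grch37",
        "--container", "docker",
    ]


if __name__ == "__main__":
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in build_example_command()))
```

Keeping the command as a list avoids quoting problems if the install path contains spaces; the list can be passed directly to `subprocess.run`.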
14 changes: 14 additions & 0 deletions data-raw/RELEASE_NOTES
@@ -0,0 +1,14 @@
##GVANNO_SOFTWARE_VERSION = 1.4.3
##GVANNO_DB_VERSION = 20210523
pfam = v34.0 (March 2021)
ncER = v1.0 (March 2019)
uniprot = release 2021_02
corum = release 3.0 (20180903)
onekg = phase 3 (20130502)
dbsnp = build 154/153
dbnsfp = v4.2 (March 2021)
gnomad = r2.1 (October 2018)
gwas = May 2021 (20210519)
clinvar = May 2021 (20210501)
opentargets = 2021_02
gencode = 38/19
15 changes: 8 additions & 7 deletions gvanno.py
Expand Up @@ -11,10 +11,10 @@
import platform
from argparse import RawTextHelpFormatter

GVANNO_VERSION = 'dev'
DB_VERSION = 'GVANNO_DB_VERSION = 20210422'
VEP_VERSION = '103'
GENCODE_VERSION = '37'
GVANNO_VERSION = '1.4.2'
DB_VERSION = 'GVANNO_DB_VERSION = 20210523'
VEP_VERSION = '104'
GENCODE_VERSION = '38'
VEP_ASSEMBLY = "GRCh38"
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)

Expand All @@ -41,8 +41,8 @@ def __main__():
optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \
"in Variant Effect Predictor (VEP), default: %(default)s")
optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for Variant Effect Predictor (VEP) processing, default: %(default)s")
optional_vep.add_argument('--vep_buffer_size', default = 5000, help="Variant buffer size (variants read into memory simultaneously) " + \
"for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, default: %(default)s")
optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
"for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
"of ordered transcript properties for primary variant pick in\nVariant Effect Predictor (VEP) processing, default: %(default)s")
optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: %(default)s")
Expand Down Expand Up @@ -384,7 +384,8 @@ def run_gvanno(arg_dict, host_directories):
logger = getlogger("gvanno-summarise")
logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise")
gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \
os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + docker_command_run_end
os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \
" " + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end
check_subprocess(gvanno_summarise_command)
logger.info("Finished")

Expand Down
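The hunk above appends `str(int(arg_dict['vep_regulatory']))` to the `gvanno_summarise.py` command line, so both boolean options travel as the strings `0` or `1`. A minimal sketch of that round trip follows, assuming the receiving parser declares four positionals as in this commit; `build_summarise_args` and `parse_summarise_args` are hypothetical names, and the file paths are placeholders.

```python
import argparse


def build_summarise_args(vcf, db_dir, arg_dict):
    """Caller side: booleans become '0'/'1' strings on the command line."""
    return [
        vcf,
        db_dir,
        str(int(arg_dict["vep_lof_prediction"])),
        str(int(arg_dict["vep_regulatory"])),
    ]


def parse_summarise_args(argv):
    """Callee side: four required positionals, the last two parsed as int."""
    parser = argparse.ArgumentParser()
    parser.add_argument("vcf_file")
    parser.add_argument("gvanno_db_dir")
    parser.add_argument("lof_prediction", type=int)
    parser.add_argument("regulatory_annotation", type=int)
    return parser.parse_args(argv)


# Placeholder inputs: LoF prediction off, regulatory annotation on.
argv = build_summarise_args(
    "query.vcf.gz",
    "data/grch37",
    {"vep_lof_prediction": False, "vep_regulatory": True},
)
args = parse_summarise_args(argv)
```

Because these are plain positionals without `nargs='?'`, argparse treats them as required, so the `default=0` given in the diff is never consulted; the caller must always pass both flags, which the updated command string does.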
Binary file modified src/.DS_Store
Binary file not shown.
4 changes: 2 additions & 2 deletions src/Dockerfile
Expand Up @@ -23,7 +23,7 @@ RUN apt-get update && apt-get -y install \
ENV OPT /opt/vep
ENV OPT_SRC $OPT/src
ENV HTSLIB_DIR $OPT_SRC/htslib
ENV BRANCH release/103
ENV BRANCH release/104

# Working directory
WORKDIR $OPT_SRC
Expand Down Expand Up @@ -65,7 +65,7 @@ RUN if [ "$BRANCH" = "master" ]; \
rm -rf kent-335_base_bak

# Setup bioperl-ext
WORKDIR bioperl-ext/Bio/Ext/Align/
WORKDIR $OPT_SRC/bioperl-ext/Bio/Ext/Align/
RUN perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL

# Install htslib binaries (for 'bgzip' and 'tabix')
Expand Down
2 changes: 1 addition & 1 deletion src/buildDocker.sh
Expand Up @@ -4,5 +4,5 @@ cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
tar czvfh gvanno.tgz gvanno/
echo "Build the Docker Image"
TAG=`date "+%Y%m%d"`
docker build --no-cache -t sigven/gvanno:$TAG --rm=true .
docker build -t sigven/gvanno:$TAG --rm=true .

Binary file modified src/gvanno.tgz
Binary file not shown.
34 changes: 25 additions & 9 deletions src/gvanno/gvanno_summarise.py
Expand Up @@ -18,11 +18,12 @@ def __main__():
parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)')
parser.add_argument('gvanno_db_dir',help='gvanno data directory')
parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)')
parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)')
args = parser.parse_args()

extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction)
extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation)

def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0):
"""
Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from
1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc.
Expand All @@ -40,13 +41,19 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap']
vcf = VCF(query_vcf)
for tag in vcf_infotags_meta:
if lof_prediction == 0:
if lof_prediction == 0 and regulatory_annotation == 0:
if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'):
vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 1 and regulatory_annotation == 0:
if not tag.startswith('REGULATORY_'):
vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
elif lof_prediction == 0 and regulatory_annotation == 1:
if not tag.startswith('LoF'):
vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
else:
vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})


w = Writer(out_vcf, vcf)
current_chrom = None
num_chromosome_records_processed = 0
Expand Down Expand Up @@ -107,11 +114,20 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
num_chromosome_records_processed += 1
gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")

csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
if 'vep_all_csq' in csq_record_results:
rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results['vep_all_csq'])
if 'vep_block' in csq_record_results:
vep_csq_records = csq_record_results['vep_block']
if regulatory_annotation == 1:
csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ')

if 'vep_block' in csq_record_results_all:
vep_csq_records_all = csq_record_results_all['vep_block']
rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all)

csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')

if 'vep_all_csq' in csq_record_results_pick:
rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq'])
if 'vep_block' in csq_record_results_pick:
vep_csq_records = csq_record_results_pick['vep_block']

block_idx = 0
record = vep_csq_records[block_idx]
for k in record:
Expand Down
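The four-way branch that decides which INFO tags reach the VCF header can be read as two independent filters: skip `LoF*` tags unless LoF prediction is on, and skip `REGULATORY_*` tags unless regulatory annotation is on. The sketch below is not code from the commit. It restates the branch as a keep/skip decision and checks it against a single predicate, assuming both flags are strictly 0 or 1; `keep_tag_branches`, `keep_tag`, and the sample tag list are illustrative.

```python
from itertools import product


def keep_tag_branches(tag, lof_prediction, regulatory_annotation):
    """The four-way branch from the diff, reduced to a keep/skip decision."""
    if lof_prediction == 0 and regulatory_annotation == 0:
        return not tag.startswith("LoF") and not tag.startswith("REGULATORY_")
    elif lof_prediction == 1 and regulatory_annotation == 0:
        return not tag.startswith("REGULATORY_")
    elif lof_prediction == 0 and regulatory_annotation == 1:
        return not tag.startswith("LoF")
    return True


def keep_tag(tag, lof_prediction, regulatory_annotation):
    """Equivalent single predicate, assuming both flags are 0 or 1."""
    if tag.startswith("LoF") and lof_prediction == 0:
        return False
    if tag.startswith("REGULATORY_") and regulatory_annotation == 0:
        return False
    return True


# Illustrative tag names; the real set comes from vcf_infotags_meta.
tags = ["LoF", "LoF_filter", "REGULATORY_ANNOTATION", "NCER_PERCENTILE"]
for tag, lof, reg in product(tags, (0, 1), (0, 1)):
    assert keep_tag(tag, lof, reg) == keep_tag_branches(tag, lof, reg)
```

The two forms differ only for flag values outside 0/1, where the original falls through to its final `else` and keeps every tag. With a predicate like this, the repeated `vcf.add_info_to_header(...)` call would need to appear only once.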