A collection of pipelines for tasks associated with VEP annotation of variations. The code contained within this repository is designed to run on the EBI's compute farm using the EnsEMBL Hive pipeline manager.
See the README files in each folder for details of the individual pipelines.
- Generates databases of SIFT and PolyPhen scores and predictions for the MOD species.
- Databases are accessed by a VEP plugin to add SIFT/PolyPhen annotation to the VEP output.
- Translated protein sequences are constructed from FASTA, GFF, and (optionally) BAM files.
- Stores a serialized prediction matrix for each sequence, covering all possible amino acid substitutions and keyed by the hex MD5 digest of the sequence.
- FULL mode generates the database from scratch.
- UPDATE mode skips sequences that already have a prediction matrix, or for which there was a valid reason that no protein function annotation could be generated, and updates the database for new sequences only.
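
The MD5 keying and the FULL/UPDATE distinction described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names, the `existing_keys` and `failed_keys` collections, and the assumption that the sequence is hashed as a plain ASCII string with no normalization are all hypothetical.

```python
import hashlib


def sequence_key(protein_seq: str) -> str:
    """Return the hex MD5 digest of a protein sequence (assumed lookup key)."""
    return hashlib.md5(protein_seq.encode("ascii")).hexdigest()


def sequences_to_process(sequences, existing_keys, failed_keys, mode="UPDATE"):
    """Pick the sequences that need new prediction matrices.

    FULL:   every sequence is processed (database built from scratch).
    UPDATE: sequences whose key already has a matrix, or is recorded as a
            known valid failure, are skipped.
    """
    if mode == "FULL":
        return list(sequences)
    if mode != "UPDATE":
        raise ValueError(f"unknown mode: {mode}")
    known = set(existing_keys) | set(failed_keys)
    return [seq for seq in sequences if sequence_key(seq) not in known]
```

Keying on the digest rather than a transcript or protein ID means identical sequences share one matrix, and a sequence that changes between releases is automatically treated as new.
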
- Runs VEP on a human variant VCF file (obtained from the EnsEMBL FTP site).
- Splits input files, runs VEP in parallel, then combines the output.
- Uses the EnsEMBL merged (EnsEMBL & RefSeq) cached database to retrieve VEP annotations.
- Runs VEP on MOD high-throughput variation VCF files.
- Splits input files, runs VEP in parallel, then combines the output.
- Uses MOD GFF, FASTA, and (optionally) BAM files to construct translated protein sequences.
- Retrieves SIFT and PolyPhen annotations from databases generated by the VepProteinFunction pipeline.
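
The split / run in parallel / combine pattern shared by both VEP pipelines can be sketched as below. This is only an illustration under stated assumptions: the real pipelines run as EnsEMBL Hive jobs on the compute farm, whereas this sketch uses a local thread pool, and `annotate_chunk` is a hypothetical stand-in that tags each record instead of invoking VEP (its output is not valid VEP output).

```python
from concurrent.futures import ThreadPoolExecutor


def split_vcf(lines, chunk_size):
    """Separate header lines from records and cut the records into chunks."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    return header, chunks


def annotate_chunk(chunk):
    """Hypothetical placeholder for a VEP run on one chunk of records."""
    return [record + "\tANNOTATED" for record in chunk]


def run_pipeline(lines, chunk_size=2, workers=4):
    """Split the input, annotate chunks in parallel, then combine in order."""
    header, chunks = split_vcf(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the combined
        # output keeps the original record order.
        results = list(pool.map(annotate_chunk, chunks))
    combined = list(header)
    for annotated in results:
        combined.extend(annotated)
    return combined
```

For example, `run_pipeline(vcf_lines, chunk_size=1000)` would process the records in batches of 1000 and return the header followed by the annotated records in their original order.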