This repository contains a summary of the work done during my traineeship at the Sanger Institute, UK.
The aim of this project is to add new features to the NCBoost annotation pipeline. NCBoost is a machine learning model (gradient tree boosting) designed to assign a pathogenicity score to non-coding single nucleotide variants (SNVs) in humans. The authors trained the model on experimentally validated non-coding pathogenic variants, and its predictions are based on features from the following categories:
- Interspecies conservation
- Recent and ongoing selection signals in humans
- Gene-based features
- Sequence context
- Epigenetic features
In this context, during my project, I developed modified versions of the pipeline to:
- annotate and score INDELs
- use chromatin-chromatin interaction data during the gene assignment step
- visualize the scores in genome browser interfaces
- fix bugs
On the farm, NCBoost is located at /nfs/team151/fz3/NCBoost, and all the scripts are in NCBoost/ncboost_scripts.
All the commands described below can be run from the root NCBoost folder.
Some files are only symlinked in the NCBoost root directory; their physical locations are:
- NCBoost_features -> /lustre/scratch119/humgen/teams/soranzo/users/fz3/NCBoost/NCBoost_features
- Python libraries -> /nfs/team151/fz3/pyPack
- R libraries -> /nfs/team151/fz3/RPack
- reference genome -> /lustre/scratch115/teams/soranzo/projects/MS_GWAS_txt_tables/reference_files/Homo_sapiens.GRCh37.dna.fa
A functioning NCBoost installation is required on your machine to use these pipelines; learn how to set it up here.
Then clone this repository into the NCBoost root folder and copy all the scripts in NCtools/NCtools_scripts/ into NCBoost_scripts/. Scripts that already exist in NCBoost_scripts/ will be replaced with their bug-fixed versions.
The additional dependencies are samtools and the human reference genome sequence, which you can download from the Ensembl FTP server.
Run this script for the standard annotation and scoring pipeline provided by NCBoost (bug-fixed version).
./NCBoost_scripts/ /path/to/inF.tsv /path/to/outF.tsv
Run this script to annotate and score INDEL variants using representative SNVs.
The additional argument is the reference genome; for more detail see here.
If you are working on the farm, you will find a reference genome in /NCBoost/ref_genome
./NCBoost_scripts/ /path/to/inF.tsv /path/to/outF.tsv \
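The INDEL pipeline is meant for variants whose reference and alternate alleles differ in length, while the standard pipeline handles SNVs. As a minimal sketch (not part of NCtools), the Python snippet below splits an input table into SNVs and INDELs so that each set can be sent to the matching script. It assumes a tab-separated file with chr, pos, ref and alt in the first four columns and `-` marking an empty allele; `variant_type` and `split_variants` are hypothetical helper names.

```python
def variant_type(ref: str, alt: str) -> str:
    """Classify a variant as 'SNV', 'INDEL' or 'OTHER' from its alleles."""
    # Assumption: '-' denotes an empty allele (ANNOVAR-style notation).
    ref = ref.upper().replace("-", "")
    alt = alt.upper().replace("-", "")
    if len(ref) != len(alt):
        return "INDEL"
    if len(ref) == 1 and ref != alt and ref in "ACGT" and alt in "ACGT":
        return "SNV"
    return "OTHER"  # e.g. multi-nucleotide substitutions or ref == alt


def split_variants(lines):
    """Split tab-separated 'chr pos ref alt' rows into SNV and INDEL lists."""
    snvs, indels = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, comments, and header rows (non-numeric position).
        if len(fields) < 4 or line.startswith("#") or not fields[1].isdigit():
            continue
        chrom, pos, ref, alt = fields[:4]
        kind = variant_type(ref, alt)
        if kind == "SNV":
            snvs.append((chrom, int(pos), ref, alt))
        elif kind == "INDEL":
            indels.append((chrom, int(pos), ref, alt))
    return snvs, indels


if __name__ == "__main__":
    rows = ["chr\tpos\tref\talt", "1\t12589\tG\tA", "1\t13000\tAT\tA"]
    snvs, indels = split_variants(rows)
    print(len(snvs), "SNV(s),", len(indels), "INDEL(s)")
```

The two lists can then be written to separate files and passed to the standard and INDEL scripts respectively.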
Run this script to use chromatin-chromatin interaction data (PCHiC) as the variant-to-gene assignment method.
The additional arguments are the interaction table and the interaction threshold; for more detail see here.
./NCBoost_scripts/ /pathto/inF.tsv /pathto/outF.tsv \
/PCHIC_data.txt \
If you are working on the farm, you will find PCHiC interaction data in /NCBoost/PCHIC_data
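To illustrate the idea of interaction-based gene assignment, here is a minimal Python sketch, not the NCtools implementation. It assumes a simplified interaction record (gene at the baited promoter, coordinates of the interacting fragment, interaction score) and assigns a variant to every gene whose interacting fragment overlaps it with a score at or above the threshold. The `Interaction` layout and `genes_for_variant` are hypothetical; the real PCHiC table may be organized differently.

```python
from typing import Iterable, List, NamedTuple


class Interaction(NamedTuple):
    gene: str     # gene at the baited promoter (assumed column)
    chrom: str    # chromosome of the interacting fragment
    start: int    # fragment start, 1-based inclusive
    end: int      # fragment end, 1-based inclusive
    score: float  # interaction score


def genes_for_variant(chrom: str, pos: int,
                      interactions: Iterable[Interaction],
                      threshold: float) -> List[str]:
    """Return genes linked to the variant, strongest interaction first."""
    best = {}
    for it in interactions:
        overlaps = it.chrom == chrom and it.start <= pos <= it.end
        if overlaps and it.score >= threshold:
            # Keep the highest score seen for each gene.
            best[it.gene] = max(it.score, best.get(it.gene, it.score))
    return sorted(best, key=lambda g: (-best[g], g))


if __name__ == "__main__":
    table = [
        Interaction("GENE_A", "7", 1000, 2000, 7.2),
        Interaction("GENE_B", "7", 1500, 2500, 4.1),
        Interaction("GENE_C", "7", 1800, 2200, 9.0),
    ]
    print(genes_for_variant("7", 1900, table, threshold=5.0))
```

Raising the threshold keeps only the stronger interactions, so fewer variants receive an interaction-based gene assignment.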
Run this script to generate a .wig file that displays the pathogenicity scores of a region of interest.
The first argument defines the genomic region (e.g. 13:32315086-32400266), the second is the reference genome, and the third (optional) can be used to assign a name to the track.
An example of the resulting track is given below (intron region of gene IKZF1).
./NCBoost_scripts/ chr:start-end reference.fa sample
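For readers unfamiliar with the output format, the following Python sketch shows how a region string can be parsed and how per-position scores map onto a variableStep WIG track. It is an illustration under stated assumptions, not the script above: `parse_region` and `wig_lines` are hypothetical helpers, the scores are made-up values, and the chromosome name is written exactly as given (some browsers expect a `chr` prefix).

```python
import re


def parse_region(region: str):
    """Parse 'chrom:start-end' (optional 'chr' prefix) into (chrom, start, end)."""
    match = re.fullmatch(r"(?:chr)?(\w+):(\d+)-(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"invalid region: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return chrom, start, end


def wig_lines(chrom: str, scores, name: str = "NCBoost"):
    """Yield the lines of a variableStep WIG track from (position, score) pairs."""
    yield f'track type=wiggle_0 name="{name}"'
    yield f"variableStep chrom={chrom}"
    for pos, score in sorted(scores):  # WIG positions are 1-based and ascending
        yield f"{pos} {score:.4f}"


if __name__ == "__main__":
    chrom, start, end = parse_region("13:32315086-32400266")
    # Made-up scores for two positions inside the region.
    demo_scores = [(start + 1, 0.5), (start, 0.25)]
    print("\n".join(wig_lines(chrom, demo_scores, name="sample")))
```

The resulting text can be saved with a .wig extension and loaded as a custom track in a genome browser.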
The original NCBoost1.0 scripts and files were downloaded on 11 April 2019, and my project took place over the following 4 months.
- NCBoost1.0
- Homo sapiens reference genome GRCh37 release 87 from Ensembl
- ANNOVAR latest version (2018-Apr-16)
- For NCBoost library versions and dependencies, have a look here
1: Caron et al. (2019). NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome Biology, 20(1), 32.
2: Javierre et al. (2016). Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell, 167(5), 1369-1384.e19.
Please address comments and questions about NCtools to [email protected]
NCtools scripts are available under the Apache License 2.0.
Read more about it here