Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
This uses the CPU version of SIFT4G.
- Build Docker image
git clone https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB.git
cd SIFT4G_Create_Genomic_DB
docker build -t sift4g_db .
- Run an interactive container
docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash
Make sure to mount directories that contain:
- your protein database
- the directory containing SIFT4G_Create_Genomic_DB
- your gene and genome files
For example, the full path of my SIFT4G_Create_Genomic_DB
directory is /home/pauline/SIFT4G_Create_Genomic_DB
and my protein database is in /bigdrive/SIFT_databases/uniprot_sprot.fasta
so my command is:
docker run -it --user $(id -u):$(id -g) -v /home/pauline:/home/pauline -v /bigdrive:/bigdrive sift4g_db /bin/bash
- Test the Docker installation
3a. C. ruddii example
C. ruddii is a small genome and can quickly test if everything is working. The genome and gene files are automatically downloaded from Ensembl.
-
Set variables in configuration file.
cd <your_dir>/SIFT4G_Create_Genomic_DB/test_files/candidatus_carsonella_ruddii_pv_config.txt
Edit the config file to set <PARENT_DIR>, <PROTEIN_DB>
(See config details) . Be sure to use full pathsmkdir <PARENT_DIR>
-
Create the database inside the docker container:
# Make an interactive session of the docker container docker run -it --user $(id -u):$(id -g) -v /home/pauline:/home/pauline -v /bigdrive:/bigdrive sift4g_db /bin/bash # Run the code inside the container perl make-SIFT-db-all.pl -config <your_dir>/SIFT4G_Create_Genomic_DB/test_files/candidatus_carsonella_ruddii_pv_config.txt --ensembl_download
It takes ~30 minutes for this database to be generated in <PARENT_DIR>/<ORG_VERSION>.
-
The database should be in a folder named something like: candidatus_carsonella_ruddii_pv/ASM1036v1.34
This example uses local files to build a database of human chr21 and mitochondrial genes. Do this exercise if you are building a SIFT database with a genome that is on your local computer.
Files are already provided in /SIFT4G_Create_Genomic_DB/test_files/homo_sapiens_small
- Genomic DNA (.fa.gz)
- Gene annotation (.gtf.gz)
- dbSNP annotations (.vcf.gz)
-
Set the variables in the config file.
cd <SIFT4G_Create_Genomic_DB directory>/test_files/
Set variables in the config file homo_sapiens-test.txt: <PARENT_DIR> and <PROTEIN_DB>
Note that <PARENT_DIR> should be set to the full path of <SIFT4G_Create_Genomic_DB directory>/test_files/homo_sapiens_small
SIFT scripts will look for the genome and gene annotation files in that folder (which are provided in this example). -
Create the database:
# Make an interactive session of the docker container docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash # Run the code to make the database inside the container perl make-SIFT-db-all.pl -config <SIFT4G_Create_Genomic_DB directory>/test_files/homo_sapiens-test.txt
It takes ~2 hours for human chr21 and mitochondria predictions to be generated in <PARENT_DIR>/<ORG_VERSION>.
Please go to Creating your own database to read how to make your own database
- You can also run the SIFT4G Annotator inside the container
docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash
java -jar <Path to SIFT4G_Annotator> -c -i <Path to input vcf file> -d <Path to SIFT4G database directory> -r <Path to your results folder> -t
Follow these instructions if you are NOT using Docker.
- SIFT 4G Algorithm
- Perl
*DBI
*Bioperl for running DB::Fasta
*LWP
*Switch.pm (sudo apt-get install libswitch-perl
) - Python3 (which is invoked with
python
)
git clone https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB.git scripts_to_build_SIFT_db
C. ruddii is a small genome and can quickly test if everything is working. The genome and gene files are automatically downloaded from Ensembl.
-
Set variables in configuration file.
cd test_files/candidatus_carsonella_ruddii_pv_config.txt
Edit the config file to set <PARENT_DIR>, <SIFT4G_PATH>, <PROTEIN_DB>
(See config details) . Be sure to use full pathsmkdir <PARENT_DIR>
Go back to the scripts_to_build_SIFT_db directory
cd ..
-
Make the database:
perl make-SIFT-db-all.pl -config test_files/candidatus_carsonella_ruddii_pv_config.txt --ensembl_download
It takes ~30 minutes for this database to be generated in <PARENT_DIR>/<ORG_VERSION>.
-
The database should be in a folder named something like: candidatus_carsonella_ruddii_pv/ASM1036v1.34
This example uses local files to build a database of human chr21 and mitochondrial genes. Do this exercise if you are building a SIFT database with a genome that is on your local computer.
Files are already provided in scripts_to_build_SIFT_db/test_files/homo_sapiens_small
- Genomic DNA (.fa.gz)
- Gene annotation (.gtf.gz)
- dbSNP annotations (.vcf.gz)
-
Set the variables in the config file.
cd scripts_to_build_SIFT_db/test_files/
Set variables in the config file homo_sapiens-test.txt: <SIFT4G_PATH> and <PROTEIN_DB>
Note that <PARENT_DIR> is already set to ./test_files/homo_sapiens_small
SIFT scripts will look for the genome and gene annotation files in that folder (which are provided in this example). -
Make the database:
cd ..
perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt
It takes ~2 hours for human chr21 and mitochondria predictions to be generated in <PARENT_DIR>/<ORG_VERSION>.
perl make-SIFT-db-all.pl -config <config_file> [--ensembl_download]
Directions for making a SIFT database from:
- Ensembl (easiest)
- local genomic and gene annotation file (.gtf)
- local genomic and gene annotation file (.gff)
This will download genome and gene files directly from Ensembl with the option --ensembl_download
-
Set parameters in the config file.
Use test_files/saccharomyces_cerevisiae-template.txt as a template.
a. Set weblinks to Ensembl genome and gene annotation files: GENE_DOWNLOAD_SITE, PEP_FILE, CHR_DOWNLOAD_SITE
Optional: DBSNP_ORGANISM_DOWNLOAD_SITE
Config file details
b. Set output paths: PARENT_DIR, ORG, ORG_VERSION
c. Set genetic codes in GENETIC_CODE_TABLE, MITO_GENETIC_CODE_TABLE
If you're working with a vertebrate, it's
GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial
d. Set SIFT4G paths: SIFT4G_PATH, PROTEIN_DB
If using Docker, SIFT4G_PATH=SIFT4G_PATH=/sift4g/bin/sift4g
-
Create the database:
perl make-SIFT-db-all.pl -config <config file> --ensembl_download
-
Create a config file.
a. Use test_files/homo_sapiens-test.txt as a template
cd test_files/
cp homo_sapiens-test.txt <my_org_config.txt>
b. In <my_org_config.txt>, set <PARENT_DIR>, <ORG>, <ORG_VERSION>, <SIFT4G_PATH>, <PROTEIN_DB>
Check GENETIC_CODE_TABLE and MITO_GENETIC_CODE_TABLE is set correctly.
Optional to set: <DBSNP_VCF_FILE> If using Docker,SIFT4G_PATH=SIFT4G_PATH=/sift4g/bin/sift4g
-
Put the genomic fasta files and the gene annotation files in their proper place:
a. Make the folders:
mkdir <PARENT_DIR>
mkdir <PARENT_DIR>/gene-annotation-src
mkdir <PARENT_DIR>/chr-src
mkdir <PARENT_DIR>/dbSNP
b. Move your files into the appropriate folders:
Put compressed genomic fasta files (.fa.gz) in <PARENT_DIR>/chr-src
mv *.fa.gz <PARENT_DIR>/chr-src
Put compressed gene annotation file (gtf.gz) in <PARENT_DIR>/gene-annotation-src
mv *.gtf.gz <PARENT_DIR>/gene-annotation-src
Optional: Put compressed dbSNP VCF file in <PARENT_DIR>/dbSNP
mv *.vcf.gz <PARENT_DIR>/dbSNP
Optional: Put compressed protein file (.pep.all.fa.gz) in <PARENT_DIR>/gene-annotation-src
mv *.pep.all.fa.gz <PARENT_DIR>/gene-annotation-src
This file is used for checking.
Example of the file structure can be found in test_files/homo_sapiens_small
-
Run command:
perl make-SIFT-db-all.pl -config <config_file>
WAIT...
This can take 1-24 hours, depending on the size of the genome. Ignore warnings. It's done when the computer shows:
All done!
Use this if you have a gff file (like that supplied from Phytozyme)
-
Download and install gffread
-
Convert the gene annotation .gff file to a .gtf file (because SIFT processes gtf files).
mv <gff3.gz> <PARENT_DIR>/gene-annotation-src
gunzip <*.gff3.gz>
gffread <*.gff3> -T -o [FILENAME].gene.gtf
gzip [FILENAME].gene.gtf
- Then follow instructions for building a database using a gtf file
-
Download the SIFT 4G Annotator (a Java executable) here
-
Commandline:
java -jar <Path to SIFT4G_Annotator> -c -i <Path to input vcf file> -d <Path to SIFT4G database directory> -r <Path to your results folder> -t
Note: If your database files start with a 'chr', you must rename them without the 'chr' for the annotator to work.
Example: mv chr19.gz -> 19.gz; mv chr19.regions 19.regions; mv chr19_SIFTDB_stats.txt 19_SIFTDB_stats.txt
Complete instructions here
The database is stored in <PARENT_DIR>/<ORG_VERSION> which was set in the config file.
cd <PARENT_DIR>/<ORG_VERSION>
A SIFT database is made for each chromosome in the file <chr>.gz The SIFT database structure is described here
zcat <chr>.gz | more # does it look all right?
zcat <chr>.gz | grep CDS | more
SIFT numeric scores will be in columns 10-12. If too many rows say "NA", that's a problem
Global stats are available:
more <PARENT_DIR>/<ORG_VERSION>/CHECK_GENES.LOG
The file CHECK_GENES.LOG contains a summary of SIFT predictions by chromosome. There are 4 columns:
- Chr
- Genes with SIFT Scores
- Pos with SIFT scores
- Pos with Confident Scores
The last line summarizes predictions for the entire genome:
ALL # (#/#) # (#/#) # (#/#)
Your database is done if the percentages are high for the last 3 different columns. Woohoo!
Parameter | Description |
---|---|
SIFT4G_PATH | Path to the executable sift4g (installed in step 1 of Requirements). If running with Docker, it's /sift4g/bin/sift4g |
PROTEIN_DB | Path to the protein database (.fa or .fasta) Recommend UniProt 90 |
PARENT_DIR | Path where all the output will be generated. User must have write access. |
ORG | Organism name, e.g. homo_sapiens, rat, etc. Should be one word and no special characters. |
ORG_VERSION | The final prediction database will be in <PARENT_DIR>/<ORG_VERSION> |
GENE_DOWNLOAD_SITE (Ensembl only) | URL to download gene annotation (gtf) file |
PEP_FILE (Ensembl only) | URL to download the protein sequence file. This file is optional and used for quality control. |
CHR_DOWNLOAD_SITE (Ensembl only) | URL to download the genome sequence (.fa) |
DBSNP_ORGANISM_DOWNLOAD_SITE (Ensembl only) | URL folder to download dbSNP annotations (optional) |
DBSNP_VCF_FILE (optional) | A *.vcf.gz file that will be used to annotate variants with dbSNP rs id's |
GENETIC_CODE_TABLE | Genetic code to be used to translate the DNA sequence into proteins (integer value based on NCBI)* |
GENETIC_CODE_TABLENAME (not used) | String to remind user what GENETIC_CODE_TABLE is being used |
MITO_GENETIC_CODE_TABLE | Fasta sequences named Mt, chrM, or Mito will use this genetic code. (This is for mitochondrial sequences). To disable or edit this feature, edit the function chr_is_mito in dna_protein_subs.pl |
MITO_GENETIC_CODE_TABLENAME (not used) | String to remind user what MITO_GENETIC_CODE_TABLE is being used |
PLASTID_GENETIC_CODE_TABLE | Fasta sequences named Pt, PT, chrPt, chrPT, or chloroplast will use this genetic code. (This is for chlorplasts). To disable or edit this feature, edit the function chr_is_plastid in dna_protein_subs.pl |
Because the database can take hours/days to complete, here is what to look for:
If the Terminal says .... | Check the following: |
---|---|
making single records file |
ls -lt <PARENT_DIR>/singleRecords/* is being updated |
make the fasta sequences |
`ls fasta/* |
*processing database | SIFT 4G Algorithm is running |
populating databases |
ls SIFT_predictions/* Inspect the SIFT prediction files |
ls -lt <PARENT_DIR>/singleRecords_with_scores/* is being updated |