This is a set of scripts that create a Synteruptor database from a set of genome files.
We recommend running the Docker image, as it is the most straightforward way to run this program.
See the README file in the docker/ folder for steps to build an image.
If you want to install it on your own machine, here is the list of requirements.
Assuming Ubuntu 22.04:
git parallel rename bioperl libstatistics-basic-perl ncbi-blast+ sqlite3
The programs and libraries required to run the scripts are as follows:
- Perl
- BioPerl
  Hint for local installs if you don't have it in a package: use cpanminus
  curl -L http://cpanmin.us | perl - App::cpanminus
  curl -L http://cpanmin.us | perl - Bio::Perl
- Statistics::Basic (libstatistics-basic-perl)
- GNU parallel (parallel)
- Blast+ (ncbi-blast+)
- Python 3
- Python 3 libs (either from a package or with a manager like pip):
  sqlite3
Add the path to the src/ folder to your PATH.
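Before running anything, it can help to confirm that the required programs are actually reachable from your PATH. The following is a minimal sketch, not part of Synteruptor: the executable names in `REQUIRED_TOOLS` are assumptions derived from the requirement list above (`blastp` and `makeblastdb` ship with Blast+), and `find_missing` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions based on the requirement list above;
# adjust them to match what your installation provides.
REQUIRED_TOOLS = ["perl", "parallel", "blastp", "makeblastdb", "python3", "run_gbk.sh"]


def find_missing(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If `run_gbk.sh` is reported as missing, the src/ folder has probably not been added to PATH in the current shell.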
Synteruptor requires each genome assembly to be in its own file, containing both the DNA sequence and the annotations.
File formats supported are:
- GenBank (.gb*)
- EMBL (.dat, .embl, .txt)
Make sure you have at least 2 genome files to compare.
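As a rough illustration of the input requirements above, this sketch lists the files in a folder that look like supported genome files and checks that there are at least 2 of them. It only looks at file extensions, not at file contents. Reading ".gb*" as "any extension starting with .gb" is an assumption, and both function names are hypothetical.

```python
from pathlib import Path

# Extensions taken from the list above. ".gb*" is interpreted here as any
# suffix starting with ".gb" (e.g. .gb, .gbk, .gbff), which is an assumption.
EMBL_SUFFIXES = {".dat", ".embl", ".txt"}


def list_genome_files(folder):
    """Return the files in `folder` whose extension matches a supported format."""
    found = []
    for path in sorted(Path(folder).iterdir()):
        suffix = path.suffix.lower()
        if path.is_file() and (suffix.startswith(".gb") or suffix in EMBL_SUFFIXES):
            found.append(path)
    return found


def check_input_folder(folder):
    """Raise ValueError if fewer than 2 candidate genome files are present."""
    files = list_genome_files(folder)
    if len(files) < 2:
        raise ValueError(
            f"Need at least 2 genome files, found {len(files)} in {folder}"
        )
    return files
```

Run this check before the pipeline rather than after, since other files written to the same folder could share one of these extensions.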
- Move to a working directory
- Place genomic files in a subfolder. NB: intermediate files will be created in that subfolder.
- Run the script run_gbk.sh
run_gbk.sh -i /path/to/subfolder -n db_name
You can run this script with multiple threads by adding the parameter -j. E.g. to run on 4 cores:
run_gbk.sh -i /path/to/subfolder -n db_name -j 4
- It will then create a database named db_name.sqlite in the subfolder, as well as a Blast DB db_name.sqlite.faa
- Place the DB in the db folder of the Web Synteruptor to explore its data. Make sure the web server has permission to read the db folder and the database file.
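If you want to drive the steps above from a Python script, one possible sketch is shown below. It only uses the `-i`, `-n` and `-j` options described above and the output file names mentioned in the steps. The helper names `build_run_command` and `expected_outputs` are hypothetical and not part of Synteruptor.

```python
from pathlib import Path


def build_run_command(folder, db_name, jobs=None):
    """Build the run_gbk.sh argument list using the -i, -n and -j options."""
    command = ["run_gbk.sh", "-i", str(folder), "-n", db_name]
    if jobs is not None:
        # -j sets the number of parallel jobs, as in the 4-core example above.
        command += ["-j", str(jobs)]
    return command


def expected_outputs(folder, db_name):
    """Return the paths of the SQLite database and Blast DB file described above."""
    database = Path(folder) / f"{db_name}.sqlite"
    blast_db = Path(folder) / f"{db_name}.sqlite.faa"
    return database, blast_db
```

The returned list can be passed to `subprocess.run(command, check=True)`. Once the run has finished, checking that both paths from `expected_outputs` exist is a quick way to confirm it completed before copying the database to the web server.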
As an example, download the gbff files for 3 genomes from NCBI:
- Streptococcus anginosus C1051 (GCF_000463465.1)
- Streptococcus anginosus C238 (GCF_000463505.1)
- Streptococcus anginosus subsp. whileyi MAS624 (GCA_000478925.1)
You can do this through this link.
Unzip the file and move the .gbff files into a subfolder. Make sure to rename each file to something meaningful.
Then run the script on this subfolder as above (assuming 4 cores):
run_gbk.sh -i /path/to/subfolder -n db_name -j 4
The database contains the following tables at initiation:
- genes
- genomes
- genome_parts (separate DNA sequences)
The data in those tables come from the input files.
When running run_migenis.sh the following tables are populated:
- orthos (ortholog and paralog gene pairs, from BLAST BRH)
- info (metadata)
- pairs (ortholog and paralog pairs)
- blocks (synteny blocks)
- breaks (synteny breaks)
- breaks_genes (all genes in a break, per species)
- breaks_ranking
- breaks_graph (to represent similar breaks among a group of species)
- goc
Additional tables and views created to ease the queries:
- orthos_all (joined pairs with genes species1 and genes species2)
- blocks_all (joined block, orthos pairs at start of block, end of block + genes at start and end in both species)
- breaks_all (joined break, blocks left and right, orthos pairs left and right, genes at start and end in both species)
The final breaks data are stored in breaks (breaks_all for more data) and breaks_ranking (contains the various attributes used for ranking).
You can use the sqlite3 .schema command to see the details of each table.
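The same inspection can be done from Python with the standard `sqlite3` module. The sketch below lists every table and view and reads their column names directly from the database, so it makes no assumption about the actual columns. The function names are hypothetical helpers, not part of Synteruptor.

```python
import sqlite3


def list_tables_and_views(connection):
    """Return (type, name) for every table and view in the database."""
    rows = connection.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY type, name"
    )
    return rows.fetchall()


def column_names(connection, table):
    """Return the column names of a table or view without assuming its schema."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = connection.execute(f"SELECT * FROM {quoted} LIMIT 0")
    return [description[0] for description in cursor.description]
```

For example, after `connection = sqlite3.connect("/path/to/subfolder/db_name.sqlite")`, calling `column_names(connection, "breaks_ranking")` shows which ranking attributes are available before you write queries against that table.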