SIGA.py is a command-line tool written in Python to generate Semantically Interoperable Genome Annotations from text files in the Generic Feature Format (GFF) according to the Resource Description Framework (RDF) specification.
- Input:
- one or more files in the GFF format (version 2 or 3)
- config.ini file with ontology mappings and feature type amendments (if applicable)
- Output: genomic features stored in a SQLite database or serialized in one of the RDF formats (turtle, nt, n3 or xml)
- check referential integrity for parent-child feature relationships in SQLite
- controlled vocabularies and ontologies used:
Requirements
- Python (>=2.7)
- docopt (0.6.2)
- RDFLib (4.2.2)
- gffutils (https://github.com/arnikz/gffutils)
- optional: an RDF store for querying the ingested data with SPARQL (e.g. Virtuoso or Berkeley DB)
Installation
git clone https://github.com/candYgene/siga.git
cd siga
virtualenv .sigaenv
source .sigaenv/bin/activate
pip install -r requirements.txt
Command-line interface
Usage:
SIGA.py -h|--help
SIGA.py -v|--version
SIGA.py db [-ruV] [-d DB_FILE | -e DB_FILEXT] GFF_FILE...
SIGA.py rdf [-V] [-o FORMAT] [-c CFG_FILE] DB_FILE...
Arguments:
GFF_FILE... Input file(s) in GFF version 2 or 3.
DB_FILE... Input database file(s) in SQLite.
Options:
-h, --help
-v, --version
-V, --verbose Show verbose output in debug mode.
-c CFG_FILE Set the path of the config file [default: config.ini].
-d DB_FILE Create a database from GFF file(s).
-e DB_FILEXT Set the database file extension [default: .db].
-r Check the referential integrity of the database(s).
-u Generate unique IDs for duplicated features.
-o FORMAT Output RDF graph in one of the following formats:
turtle (.ttl) [default: turtle]
nt (.nt),
n3 (.n3),
xml (.rdf)
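The file extensions named under -e and -o suggest how output file names relate to input file names. The following is a minimal sketch of that mapping, assuming the output name is obtained by swapping the extension of the input file. The function names db_path_for and rdf_path_for are hypothetical and are not part of SIGA.py.

```python
import os

# Extensions listed for the -o option above.
RDF_EXTENSIONS = {"turtle": ".ttl", "nt": ".nt", "n3": ".n3", "xml": ".rdf"}


def db_path_for(gff_path, db_ext=".db"):
    """Swap the GFF file extension for the database extension (cf. -e)."""
    base, _ = os.path.splitext(gff_path)
    return base + db_ext


def rdf_path_for(db_path, fmt="turtle"):
    """Swap the database extension for the one of the RDF format (cf. -o)."""
    if fmt not in RDF_EXTENSIONS:
        raise ValueError("unsupported RDF format: %s" % fmt)
    base, _ = os.path.splitext(db_path)
    return base + RDF_EXTENSIONS[fmt]


print(db_path_for("examples/features.gff3"))        # examples/features.db
print(rdf_path_for("examples/features.db"))         # examples/features.ttl
print(rdf_path_for("examples/features.db", "xml"))  # examples/features.rdf
```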
Input files
A small test set is provided in examples/features.gff3, including config.ini. Alternatively, download the tomato or potato genome annotations.
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3
wget http://solanaceae.plantbiology.msu.edu/data/PGSC_DM_V403_genes.gff.zip
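For orientation, a GFF3 feature line consists of nine tab-separated columns, the last of which holds semicolon-separated key=value attributes such as ID and Parent. The sketch below (Python 3, standard library only) parses a single line. It is an illustration of the format and not the parser SIGA.py uses, which relies on gffutils. The function name parse_gff3_line and the example record are made up.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attrs = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # Values may be comma-separated lists with URL-escaped characters.
            attrs[key.strip()] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attrs
    return feature


# Made-up record for illustration, not taken from the files above.
example = "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"])
print(feature["attributes"]["Parent"])
```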
Generate RDF graph
- GFF->DB
python SIGA.py db -rV ../examples/features.gff3 # output *.db
- DB->RDF (default: turtle)
python SIGA.py rdf -c config.ini ../examples/features.db # output *.ttl
Summary of I/O files:
- config file: config.ini
- GFF file: features.gff3
- SQLite DB file: features.db
- RDF Turtle file: features.ttl
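The -r option in the GFF->DB step checks the referential integrity of parent-child feature relationships. As a rough idea of what such a check involves, the sketch below looks for relations whose parent or child ID has no matching feature record. The two-table schema (features, relations) is a simplified assumption for illustration and may differ from the database SIGA.py actually writes.

```python
import sqlite3

# Hypothetical, simplified schema with one deliberately broken relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT);
    CREATE TABLE relations (parent TEXT, child TEXT);
    INSERT INTO features VALUES ('gene1', 'gene'), ('mRNA1', 'mRNA');
    INSERT INTO relations VALUES ('gene1', 'mRNA1'), ('gene2', 'mRNA2');
""")


def find_dangling_relations(conn):
    """Return (parent, child) pairs where either ID is missing from features."""
    query = """
        SELECT r.parent, r.child
        FROM relations AS r
        LEFT JOIN features AS p ON p.id = r.parent
        LEFT JOIN features AS c ON c.id = r.child
        WHERE p.id IS NULL OR c.id IS NULL
        ORDER BY r.parent, r.child
    """
    return conn.execute(query).fetchall()


print(find_dangling_relations(conn))  # [('gene2', 'mRNA2')]
```

An empty result would mean that every relation points at existing features.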
Import RDF graph into Virtuoso RDF Quad Store
See the documentation on bulk data loading.
Edit the virtuoso.ini config file by adding /mydir/ to DirsAllowed.
Connect to the db server as the dba user:
isql 1111 dba dba
Delete (existing) RDF graph if necessary:
SPARQL CLEAR GRAPH <http://solgenomics.net/genome/Solanum_lycopersicum> ;
Delete any previously registered data files:
DELETE FROM DB.DBA.load_list ;
Register data file(s):
ld_dir('/mydir/', 'features.ttl', 'http://solgenomics.net/genome/Solanum_lycopersicum') ;
List registered data file(s):
SELECT * FROM DB.DBA.load_list ;
Bulk data loading:
rdf_loader_run() ;
Re-index triples for full-text search (via Faceted Browser):
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ() ;
Note: A single data file can be uploaded using the following command:
SPARQL LOAD "file:///mydir/features.ttl" INTO "http://solgenomics.net/genome/Solanum_lycopersicum" ;
Count imported RDF triples:
SPARQL
SELECT COUNT(*)
FROM <http://solgenomics.net/genome/Solanum_lycopersicum>
WHERE { ?s ?p ?o } ;
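When several graphs need to be loaded, the isql statements above can be generated instead of typed by hand. The following sketch only assembles the statements shown in this section into one string. The function name bulk_load_script is hypothetical, and the arguments are inserted without any escaping, so it is suitable for trusted values only.

```python
def bulk_load_script(data_dir, file_mask, graph_uri, clear_graph=False):
    """Return the isql statements for Virtuoso bulk loading as one string."""
    statements = []
    if clear_graph:
        # Optional: drop the existing graph before loading.
        statements.append("SPARQL CLEAR GRAPH <%s> ;" % graph_uri)
    statements += [
        "DELETE FROM DB.DBA.load_list ;",
        "ld_dir('%s', '%s', '%s') ;" % (data_dir, file_mask, graph_uri),
        "rdf_loader_run() ;",
        "DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ() ;",
    ]
    return "\n".join(statements)


print(bulk_load_script("/mydir/", "features.ttl",
                       "http://solgenomics.net/genome/Solanum_lycopersicum"))
```

The printed statements can then be pasted into an isql session.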
Alternatively, import the RDF graph into Berkeley DB (requires the Redland RDF processor):
rdfproc features parse features.ttl turtle
rdfproc features serialize turtle
The software is released under the Apache License, Version 2.0.