
Introduction to Back End

Bernice-B edited this page Jun 10, 2016 · 1 revision

The back-end consists mostly of the Python data uploader, Blazegraph, and Flask (the intermediary between the database and the front-end).

#Uploader

The uploader is responsible for uploading data to Blazegraph, our triplestore database.

Uploader structure

Most of the back-end is in the folder superphy/src/upload

upload
├── ontology
├── python
│   ├── classes
│   ├── data
│   ├── ontologies -> ../ontology
│   ├── outputs
│   ├── release-rgi-v3.0.1
│   ├── samples
│   └── tmp
├── tests
└── uml

###data/

Contains JSON files with information on hosts, host_categories, microbes, sources, and syndromes for pre-loading into the database, as well as the gene data JSON files. A validation database for NCBI BLAST is also present.

Note: if you are missing the superphy_vf.xml file in this repository when you run the uploader, you can retrieve it from the NAS. It is a virulence factor BLAST result file.

###ontology/

All ontologies used in SuperPhy. They are described in the Ontologies section below.

###outputs/

Where error logs are sent. Error logging is currently done through file IO statements, but it could be replaced with Python's logging module.
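As a minimal sketch of the logging-module alternative mentioned above (Python 3): the helper `get_error_logger`, the logger name `uploader`, and the file name `errors.log` are illustrative assumptions, not names taken from the SuperPhy code.

```python
import logging
import os


def get_error_logger(output_dir="outputs", filename="errors.log"):
    """Return a logger that appends error records to output_dir/filename.

    The directory, file name, and logger name are assumptions for this sketch.
    """
    os.makedirs(output_dir, exist_ok=True)
    logger = logging.getLogger("uploader")
    logger.setLevel(logging.ERROR)
    log_path = os.path.abspath(os.path.join(output_dir, filename))
    # Avoid attaching a second handler for the same file on repeated calls.
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == log_path
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A script could then call `get_error_logger().error("could not upload %s", genome_name)` in place of a manual file write, or `logger.exception(...)` inside an `except` block to capture the traceback as well.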

###python/

Python scripts for the uploader. There are also Python libraries for retrieval that use SPARQL internally.
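To illustrate what a retrieval helper that uses SPARQL internally might look like, here is a rough sketch built only on the standard library. It follows the SPARQL 1.1 Protocol (form-encoded `query` parameter, JSON results). The endpoint URL, the namespace name `superphy`, and the function names are assumptions for illustration; the actual SuperPhy retrieval libraries may be organized quite differently.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: a local Blazegraph with a namespace called "superphy".
# Adjust the host, port, and namespace to match your own instance.
ENDPOINT = "http://localhost:9999/blazegraph/namespace/superphy/sparql"


def build_select_request(query, endpoint=ENDPOINT):
    """Build a SPARQL 1.1 Protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def select(query, endpoint=ENDPOINT):
    """Run a SELECT query and return plain rows (needs a running endpoint)."""
    with urllib.request.urlopen(build_select_request(query, endpoint)) as resp:
        return bindings_to_rows(json.loads(resp.read().decode("utf-8")))
```

With Blazegraph running, `select("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")` would return up to five dictionaries keyed by `s`. Keeping the request building and the result flattening in separate functions makes both testable without a live database.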

####Workflow

The workflow is currently divided into separate files.

  • main.py: Initializes the Blazegraph namespace and uploads all ontologies to the database.
  • metadata_upload.py: Uploads sample genome files and gene files.
    • For gene uploads, data/superphy_vf.json is the virulence factor file and data/card.json is the AMR gene file.
  • contig_upload.py: For every genome in the database that has no contigs yet, downloads the sequence FASTA file and uploads its contigs. Also validates the sequences using methods from sequence_validation.py.
  • gene_location_upload.py: Performs the gene identification analysis. For virulence factors, reference genes must first be found and are then BLASTed for gene identification. AMR gene analysis uses the RGI from CARD.

Examples can be found at the bottom of each file.
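The steps above could be chained with a small driver script. The sketch below assumes the files are run in the order listed and that each can be invoked as a script with no arguments; neither is confirmed here, so check the examples at the bottom of each file before relying on it. The names `WORKFLOW_STEPS`, `build_commands`, and `run_workflow` are hypothetical.

```python
import subprocess
import sys

# Script names come from the list above; running them in list order, with no
# arguments, is an assumption of this sketch.
WORKFLOW_STEPS = [
    "main.py",
    "metadata_upload.py",
    "contig_upload.py",
    "gene_location_upload.py",
]


def build_commands(python_exe=sys.executable, steps=WORKFLOW_STEPS):
    """Return one argv list per workflow step, in execution order."""
    return [[python_exe, step] for step in steps]


def run_workflow(cwd, runner=subprocess.run):
    """Run each step from the given directory, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError if a step exits non-zero.
        runner(command, cwd=cwd, check=True)
```

Calling `run_workflow("superphy/src/upload/python")` would then execute the four scripts one after another. The `runner` parameter exists only so the sequencing can be exercised without starting real processes.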

###release-rgi-v3.0.1/

Folder for RGI. Documentation can be found in this folder's README.

###samples/

Sample files for uploading.

###tests/

Unit tests for the uploader. Some of the tests assume that Blazegraph contains no data.

###uml/

UML diagrams for the classes in the uploader.

UMLs

Note that not all relationships between classes are shown, since the UML for the entire project is broken up into the categories below.

Ontologies

Ontologies are the backbone of the triplestore database: they lay out the models and relationships for the data. We use Protégé to edit our ontologies because it provides a convenient user interface. A tutorial is available on the resources page.

Flask

###Notes (i.e. haven't really figured out how to organize these thoughts yet)

  • Format the data in Flask using Python when sending/getting requests, not in Mithril.
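As a rough illustration of that note, the sketch below shapes query rows on the Python side so the front-end receives data it can render directly. The row keys `genome` and `contig`, the payload layout, and both function names are hypothetical. `json.dumps` stands in for Flask's `jsonify` so the example runs without Flask installed.

```python
import json


def format_genomes(rows):
    """Group flat (genome, contig) rows into one record per genome.

    The row keys "genome" and "contig" are hypothetical; real query
    results will use whatever variable names the SPARQL query selects.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row["genome"], []).append(row["contig"])
    return [
        {"genome": genome, "contigs": sorted(contigs)}
        for genome, contigs in sorted(grouped.items())
    ]


def to_response_body(rows):
    """Serialize the formatted records as the JSON body of a response.

    In a Flask view this step would typically be flask.jsonify(...);
    json.dumps is used here so the sketch has no Flask dependency.
    """
    return json.dumps({"genomes": format_genomes(rows)})
```

Because the grouping and sorting happen in Python, the Mithril code only has to iterate over `genomes` and display each record.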

SLURM (Simple Linux Utility for Resource Management)