Copy new release from internal repo - GenBank format, model training.
prihoda committed Mar 7, 2019
1 parent f754f60 commit 66a62a8
Showing 82 changed files with 9,495 additions and 1,046 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
.idea
*.pyc
__pycache__
/dist
/build
/work
*.egg-info
.pytest_cache
3 changes: 3 additions & 0 deletions LICENSES_THIRD_PARTY
@@ -16,6 +16,9 @@ BSD 3-Clause License

MIT License (MIT)
* Keras (keras) - https://github.com/keras-team/keras/blob/dc698c5486117780b643eda0a2f60a8753625b8a/LICENSE
* appdirs (appdirs) - https://github.com/ActiveState/appdirs/blob/71eca9837f82857fe4f52598901923df05340cb1/LICENSE.txt
* PyTest (pytest) - https://github.com/pytest-dev/pytest/blob/7f67158/LICENSE
* PyTest-mock (pytest-mock) - https://github.com/pytest-dev/pytest-mock/blob/ab46694/LICENSE

Apache Software License (Apache 2.0)
* TensorFlow (tensorflow) - https://github.com/tensorflow/tensorflow/blob/6b6d843ccab78f9f91c3b98a43ca09ffecad4747/LICENSE
63 changes: 38 additions & 25 deletions README.md
@@ -1,48 +1,61 @@
# DeepBGC: Biosynthetic Gene Cluster detection and classification.
# DeepBGC: Biosynthetic Gene Cluster detection and classification

## Install DeepBGC
DeepBGC detects BGCs in bacterial and fungal genomes using deep learning.
DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network
and a word2vec-like vector embedding of Pfam protein domains.
Product class and activity of detected BGCs is predicted using a Random Forest classifier.

- Run `pip install deepbgc` to install the `deepbgc` python module.
- **Note**: Tensorflow is not available for Python 3.7 ([link](https://github.com/tensorflow/tensorflow/issues/17022)) so please use Python 3.6 if you experience this issue.
[![PyPI license](https://img.shields.io/pypi/l/deepbgc.svg)](https://pypi.python.org/pypi/deepbgc/)
![PyPI - Downloads](https://img.shields.io/pypi/dm/deepbgc.svg?color=green&label=pypi%20downloads)
![GitHub Releases](https://img.shields.io/github/downloads/Merck/deepbgc/latest/total.svg?label=GitHub%20downloads)
[![PyPI version](https://badge.fury.io/py/deepbgc.svg)](https://badge.fury.io/py/deepbgc)

## Prerequisities
![DeepBGC architecture](images/deepbgc.architecture.png?raw=true "DeepBGC architecture")

- Install Python 3.6 (version 3.7 is not supported by TensorFlow yet)
## Install using pip

- Install Python version 2.7+ or 3.4+
- Install Prodigal and put the `prodigal` binary on your PATH: https://github.com/hyattpd/Prodigal/releases
- Install HMMER and put the `hmmscan` and `hmmpress` binaries on your PATH: http://hmmer.org/download.html
- Download and **extract** Pfam database from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/Pfam-A.hmm.gz
- Run `pip install deepbgc` to install the DeepBGC package

## Use DeepBGC

### Detection
### Download models and Pfam database

Detect BGCs in a genomic sequence.
Before you can use DeepBGC, download trained models and Pfam database:

```bash
# Show detection help
deepbgc detect --help

# Detect BGCs in a nucleotide sequence
deepbgc detect --model DeepBGCDetector_v0.0.1.pkl --pfam Pfam-A.hmm --output myCandidates/ myInputSequence.fa
deepbgc download
```

# Detect BGCs with >0.9 score in existing Pfam CSV sequence
deepbgc detect --model myModel.pkl --output myStrictCandidates/ -s 0.9 myCandidates/myCandidates.pfam.csv
You can display downloaded dependencies and models using:

```bash
deepbgc info
```

### Classification
### Detection and classification

Classify BGCs into one or more classes.
![DeepBGC pipeline](images/deepbgc.pipeline.png?raw=true "DeepBGC pipeline")

```bash
# Show classification help
deepbgc classify --help
Detect and classify BGCs in a genomic sequence.
Proteins and Pfam domains are detected automatically if not already annotated (HMMER and Prodigal needed)

# Predict biosynthetic class of detected BGCs
deepbgc classify --model RandomForestMIBiGClasses_v0.0.1.pkl --output myCandidates/myCandidates.classes.csv myCandidates/myCandidates.candidates.csv
```bash
# Show command help docs
deepbgc pipeline --help

# Detect and classify BGCs in mySequence.fa using DeepBGC algorithm and save the output to mySequence directory.
deepbgc pipeline mySequence.fa
```

### Trained Models
This will produce a directory with multiple files and a README.txt with file descriptions.

![Detected BGC Regions](images/deepbgc.bgc.png?raw=true "Detected BGC regions")

### Model training

You can train your own BGC detection and classification models, see `deepbgc train --help` for documentation and examples.

The trained model files can be found in the GitHub code release [here](https://github.com/Merck/deepbgc/releases).
DeepBGC positives, negatives and other training and validation data can be found on the [releases page](https://github.com/Merck/deepbgc/releases).
5 changes: 2 additions & 3 deletions deepbgc/__init__.py
@@ -1,3 +1,2 @@
VERSION = '0.0.1'

from .pipeline import DeepBGCModel
from .__version__ import __version__
from .pipeline import DeepBGCClassifier, DeepBGCDetector, HmmscanPfamRecordAnnotator, DeepBGCAnnotator, ProdigalProteinRecordAnnotator
1 change: 1 addition & 0 deletions deepbgc/__version__.py
@@ -0,0 +1 @@
__version__ = '0.1.0dev'
File renamed without changes.
22 changes: 22 additions & 0 deletions deepbgc/command/base.py
@@ -0,0 +1,22 @@
import argparse


class BaseCommand(object):
    """
    Base abstract class for commands
    """
    command = ''
    help = ""

    def add_subparser(self, subparsers):
        parser = subparsers.add_parser(self.command, description=self.help, help=self.help,
                                       formatter_class=argparse.RawTextHelpFormatter)
        parser.set_defaults(func=self)
        parser.add_argument('--debug', action='store_true')
        self.add_arguments(parser)

    def add_arguments(self, parser):
        pass

    def run(self, *args):
        raise NotImplementedError()
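
For review purposes, here is a minimal sketch of how a `BaseCommand` subclass could be registered and dispatched through `argparse`. The entry point that actually wires the commands together is not part of this excerpt, so `GreetCommand` and `dispatch` are hypothetical names used only for illustration, and the way the parsed arguments are passed to `run` is an assumption.

```python
import argparse


class BaseCommand(object):
    """Trimmed copy of the base class from deepbgc/command/base.py."""
    command = ''
    help = ""

    def add_subparser(self, subparsers):
        parser = subparsers.add_parser(self.command, description=self.help, help=self.help,
                                       formatter_class=argparse.RawTextHelpFormatter)
        # Storing the command instance lets the caller find it after parsing.
        parser.set_defaults(func=self)
        parser.add_argument('--debug', action='store_true')
        self.add_arguments(parser)

    def add_arguments(self, parser):
        pass

    def run(self, *args):
        raise NotImplementedError()


class GreetCommand(BaseCommand):
    """Hypothetical command, not part of DeepBGC."""
    command = 'greet'
    help = "Print a greeting."

    def add_arguments(self, parser):
        parser.add_argument('name')

    def run(self, name):
        return 'Hello, {}'.format(name)


def dispatch(argv, commands):
    """Hypothetical entry point: parse argv and call the selected command."""
    parser = argparse.ArgumentParser(prog='demo')
    subparsers = parser.add_subparsers(dest='cmd')
    for command in commands:
        command.add_subparser(subparsers)
    args = vars(parser.parse_args(argv))
    func = args.pop('func', None)
    if func is None:
        # No subcommand was given, so there is nothing to run.
        parser.print_help()
        return None
    # Assumption: bookkeeping keys are dropped and the rest become keyword arguments.
    args.pop('cmd', None)
    args.pop('debug', None)
    return func.run(**args)
```

Calling `dispatch(['greet', 'Ada'], [GreetCommand()])` selects the `greet` subparser and forwards `name='Ada'` to `GreetCommand.run`.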
20 changes: 20 additions & 0 deletions deepbgc/command/download.py
@@ -0,0 +1,20 @@
from __future__ import (
    print_function,
    division,
    absolute_import,
)

from deepbgc import util
from deepbgc.command.base import BaseCommand
from deepbgc.data import DOWNLOADS


class DownloadCommand(BaseCommand):
    command = 'download'
    help = """Download trained models and other file dependencies to the DeepBGC downloads directory."""

    def add_arguments(self, parser):
        pass

    def run(self):
        util.download_files(DOWNLOADS)
82 changes: 82 additions & 0 deletions deepbgc/command/info.py
@@ -0,0 +1,82 @@
from __future__ import (
    print_function,
    absolute_import,
)

from deepbgc import util
from deepbgc.command.base import BaseCommand
import logging
from datetime import datetime
import os
from deepbgc.models.wrapper import SequenceModelWrapper


class InfoCommand(BaseCommand):
    command = 'info'
    help = """Show DeepBGC summary information about downloaded models and dependencies."""

    def add_arguments(self, parser):
        pass

    def print_model(self, name, model_path):
        logging.info("-"*80)
        logging.info('Model: %s', name)
        try:
            model = SequenceModelWrapper.load(model_path)
            logging.info('Type: %s', type(model.model).__name__)
            logging.info('Version: %s', model.version)
            logging.info('Timestamp: %s (%s)', model.timestamp, datetime.fromtimestamp(model.timestamp).isoformat())
        except Exception as e:
            logging.warning('Model not supported: %s', e)
            return False
        return True

    def run(self):
        ok = True
        custom_dir = os.environ.get(util.DEEPBGC_DOWNLOADS_DIR)
        if custom_dir:
            logging.info('Using custom downloads dir: %s', custom_dir)

        data_dir = util.get_downloads_dir(versioned=False)
        if not os.path.exists(data_dir):
            logging.warning('Data downloads directory does not exist yet: %s', data_dir)
            logging.warning('Run "deepbgc download" to download all dependencies or set %s env var', util.DEEPBGC_DOWNLOADS_DIR)
            ok = False
        else:
            logging.info('Available data files: %s', os.listdir(data_dir))

        versioned_dir = util.get_downloads_dir(versioned=True)
        if not os.path.exists(versioned_dir):
            logging.info('Downloads directory for current version does not exist yet: %s', versioned_dir)
            logging.info('Run "deepbgc download" to download current models')
            return

        detectors = util.get_available_models('detector')
        logging.info('='*80)
        logging.info('Available detectors: %s', detectors)

        if not detectors:
            logging.warning('Run "deepbgc download" to download current detector models')
            ok = False

        for name in detectors:
            model_path = util.get_model_path(name, 'detector')
            ok = self.print_model(name, model_path) and ok

        classifiers = util.get_available_models('classifier')
        logging.info('='*80)
        logging.info('Available classifiers: %s', classifiers)

        for name in classifiers:
            model_path = util.get_model_path(name, 'classifier')
            ok = self.print_model(name, model_path) and ok

        if not classifiers:
            logging.warning('Run "deepbgc download" to download current classifier models')
            ok = False

        logging.info('='*80)
        if ok:
            logging.info('All OK')
        else:
            logging.warning('Some warnings detected, check the output above')
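
A note on the `ok = self.print_model(name, model_path) and ok` lines in `InfoCommand.run`: the operand order appears deliberate, since putting the call on the left of `and` means every model is still printed after an earlier failure. The sketch below illustrates the difference with stand-in checks. `run_all`, `run_until_failure`, and `make_check` are hypothetical helpers, not DeepBGC functions.

```python
calls = []


def make_check(name, result):
    """Build a stand-in check that records its call and returns a fixed result."""
    def check():
        calls.append(name)
        return result
    return check


def run_all(checks):
    """Run every check and report whether all of them passed."""
    ok = True
    for check in checks:
        # The call is the left operand, so it is always evaluated.
        ok = check() and ok
    return ok


def run_until_failure(checks):
    """Same result, but checks after the first failure are never called."""
    ok = True
    for check in checks:
        # With `ok` on the left, `and` short-circuits once ok is False.
        ok = ok and check()
    return ok
```

Both functions return the same boolean, but only `run_all` guarantees that each check runs and can log its own diagnostics, which is the behaviour the `info` command relies on.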