Copy new release from internal repo - GenBank format, model training.
Showing 82 changed files with 9,495 additions and 1,046 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,8 @@ | ||
.idea | ||
*.pyc | ||
__pycache__ | ||
/dist | ||
/build | ||
/work | ||
*.egg-info | ||
.pytest_cache |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,48 +1,61 @@ | ||
# DeepBGC: Biosynthetic Gene Cluster detection and classification. | ||
# DeepBGC: Biosynthetic Gene Cluster detection and classification | ||
|
||
## Install DeepBGC | ||
DeepBGC detects BGCs in bacterial and fungal genomes using deep learning. | ||
DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network | ||
and a word2vec-like vector embedding of Pfam protein domains. | ||
Product class and activity of detected BGCs are predicted using a Random Forest classifier. | ||
|
||
- Run `pip install deepbgc` to install the `deepbgc` python module. | ||
- **Note**: Tensorflow is not available for Python 3.7 ([link](https://github.com/tensorflow/tensorflow/issues/17022)) so please use Python 3.6 if you experience this issue. | ||
[](https://pypi.python.org/pypi/deepbgc/) | ||
 | ||
 | ||
[](https://badge.fury.io/py/deepbgc) | ||
|
||
## Prerequisites | ||
 | ||
|
||
- Install Python 3.6 (version 3.7 is not supported by TensorFlow yet) | ||
## Install using pip | ||
|
||
- Install Python version 2.7+ or 3.4+ | ||
- Install Prodigal and put the `prodigal` binary on your PATH: https://github.com/hyattpd/Prodigal/releases | ||
- Install HMMER and put the `hmmscan` and `hmmpress` binaries on your PATH: http://hmmer.org/download.html | ||
- Download and **extract** Pfam database from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/Pfam-A.hmm.gz | ||
- Run `pip install deepbgc` to install the DeepBGC package | ||
|
||
## Use DeepBGC | ||
|
||
### Detection | ||
### Download models and Pfam database | ||
|
||
Detect BGCs in a genomic sequence. | ||
Before you can use DeepBGC, download trained models and Pfam database: | ||
|
||
```bash | ||
# Show detection help | ||
deepbgc detect --help | ||
|
||
# Detect BGCs in a nucleotide sequence | ||
deepbgc detect --model DeepBGCDetector_v0.0.1.pkl --pfam Pfam-A.hmm --output myCandidates/ myInputSequence.fa | ||
deepbgc download | ||
``` | ||
|
||
# Detect BGCs with >0.9 score in existing Pfam CSV sequence | ||
deepbgc detect --model myModel.pkl --output myStrictCandidates/ -s 0.9 myCandidates/myCandidates.pfam.csv | ||
You can display downloaded dependencies and models using: | ||
|
||
```bash | ||
deepbgc info | ||
``` | ||
|
||
### Classification | ||
### Detection and classification | ||
|
||
Classify BGCs into one or more classes. | ||
 | ||
|
||
```bash | ||
# Show classification help | ||
deepbgc classify --help | ||
Detect and classify BGCs in a genomic sequence. | ||
Proteins and Pfam domains are detected automatically if not already annotated (requires HMMER and Prodigal). | ||
|
||
# Predict biosynthetic class of detected BGCs | ||
deepbgc classify --model RandomForestMIBiGClasses_v0.0.1.pkl --output myCandidates/myCandidates.classes.csv myCandidates/myCandidates.candidates.csv | ||
```bash | ||
# Show command help docs | ||
deepbgc pipeline --help | ||
|
||
# Detect and classify BGCs in mySequence.fa using DeepBGC algorithm and save the output to mySequence directory. | ||
deepbgc pipeline mySequence.fa | ||
``` | ||
|
||
### Trained Models | ||
This will produce a directory with multiple files and a README.txt with file descriptions. | ||
|
||
 | ||
|
||
### Model training | ||
|
||
You can train your own BGC detection and classification models, see `deepbgc train --help` for documentation and examples. | ||
|
||
The trained model files can be found in the GitHub code release [here](https://github.com/Merck/deepbgc/releases). | ||
DeepBGC positives, negatives and other training and validation data can be found on the [releases page](https://github.com/Merck/deepbgc/releases). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,2 @@ | ||
VERSION = '0.0.1' | ||
|
||
from .pipeline import DeepBGCModel | ||
from .__version__ import __version__ | ||
from .pipeline import DeepBGCClassifier, DeepBGCDetector, HmmscanPfamRecordAnnotator, DeepBGCAnnotator, ProdigalProteinRecordAnnotator |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
__version__ = '0.1.0dev' |
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
import argparse | ||
|
||
|
||
class BaseCommand(object): | |
    """ | |
    Base abstract class for commands | |
    """ | |
    command = '' | |
    help = "" | |
|
||
    def add_subparser(self, subparsers): | |
        parser = subparsers.add_parser(self.command, description=self.help, help=self.help, | |
                                       formatter_class=argparse.RawTextHelpFormatter) | |
        parser.set_defaults(func=self) | |
        parser.add_argument('--debug', action='store_true') | |
        self.add_arguments(parser) | |
|
||
    def add_arguments(self, parser): | |
        pass | |
|
||
    def run(self, *args): | |
        raise NotImplementedError() |
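The class above registers each command as an argparse subparser and stores the command object itself under `func`. How the package's entry point dispatches on it is not part of this diff, so the following is only a minimal sketch under assumptions: `HelloCommand` and `main` are hypothetical names for illustration, and `BaseCommand` is repeated here so the block runs on its own.

```python
import argparse


class BaseCommand(object):
    """Copy of the base class from the diff above, repeated so the sketch is self-contained."""
    command = ''
    help = ""

    def add_subparser(self, subparsers):
        parser = subparsers.add_parser(self.command, description=self.help, help=self.help,
                                       formatter_class=argparse.RawTextHelpFormatter)
        parser.set_defaults(func=self)
        parser.add_argument('--debug', action='store_true')
        self.add_arguments(parser)

    def add_arguments(self, parser):
        pass

    def run(self, *args):
        raise NotImplementedError()


class HelloCommand(BaseCommand):
    # Hypothetical command, not part of DeepBGC
    command = 'hello'
    help = "Return a greeting."

    def add_arguments(self, parser):
        parser.add_argument('--name', default='world')

    def run(self, name):
        return 'Hello, {}!'.format(name)


def main(argv):
    # Hypothetical entry point: register commands, parse, then dispatch
    parser = argparse.ArgumentParser(prog='example')
    subparsers = parser.add_subparsers(dest='cmd')
    for command in [HelloCommand()]:
        command.add_subparser(subparsers)
    args = vars(parser.parse_args(argv))
    # 'func' holds the command instance set by set_defaults(func=self)
    command = args.pop('func')
    args.pop('cmd')
    args.pop('debug')
    # Remaining entries are the command's own arguments
    return command.run(**args)
```

Calling `main(['hello', '--name', 'BGC'])` would route the parsed `--name` value to `HelloCommand.run`. Storing the instance in `func` keeps the dispatcher free of any per-command branching.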
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
from __future__ import ( | |
    print_function, | |
    division, | |
    absolute_import, | |
) | |
|
||
from deepbgc import util | |
from deepbgc.command.base import BaseCommand | |
from deepbgc.data import DOWNLOADS | |
|
||
|
||
class DownloadCommand(BaseCommand): | |
    command = 'download' | |
    help = """Download trained models and other file dependencies to the DeepBGC downloads directory.""" | |
|
||
    def add_arguments(self, parser): | |
        pass | |
|
||
    def run(self): | |
        util.download_files(DOWNLOADS) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
from __future__ import ( | |
    print_function, | |
    absolute_import, | |
) | |
|
||
from deepbgc import util | |
from deepbgc.command.base import BaseCommand | |
import logging | |
from datetime import datetime | |
import os | |
from deepbgc.models.wrapper import SequenceModelWrapper | |
|
||
|
||
class InfoCommand(BaseCommand): | |
    command = 'info' | |
    help = """Show DeepBGC summary information about downloaded models and dependencies.""" | |
|
||
    def add_arguments(self, parser): | |
        pass | |
|
||
    def print_model(self, name, model_path): | |
        logging.info("-"*80) | |
        logging.info('Model: %s', name) | |
        try: | |
            model = SequenceModelWrapper.load(model_path) | |
            logging.info('Type: %s', type(model.model).__name__) | |
            logging.info('Version: %s', model.version) | |
            logging.info('Timestamp: %s (%s)', model.timestamp, datetime.fromtimestamp(model.timestamp).isoformat()) | |
        except Exception as e: | |
            logging.warning('Model not supported: %s', e) | |
            return False | |
        return True | |
|
||
    def run(self): | |
        ok = True | |
        custom_dir = os.environ.get(util.DEEPBGC_DOWNLOADS_DIR) | |
        if custom_dir: | |
            logging.info('Using custom downloads dir: %s', custom_dir) | |
|
||
        data_dir = util.get_downloads_dir(versioned=False) | |
        if not os.path.exists(data_dir): | |
            logging.warning('Data downloads directory does not exist yet: %s', data_dir) | |
            logging.warning('Run "deepbgc download" to download all dependencies or set %s env var', util.DEEPBGC_DOWNLOADS_DIR) | |
            ok = False | |
        else: | |
            logging.info('Available data files: %s', os.listdir(data_dir)) | |
|
||
        versioned_dir = util.get_downloads_dir(versioned=True) | |
        if not os.path.exists(versioned_dir): | |
            logging.info('Downloads directory for current version does not exist yet: %s', versioned_dir) | |
            logging.info('Run "deepbgc download" to download current models') | |
            return | |
|
||
        detectors = util.get_available_models('detector') | |
        logging.info('='*80) | |
        logging.info('Available detectors: %s', detectors) | |
|
||
        if not detectors: | |
            logging.warning('Run "deepbgc download" to download current detector models') | |
            ok = False | |
|
||
        for name in detectors: | |
            model_path = util.get_model_path(name, 'detector') | |
            ok = self.print_model(name, model_path) and ok | |
|
||
        classifiers = util.get_available_models('classifier') | |
        logging.info('='*80) | |
        logging.info('Available classifiers: %s', classifiers) | |
|
||
        for name in classifiers: | |
            model_path = util.get_model_path(name, 'classifier') | |
            ok = self.print_model(name, model_path) and ok | |
|
||
        if not classifiers: | |
            logging.warning('Run "deepbgc download" to download current classifier models') | |
            ok = False | |
|
||
        logging.info('='*80) | |
        if ok: | |
            logging.info('All OK') | |
        else: | |
            logging.warning('Some warnings detected, check the output above') |
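A note on the `ok = self.print_model(name, model_path) and ok` lines above: the operand order matters. Because the call comes first, every model is still inspected and logged after an earlier one has failed, while `ok` stays `False`. The sketch below illustrates this with hypothetical helpers (`check_all`, `check_until_failure`, `is_even`) that are not part of DeepBGC.

```python
def check_all(items, check):
    """Run check() on every item; return True only if all checks passed."""
    ok = True
    for item in items:
        # check(item) is evaluated first, so it runs even when ok is already False
        ok = check(item) and ok
    return ok


def check_until_failure(items, check):
    """Contrast: with the operands swapped, `and` short-circuits after a failure."""
    ok = True
    for item in items:
        # Once ok is False, check(item) is never called
        ok = ok and check(item)
    return ok


seen = []


def is_even(number):
    # Hypothetical check that records which items were inspected
    seen.append(number)
    return number % 2 == 0


result = check_all([2, 3, 4], is_even)
```

With the order used in `InfoCommand.run`, a broken detector model would not hide problems in the models listed after it.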