Merge pull request #34 from compbiocore/develop
Develop
fernandogelin authored Jul 15, 2019
2 parents 05cf68a + 347f566 commit a179d3e
Showing 10 changed files with 301 additions and 130 deletions.
3 changes: 2 additions & 1 deletion .travis.yml
@@ -3,7 +3,6 @@ os:
- linux
sudo: false
python:
- "2.7"
- "3.6"
script:
- python setup.py test
@@ -23,6 +22,8 @@ jobs:
include:
- stage: deploy docs
language: python
python:
- "3.6"
install:
- pip install mkdocs==1
- pip install mkdocs-material==3.0.3
118 changes: 117 additions & 1 deletion docs/usage.md
@@ -1,5 +1,121 @@
RefChef provides two main commands: `refchef-cook` and `refchef-menu`. `refchef-cook` reads the recipes and executes the commands that retrieve the references, indices, or annotations. `refchef-menu` summarizes the items already on the system.

- See the installation instructions for how to install refchef.
- Create your own local repository for tracking references:
```
cd /Volumes/jwalla12
git init local_references
```

- Create a directory for refchef to store your references:
```
mkdir /Volumes/jwalla12/references
```

- Create a `master.yaml` file and save it in your git repository. This file contains the commands that will be executed to download your references, along with some additional metadata. For details of the .yaml file format, see https://compbiocore.github.io/refchef/specs/. The commands in `master.yaml` should always include one that creates the `final_checksums.md5` file, because RefChef reads that file to derive an id for the reference. As a minimal example, the following `master.yaml` downloads the GRCh38 human genome from Ensembl:
```
grch38:
  metadata:
    name: grch38_release87
    species: Homo sapiens
    organization: ensembl
    downloader: jrwallace
  levels:
    references:
    - component: primary
      complete:
        status: false
      commands:
      - wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
      - wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
      - md5sum *.gz > postdownload-checksums.md5
      - gunzip *.gz
      - md5sum *.* > final_checksums.md5
```
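The final `md5sum *.* > final_checksums.md5` command can also be sketched in Python, for example on a system where `md5sum` is not installed. This is only an illustration and not part of RefChef: the names `md5_of_file` and `write_checksums` are hypothetical, and the output is assumed to follow the GNU `md5sum` layout (`<digest>  <filename>`). The sketch skips the output file itself so that reruns give the same result.

```python
import glob
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(directory, out_name="final_checksums.md5"):
    """Write '<digest>  <filename>' lines for every '*.*' file in directory."""
    out_path = os.path.abspath(os.path.join(directory, out_name))
    lines = []
    # '*.*' mirrors the shell glob: only names containing a dot, no hidden files.
    for path in sorted(glob.glob(os.path.join(directory, "*.*"))):
        if os.path.isfile(path) and os.path.abspath(path) != out_path:
            lines.append("{}  {}".format(md5_of_file(path), os.path.basename(path)))
    with open(out_path, "w") as handle:
        handle.write("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

Note that, like the shell glob, this ignores files without a dot in their name, such as `CHECKSUMS`.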
- In addition to the .yaml file, you will need to specify three details: (1) where the references should be saved, (2) the local git repository used for version control of references, and (3) the remote GitHub repository used for version control of reference sequences. You can provide these in a `cfg.ini` or `cfg.yaml` file, or pass them as arguments to `refchef-cook`, the command that reads your `master.yaml` file and downloads the references. The following example passes them as arguments and does not push the references to a remote repository:

```
refchef-cook -e -o /Volumes/jwalla12/references -gl /Volumes/jwalla12/local_references
```

todo: add examples re: using a cfg file and remote repo

- You should then see output similar to the following:
```
/anaconda3/lib/python3.7/site-packages/refchef/utils.py:12: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
dict_ = yaml.load(yml)
🐶 RefChef... getting reference: grch38, component: primary
Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
--2019-07-12 15:56:56-- ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
=> ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-87/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881214448
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881214448 (840M) (unauthoritative)
Homo_sapiens.GRCh38.d 100%[=======================>] 840.39M 6.71MB/s in 4m 26s
2019-07-12 16:01:25 (3.16 MB/s) - ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881214448]
Running command "wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS"
--2019-07-12 16:01:25-- ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
=> ‘CHECKSUMS’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-87/fasta/homo_sapiens/dna ... done.
==> SIZE CHECKSUMS ... 5010
==> PASV ... done. ==> RETR CHECKSUMS ... done.
Length: 5010 (4.9K) (unauthoritative)
CHECKSUMS 100%[=======================>] 4.89K --.-KB/s in 0s
2019-07-12 16:01:27 (97.5 MB/s) - ‘CHECKSUMS’ saved [5010]
Running command "md5sum *.gz > postdownload-checksums.md5"
Running command "gunzip *.gz"
Running command "md5sum *.* > final_checksums.md5"
```

- After the command finishes, `master.yaml` is updated to reflect that the references have been downloaded, and it now looks like this:
```
grch38:
  metadata:
    name: grch38_release87
    species: Homo sapiens
    organization: ensembl
    downloader: jrwallace
  levels:
    references:
    - component: primary
      complete:
        status: true
        time: 2019-07-12 16:02:25.505498
      commands:
      - wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
      - wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/CHECKSUMS
      - md5sum *.gz > postdownload-checksums.md5
      - gunzip *.gz
      - md5sum *.* > final_checksums.md5
      location: /Volumes/jwalla12/references/grch38/primary
      files:
      - CHECKSUMS
      - final_checksums.md5
      - Homo_sapiens.GRCh38.dna.primary_assembly.fa
      - metadata.txt
      - postdownload-checksums.md5
```
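Once parsed, the updated `master.yaml` is a nested mapping, so a short script can report which components have been completed. The sketch below is illustrative only: the dict literal stands in for the parsed file and is trimmed to the fields it uses, the nesting (`levels` → `references` → list of component entries) is assumed from the example above, and `summarize` is a hypothetical helper, not how `refchef-menu` is implemented.

```python
# Stand-in for the parsed master.yaml shown above (trimmed to the fields used here).
master = {
    "grch38": {
        "metadata": {
            "name": "grch38_release87",
            "species": "Homo sapiens",
            "organization": "ensembl",
            "downloader": "jrwallace",
        },
        "levels": {
            "references": [
                {
                    "component": "primary",
                    "complete": {"status": True, "time": "2019-07-12 16:02:25.505498"},
                    "location": "/Volumes/jwalla12/references/grch38/primary",
                },
            ],
        },
    },
}


def summarize(master_dict):
    """Return one (key, level, component, completed) tuple per component entry."""
    rows = []
    for key, record in master_dict.items():
        for level, entries in record.get("levels", {}).items():
            for entry in entries:
                # A missing 'complete' block is treated as not yet downloaded.
                status = entry.get("complete", {}).get("status", False)
                rows.append((key, level, entry["component"], bool(status)))
    return rows


for row in summarize(master):
    print(row)
```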

todo: add information re: adding references already present elsewhere (should the command be more like a cp command?)

#### User workflow diagram

![Diagram](assets/refchef-diagram.svg)
@@ -12,7 +128,7 @@ Both scripts can take a `--config (-c)` argument with the path for a config file
config-yaml:
  path-settings:
    reference-directory: ~/data/references_dir # directory where references will be downloaded and processed.
    github-directory: ~/data/git_local # local git repository where `master.yaml` is located.
    git-directory: ~/data/git_local # local git repository where `master.yaml` is located.
    remote-repository: user/repo # remote user and repository for version control of `master.yaml`
  log-settings:
    log: 'yes'
5 changes: 3 additions & 2 deletions refchef/config.py
@@ -9,6 +9,7 @@
pass



class Config:
def __init__(self, reference_dir, git_local, git_remote, log):
self.reference_dir = os.path.expanduser(reference_dir)
@@ -25,7 +26,7 @@ def yaml(path):
d['reference_dir'] = dict_['config-yaml']['path-settings']['reference-directory']
d['git_local'] = dict_['config-yaml']['path-settings']['git-directory']
d['git_remote'] = dict_['config-yaml']['path-settings']['remote-repository']
d['log'] = dict_['config-yaml']['log-settings']['log']
d['log'] = utils.process_logical(dict_['config-yaml']['log-settings']['log'])
# d['break_on_error'] = dict_['config-yaml']['runtime-settings']['break-on-error']
# d['verbose'] = dict_['config-yaml']['runtime-settings']['verbose']

@@ -40,7 +41,7 @@ def ini(path):
d['reference_dir'] = config.get('path-settings', 'reference-directory')
d['git_local'] = config.get('path-settings', 'git-directory')
d['git_remote'] = config.get('path-settings', 'remote-repository')
d['log'] = config.get('log-settings', 'log')
d['log'] = utils.process_logical(config.get('log-settings', 'log'))
# d['break_on_error'] = config.get('runtime-settings', 'break-on-error')
# d['verbose'] = config.get('runtime-settings', 'verbose')
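
For context, the `log` value arrives from either config format as a string such as `'yes'`, and any non-empty string (including `'no'`) is truthy in Python, which is presumably why it is now passed through `utils.process_logical`. The sketch below shows the idea using only the standard library. It is not RefChef's code: `to_bool` is a hypothetical stand-in for `utils.process_logical` (whose implementation is not part of this diff), the set of accepted strings is an assumption, and the example values are taken from the `cfg.yaml` example in `docs/usage.md`.

```python
import configparser

# Example cfg.ini contents using the section and key names read by ini().
EXAMPLE_INI = """
[path-settings]
reference-directory = ~/data/references_dir
git-directory = ~/data/git_local
remote-repository = user/repo

[log-settings]
log = yes
"""


def to_bool(value):
    """Hypothetical stand-in for utils.process_logical: map yes/no-style values to bool."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("yes", "true", "1", "on")


def read_ini_text(text):
    """Parse config text and return the same keys that ini() collects."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "reference_dir": parser.get("path-settings", "reference-directory"),
        "git_local": parser.get("path-settings", "git-directory"),
        "git_remote": parser.get("path-settings", "remote-repository"),
        "log": to_bool(parser.get("log-settings", "log")),
    }


settings = read_ini_text(EXAMPLE_INI)
```

With this normalization, `log = no` yields `False` rather than the truthy string `'no'`.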

2 changes: 2 additions & 0 deletions refchef/github_utils.py
@@ -8,6 +8,8 @@
from refchef import config
from refchef.utils import *



def setup_git(conf):
git_dir = os.path.join(conf.git_local, '.git')
work_tree = os.path.join(conf.git_local, '')
26 changes: 16 additions & 10 deletions refchef/references.py
@@ -13,6 +13,7 @@
def execute(conf, file_name):
"""Process all steps to create directories, fetch files, and update yaml for
references/indices/annotations"""

yaml_file = os.path.join(conf.git_local, file_name)
yaml_dict = utils.read_yaml(yaml_file)
keys = list(yaml_dict.keys())
@@ -32,7 +33,6 @@ def execute(conf, file_name):
k,
component)
logging.info(to_print)
print(to_print)

# Fetch references
fetch(entry['commands'], path_)
@@ -71,7 +71,7 @@ def fetch(command_list, directory):
""" Run all commands from within the given directory"""
for c in command_list:
with cd(directory):
print("Running command \"{}\"".format(c))
logging.info("Running command \"{}\"".format(c))
subprocess.call(c, shell=True)

def get_filenames(path_):
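
A note on the `fetch` hunk: `subprocess.call` returns the command's exit status, which `fetch` currently discards, so a failed `wget` is logged the same way as a successful one. The following is a minimal sketch of a variant that records the status. It is an illustration, not the project's code: `run_commands` and the logger name are hypothetical, and it passes `cwd=` to `subprocess.call` instead of using the `cd` context manager.

```python
import logging
import subprocess

logger = logging.getLogger("fetch_sketch")  # hypothetical logger name


def run_commands(command_list, directory):
    """Run each shell command inside directory and return the list of exit codes."""
    codes = []
    for command in command_list:
        logger.info('Running command "%s"', command)
        # shell=True mirrors the diff; commands are assumed to come from a trusted file.
        code = subprocess.call(command, shell=True, cwd=directory)
        if code != 0:
            logger.warning('Command "%s" exited with status %d', command, code)
        codes.append(code)
    return codes
```

A caller could stop at the first non-zero code, or leave the `complete` status as `false`, depending on the desired behavior.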
@@ -82,14 +82,20 @@ def get_filenames(path_):

def add_uuid(path_):
"""Reads final_checksums.md5 and returns id."""
with open(os.path.join(path_, 'final_checksums.md5'), 'r') as f:
line = f.readline().replace('\n','')
if sys.platform == 'darwin':
id_ = line.split(" = ")[1]
else:
id_ = line.split(" ")[0]

return str(uuid.uuid3(uuid.NAMESPACE_DNS, id_))
if os.path.exists(os.path.join(path_, 'final_checksums.md5')):
with open(os.path.join(path_, 'final_checksums.md5'), 'r') as f:
line = f.readline().replace('\n','')
if sys.platform == 'darwin':
cs = line.split(" = ")[1]
else:
cs = line.split(" ")[0]

return str(uuid.uuid3(uuid.NAMESPACE_DNS, cs))
else:
logging.warning("No final_checksums.md5 found. UUID will not correspond to checksum.")
return str(uuid.uuid1())

return _id
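
The checksum-to-UUID logic in `add_uuid` can be sketched as a standalone function. This is an illustrative variant under stated assumptions, not the committed code: `checksum_from_line` and `uuid_for_directory` are hypothetical names, and the line format is detected from its content (`MD5 (file) = <digest>` as written by BSD/macOS `md5`, versus `<digest>  <file>` as written by GNU `md5sum`) rather than from `sys.platform`. An empty checksum file is also treated as missing.

```python
import os
import uuid


def checksum_from_line(line):
    """Extract the digest from a BSD/macOS md5 line or a GNU md5sum line."""
    line = line.strip()
    if " = " in line:
        # BSD/macOS layout: "MD5 (file) = <digest>"
        return line.split(" = ")[1]
    # GNU layout: "<digest>  <file>"
    return line.split(" ")[0]


def uuid_for_directory(path_, checksum_name="final_checksums.md5"):
    """Return a checksum-derived UUID, or a time-based one if no checksum is available."""
    checksum_path = os.path.join(path_, checksum_name)
    if os.path.exists(checksum_path):
        with open(checksum_path, "r") as handle:
            first_line = handle.readline()
        if first_line.strip():
            digest = checksum_from_line(first_line)
            # uuid3 is deterministic: the same digest always maps to the same UUID.
            return str(uuid.uuid3(uuid.NAMESPACE_DNS, digest))
    # Fallback as in the diff: this UUID does not correspond to any checksum.
    return str(uuid.uuid1())
```

Because `uuid3` is deterministic, the same first checksum line yields the same id on any machine, while the fallback id differs on every call.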

def create_metadata_file(metadata, path_):
"""Creates metadata.txt file."""
1 change: 1 addition & 0 deletions refchef/table_utils.py
@@ -13,6 +13,7 @@
from refchef.github_utils import read_menu_from_github
from refchef.utils import *


def get_full_menu(master):
"""Reads yaml file and converts to a table format"""

Expand Down
1 change: 1 addition & 0 deletions refchef/utils.py
@@ -6,6 +6,7 @@
from collections import OrderedDict, defaultdict, Mapping
from future.utils import iteritems


def read_yaml(file_path):
"""Simple function to read yaml file"""
with open(file_path) as yml: