04 infernal1.1 #7

Merged: 8 commits, Jan 9, 2014
8 changes: 8 additions & 0 deletions .gitignore
@@ -1,3 +1,11 @@
/src
/build
/dist
*-env
*.pyc
*.egg-info
tests/*/output/*
tests/*/temp.fasta
tests/hrefpkg-build/hrefpkg
*.ssi

11 changes: 11 additions & 0 deletions CHANGES.txt
@@ -0,0 +1,11 @@
=====================
Changes for deenurp
=====================

0.0.2-dev
=========

* require Infernal 1.1
* update 16S alignment model (from https://github.com/rdpstaff/fungene_pipeline/blob/eab8ab3751da687b4d6dbd553f6a1d8261d98385/resources/RRNA_16S_BACTERIA/model.cm)
* add bin/bootstrap.sh to create execution environment with installed dependencies

2 changes: 1 addition & 1 deletion README.txt
@@ -19,7 +19,7 @@ First, install binary dependencies:
 ``cat requirements.txt | xargs -n 1 pip install``

 * uclust 1.1
-* Infernal version 1.0.2, (http://infernal.janelia.org/)
+* Infernal version 1.1, (http://infernal.janelia.org/)
 * pplacer suite (http://matsen.fhcrc.org/pplacer)
 * FastTree 2 (http://www.microbesonline.org/fasttree/#Install)

114 changes: 114 additions & 0 deletions bin/bootstrap.sh
@@ -0,0 +1,114 @@
#!/bin/bash

# Usage: [PYTHON=/path/to/python] bootstrap.sh [virtualenv-name]
#
# Create a virtualenv, and install requirements to it.
#
# specify a python interpreter using
# `PYTHON=/path/to/python bootstrap.sh`

set -e

srcdir(){
tar -tf $1 | head -1
}

if [[ -z $1 ]]; then
venv=$(basename $(pwd))-env
else
venv=$1
fi

if [[ -z $PYTHON ]]; then
PYTHON=$(which python)
fi

mkdir -p src

VENV_VERSION=1.10.1
PPLACER_VERSION=1.1
INFERNAL_VERSION=1.1
UCLUST_VERSION=1.2.22

# Create a virtualenv using a specified version of the virtualenv
# source. This also provides setuptools and pip. Inspired by
# http://eli.thegreenplace.net/2013/04/20/bootstrapping-virtualenv/
VENV_URL='http://pypi.python.org/packages/source/v/virtualenv'

# download virtualenv source if necessary
if [ ! -f src/virtualenv-${VENV_VERSION}/virtualenv.py ]; then
(cd src && \
wget -N ${VENV_URL}/virtualenv-${VENV_VERSION}.tar.gz && \
tar -xf virtualenv-${VENV_VERSION}.tar.gz)
fi

# create virtualenv if necessary
if [ ! -f $venv/bin/activate ]; then
$PYTHON src/virtualenv-${VENV_VERSION}/virtualenv.py $venv
$PYTHON src/virtualenv-${VENV_VERSION}/virtualenv.py --relocatable $venv
else
echo "found existing virtualenv $venv"
fi

source $venv/bin/activate

# install python requirements; note that `pip install -r
# requirements.txt` fails due to install-time dependencies.
while read line; do
pip install -U "$line"
done < requirements.txt

# install deenurp
pip install -e .

# install pplacer and accompanying python scripts
PPLACER_TGZ=pplacer-v${PPLACER_VERSION}-Linux.tar.gz
if [ ! -f $venv/bin/pplacer ]; then
(cd src && \
wget -N http://matsen.fhcrc.org/pplacer/builds/$PPLACER_TGZ && \
tar -xf $PPLACER_TGZ && \
cp $(srcdir $PPLACER_TGZ)/{pplacer,guppy,rppr} ../$venv/bin && \
pip install -U $(srcdir $PPLACER_TGZ)/scripts && \
rm -r $(srcdir $PPLACER_TGZ))
else
echo "$(pplacer --version) is already installed"
fi

# install infernal and easel binaries
INFERNAL=infernal-${INFERNAL_VERSION}-linux-intel-gcc
venv_abspath=$(readlink -f $venv)

if [ ! -f $venv/bin/cmalign ]; then
(cd src && \
wget -N http://selab.janelia.org/software/infernal/${INFERNAL}.tar.gz && \
for binary in cmalign cmconvert esl-alimerge esl-sfetch; do
tar xvf ${INFERNAL}.tar.gz --no-anchored binaries/$binary
done && \
cp ${INFERNAL}/binaries/* ../$venv/bin && \
rm -r ${INFERNAL}
)
else
echo "cmalign is already installed: $(cmalign -h | sed -n 2p)"
fi

# install uclust
if [ ! -f $venv/bin/uclust ]; then
(cd $venv/bin && \
wget -N http://drive5.com/uclust/uclustq${UCLUST_VERSION}_i86linux64 && \
chmod +x uclustq${UCLUST_VERSION}_i86linux64 && \
ln -f uclustq${UCLUST_VERSION}_i86linux64 uclust)
else
echo "$(uclust --version) is already installed"
fi

# install FastTree
if [ ! -f $venv/bin/FastTree ]; then
(cd $venv/bin && \
wget -N http://www.microbesonline.org/fasttree/FastTree && \
chmod +x FastTree)
else
echo "FastTree is already installed: $(FastTree -expert 2>&1 | head -1)"
fi

# correct any more shebang lines
$PYTHON src/virtualenv-${VENV_VERSION}/virtualenv.py --relocatable $venv
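Review note: the `srcdir` helper in bin/bootstrap.sh relies on the first entry listed by `tar -tf` being the tarball's top-level directory. A minimal Python sketch of the same idea follows, using only the standard `tarfile` module. The function below is illustrative and not part of this PR; unlike `tar -tf`, `tarfile` reports directory names without a trailing slash.

```python
import tarfile


def srcdir(tarball):
    """Return the name of the first member of `tarball`, or None if it is empty.

    Mirrors `tar -tf "$1" | head -1` from bin/bootstrap.sh. Illustrative only:
    it assumes the archive lists its top-level directory first.
    """
    with tarfile.open(tarball) as tf:
        first = tf.next()  # TarInfo for the first member, or None
        return first.name if first is not None else None
```

Called on an archive that unpacks into a single directory, this would return that directory's name, which is what the script uses to locate the extracted binaries.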
9,899 changes: 9,899 additions & 0 deletions deenurp/data/RRNA_16S_BACTERIA.cm

Large diffs are not rendered by default.

5,573 changes: 0 additions & 5,573 deletions deenurp/data/bacteria16S_508_mod5.cm

This file was deleted.

4 changes: 2 additions & 2 deletions deenurp/select.py
@@ -74,7 +74,7 @@ def select_sequences_for_cluster(ref_seqs, query_seqs, cluster_name,

     c = itertools.chain(ref_seqs, query_seqs)
     ref_ids = frozenset(i.id for i in ref_seqs)
-    aligned = list(cmalign(c, mpi_args=None))
+    aligned = list(cmalign(c))
     with as_refpkg((i for i in aligned if i.id in ref_ids), threads=1) as rp, \
             as_fasta(aligned) as fasta, \
             tempdir(prefix='jplace') as placedir, \
@@ -108,7 +108,7 @@ def select_sequences_for_whitelist_cluster(ref_seqs, cluster_name, keep_leaves=5
     if len(ref_seqs) <= keep_leaves:
         return ref_seqs

-    aligned = list(cmalign(ref_seqs, mpi_args=None))
+    aligned = list(cmalign(ref_seqs))
     with util.ntf(suffix='.tre') as tf:
         wrap.fasttree(aligned, tf, gtr=True)
         tf.close()
2 changes: 1 addition & 1 deletion deenurp/subcommands/filter_outliers.py
@@ -76,7 +76,7 @@ def filter_sequences(sequence_file, cutoff):
             open(os.devnull) as devnull:
         # Align
         wrap.cmalign_files(sequence_file, a_sto.name,
-                stdout=devnull, mpi_args=None)
+                stdout=devnull)
         # FastTree requires FASTA
         SeqIO.convert(a_sto, 'stockholm', a_fasta, 'fasta')
         a_fasta.flush()
2 changes: 1 addition & 1 deletion deenurp/subcommands/hrefpkg_build.py
@@ -344,7 +344,7 @@ def tax_id_refpkg(tax_id, full_tax, seqinfo, sequence_file,
     wrap.esl_sfetch(sequence_file, test_seq_ids, test_file)

     # Cmalign
-    aligned = wrap.cmalign(sequences, output=sto_fp, mpi_args=None)
+    aligned = wrap.cmalign(sequences, output=sto_fp)
     aligned = list(aligned)
     assert aligned
     # Tree
11 changes: 8 additions & 3 deletions deenurp/test/test_wrap.py
@@ -14,14 +14,19 @@ class CmAlignTestCase(unittest.TestCase):
     def setUp(self):
         self.sequences = list(SeqIO.parse(util.data_path('test_input.fasta'), 'fasta'))

-    def test_nompi(self):
+    def test_oneproc(self):
         result = list(wrap.cmalign(self.sequences))
         self.assertEqual(len(self.sequences), len(result))

-    def test_mpi(self):
-        result = list(wrap.cmalign(self.sequences, mpi_args=['-np', '2']))
+    def test_twoproc(self):
+        result = list(wrap.cmalign(self.sequences, cpu=2))
         self.assertEqual(len(self.sequences), len(result))

+    def test_defaultproc(self):
+        result = list(wrap.cmalign(self.sequences))
+        self.assertEqual(len(self.sequences), len(result))
+
+
 class CMTestCase(unittest.TestCase):
     def test_find_cm(self):
         self.assertTrue(os.path.isfile(wrap.CM))
42 changes: 15 additions & 27 deletions deenurp/wrap.py
@@ -9,18 +9,21 @@
 import os
 import os.path
 import subprocess
+import sys

 from Bio import SeqIO
 import peasel
 from taxtastic.refpkg import Refpkg

 from .util import as_fasta, ntf, tempdir, nothing, maybe_tempfile, which, require_executable

+DEFAULT_CMALIGN_THREADS = 1
+
 """Path to item in data directory"""
 data_path = functools.partial(os.path.join, os.path.dirname(__file__), 'data')

 """16S bacterial covariance model"""
-CM = data_path('bacteria16S_508_mod5.cm')
+CM = data_path('RRNA_16S_BACTERIA.cm')

 @contextlib.contextmanager
 def as_refpkg(sequences, name='temp.refpkg', threads=None):
@@ -159,46 +162,31 @@ def rppr_min_adcl_tree(newick_file, leaves, algorithm='pam',
     output = subprocess.check_output(cmd)
     return output.splitlines()

-def _cmalign_has_mpi():
-    """
-    Returns whether cmalign was compiled with MPI support
-    """
-    require_executable('cmalign')
-    o = subprocess.check_output(['cmalign', '-h'])
-    return '--mpi' in o
-
-def cmalign_files(input_file, output_file, mpi_args=None, cm=CM,
-        stdout=None):
-    has_mpi = _cmalign_has_mpi()
-    if (mpi_args is not None) and not has_mpi:
-        logging.warn('MPI arguments %s passed to cmalign_files, '
-                     'but cmalign does not appear to have MPI support. '
-                     'Running without MPI.',
-                     mpi_args)
-    if mpi_args is not None and has_mpi:
-        cmd = ['mpirun'] + mpi_args + ['cmalign', '--mpi']
-    else:
-        cmd = ['cmalign']
+
+def cmalign_files(input_file, output_file, cm=CM, cpu=None, stdout=None):
+    cmd = ['cmalign']
     require_executable(cmd[0])
     cmd.extend(['--sub', '-1', '--dna', '--hbanded'])
     cmd.extend(['--noprob', '--dnaout'])
+    if cpu is not None:
+        cmd.extend(['--cpu', str(cpu)])
+
     cmd.extend(['-o', output_file, cm, input_file])
     logging.debug(' '.join(cmd))
-    subprocess.check_call(cmd, stdout=stdout)
+    subprocess.check_call(cmd, stdout=stdout, stderr=sys.stderr)
+

-def cmalign(sequences, output=None, mpi_args=None, cm=CM):
+def cmalign(sequences, output=None, cm=CM, cpu=DEFAULT_CMALIGN_THREADS):
     """
     Run cmalign
-
-    If mpi_args is specified, run via mpirun
     """
     with as_fasta(sequences) as fasta, open(os.devnull) as devnull, \
             maybe_tempfile(output, prefix='cmalign', suffix='.sto', dir='.') as tf:
-        cmalign_files(fasta, tf.name, mpi_args=mpi_args, stdout=devnull, cm=cm)
+        cmalign_files(fasta, tf.name, stdout=devnull, cm=cm, cpu=cpu)

         for sequence in SeqIO.parse(tf, 'stockholm'):
             yield sequence
-
+

 def esl_sfetch(sequence_file, name_iter, output_fp, use_temp=False):
     """
     Fetch sequences named in name_iter from sequence_file, indexing if
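Review note: the wrap.py change above replaces the `mpirun` wrapper with an optional `--cpu` argument. A minimal sketch of just the command-line assembly is shown below, so the resulting argv can be inspected without running cmalign. `build_cmalign_cmd` is a hypothetical helper name that does not exist in deenurp; the flags are copied verbatim from the diff and are not checked here against the Infernal 1.1 manual.

```python
def build_cmalign_cmd(input_file, output_file, cm, cpu=None):
    """Return the argv list that the new cmalign_files would execute.

    Hypothetical helper for illustration; the flag list is taken from the diff.
    """
    cmd = ['cmalign']
    cmd.extend(['--sub', '-1', '--dna', '--hbanded'])
    cmd.extend(['--noprob', '--dnaout'])
    if cpu is not None:
        # Thread count is passed as a string, as in the diff.
        cmd.extend(['--cpu', str(cpu)])
    cmd.extend(['-o', output_file, cm, input_file])
    return cmd


if __name__ == '__main__':
    print(' '.join(build_cmalign_cmd('in.fasta', 'out.sto', 'model.cm', cpu=2)))
```

Note that `cmalign()` now defaults to `cpu=DEFAULT_CMALIGN_THREADS` (1), while `cmalign_files()` defaults to `cpu=None` and then omits `--cpu` entirely, leaving the thread count to cmalign itself.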
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
 numpy>=1.5.0
 biopython>=1.58
 cogent>=1.5.1
-svn+https://tax2tree.svn.sourceforge.net/svnroot/tax2tree/trunk
+svn+https://svn.code.sf.net/p/tax2tree/code/trunk
 git+https://github.com/cmccoy/peasel.git@b1d5783d8d6d56cb0274e7977d093583c9c5f968
 taxtastic==0.4.0
 futures>=2.0
2 changes: 1 addition & 1 deletion setup.py
@@ -51,7 +51,7 @@ def run(self):
 install_requires = []

 setup(name='deenurp',
-      version='0.0.1',
+      version='0.0.2-dev',
       package_data={'deenurp': ['data/*', 'test/data/*']},
       entry_points={'console_scripts': {'deenurp = deenurp.scripts.deenurp:main'}},
       install_requires=install_requires,
8 changes: 7 additions & 1 deletion tests/filter-outliers/run.sh
@@ -1,4 +1,10 @@
 #!/bin/bash
+
+set -e
+
+rm -rf output
+mkdir output
+
 BASE=../rdp_10_30_named1200bp_subset
 DEENURP=${DEENURP-../../deenurp.py}
-$DEENURP filter-outliers $BASE.fasta $BASE.seqinfo.csv $BASE.taxonomy.csv filtered.fasta --filtered-seqinfo filtered.seqinfo.csv
+$DEENURP filter-outliers $BASE.fasta $BASE.seqinfo.csv $BASE.taxonomy.csv output/filtered.fasta --filtered-seqinfo output/filtered.seqinfo.csv
6 changes: 5 additions & 1 deletion tests/hrefpkg-build/run.sh
@@ -1,5 +1,9 @@
 #!/bin/bash
+
+set -e
+
 BASE=../rdp_10_30_named1200bp_subset
-mkdir -p hrefpkg
+rm -rf hrefpkg
+mkdir hrefpkg
 DEENURP=${DEENURP-../../deenurp.py}
 $DEENURP hrefpkg-build --index-rank=family $BASE.fasta $BASE.seqinfo.csv $BASE.taxonomy.csv --output-dir hrefpkg --threads 6
6 changes: 6 additions & 0 deletions tests/run.sh
@@ -0,0 +1,6 @@
#!/bin/bash

for f in $(find . -wholename ./run.sh -prune -o -name run.sh -print); do
echo $(dirname $f)
(cd $(dirname $f) && ./run.sh)
done
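Review note: the `find` expression in tests/run.sh prunes `./run.sh` itself and prints every other `run.sh` beneath the tests directory. A rough Python equivalent of that discovery step is sketched below for readers less used to `find -prune`. `find_test_scripts` is a hypothetical name and not part of this PR.

```python
import os


def find_test_scripts(root, name='run.sh'):
    """Yield paths to `name` in every directory below `root`,
    skipping `root/name` itself (the driver script)."""
    root_abs = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        if name in filenames and os.path.abspath(dirpath) != root_abs:
            yield os.path.join(dirpath, name)


if __name__ == '__main__':
    for script in find_test_scripts('.'):
        print(os.path.dirname(script))
```

The driver would then run each script from within its own directory, as the shell loop does with `cd $(dirname $f) && ./run.sh`.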
7 changes: 5 additions & 2 deletions tests/search-select/run.sh
@@ -5,5 +5,8 @@ set -e
 RDP_BASE=../rdp_10_30_named1200bp_subset
 DEENURP=${DEENURP-../../deenurp.py}

-$DEENURP search-sequences ground-reads.fa search.db $RDP_BASE.fasta $RDP_BASE.seqinfo.csv --group-field=tax_id --blacklist=blacklist.txt
-$DEENURP select-references search.db refs.fasta --seqinfo-out refs.seqinfo.csv --output-meta refs.meta.csv --min-mass-prop 0.01 --whitelist whitelist.txt
+rm -rf output
+mkdir -p output
+
+$DEENURP search-sequences ground-reads.fa output/search.db $RDP_BASE.fasta $RDP_BASE.seqinfo.csv --group-field=tax_id --blacklist=blacklist.txt
+$DEENURP select-references output/search.db output/refs.fasta --seqinfo-out output/refs.seqinfo.csv --output-meta output/refs.meta.csv --min-mass-prop 0.01 --whitelist whitelist.txt