star: Add optional arguments to control memory usage, related to #134
With arguments genome_sasparsed and genome_saindexnbases one can control STAR's memory requirements and usage.
tomazc committed Sep 20, 2017
1 parent 915c976 commit 14c0421
Showing 5 changed files with 42 additions and 16 deletions.
13 changes: 11 additions & 2 deletions docs/source/ref_CLI.txt
@@ -241,7 +241,8 @@ indexstar
=========

usage: iCount indexstar [-h] [-a] [--overhang] [--overhang_min] [--threads]
- [-S] [-F] [-P] [-M]
+ [--genome_sasparsed] [--genome_saindexnbases] [-S]
+ [-F] [-P] [-M]
genome genome_index

Generate STAR genome index.
@@ -255,8 +256,16 @@ optional arguments:
-a , --annotation Annotation that defines splice junctions (default: )
--overhang Sequence length around annotated junctions to be used by STAR when
constructing splice junction database (default: 100)
- --overhang_min TODO (default: 8)
+ --overhang_min Minimum overhang for unannotated junctions (default: 8)
--threads Number of threads that STAR can use for generating index (default: 1)
+ --genome_sasparsed STAR parameter genomeSAsparseD.
+ Suffix array sparsity. Bigger numbers decrease RAM requirements
+ at the cost of mapping speed reduction. Suggested values
+ are 1 (30 GB RAM) or 2 (16 GB RAM) (default: 1)
+ --genome_saindexnbases
+ STAR parameter genomeSAindexNbases.
+ SA pre-indexing string length, typically between 10 and 15.
+ Longer strings require more memory, but result in faster searches (default: 14)
-S , --stdout_log Threshold value (0-50) for logging to stdout. If 0, logging to stdout if turned OFF.
-F , --file_log Threshold value (0-50) for logging to file. If 0, logging to file if turned OFF.
-P , --file_logpath Path to log file.
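Reviewer note: the help text for `--genome_saindexnbases` gives a typical range of 10 to 15. For small genomes, the STAR manual suggests scaling the value down to min(14, log2(GenomeLength)/2 - 1). The following is a minimal sketch of that rule, not code from this commit. The helper name `suggested_saindexnbases` is hypothetical (it is not part of iCount or STAR), and the lower clamp at 1 is an added assumption so that very short sequences still yield a positive value.

```python
import math


def suggested_saindexnbases(genome_length, cap=14):
    """Hypothetical helper: scale genomeSAindexNbases to the genome size.

    Follows the small-genome rule of thumb min(cap, log2(length)/2 - 1),
    rounded down. The clamp to at least 1 is an assumption of this sketch.
    """
    if genome_length < 1:
        raise ValueError('genome_length must be a positive number of bases')
    scaled = int(math.log2(genome_length) / 2 - 1)  # round down
    return max(1, min(cap, scaled))
```

The result could then be passed as `--genome_saindexnbases` when building an index for a reduced reference such as a single chromosome.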
14 changes: 9 additions & 5 deletions docs/source/tutorial.rst
@@ -33,15 +33,15 @@ iCLIP sequencing reads must be mapped to a reference genome. The user can prepar
Another option is to download a release from `ensembl`_. You can use the command ``releases`` to
get a list of available releases supported by **iCount**::

- $ iCount releases
+ $ iCount releases --source ensembl

There are 30 releases available: 88,87,86,85,84,83,82,81,80,79,78,77,76,75,74,73,
72,71,70,69,68,67,66,65,64,63,62,61,60,59


You can then use the command ``species`` to get a list of species available in a release::

- $ iCount species -r 88
+ $ iCount species --source ensembl -r 88

There are 87 species available: ailuropoda_melanoleuca,anas_platyrhynchos,
ancestral_alleles,anolis_carolinensis,astyanax_mexicanus,bos_taurus,
@@ -55,7 +55,7 @@ You can then use the command ``species`` to get a list of species available in a

Let's download the human genome sequence from release 88::

- $ iCount genome homo_sapiens -r 88 --chromosomes 21 MT
+ $ iCount genome --source ensembl homo_sapiens -r 88 --chromosomes 21 MT

Downloading FASTA file into: /..././homo_sapiens.88.chr21_MT.fa.gz
Fai file saved to : /..././iCount/homo_sapiens.88.chr21_MT.fa.gz.fai
@@ -67,7 +67,7 @@ Let's download the human genome sequence from release 88::

And the annotation of the human genome from release 88::

- $ iCount annotation homo_sapiens -r 88
+ $ iCount annotation --source ensembl homo_sapiens -r 88

Downloading GTF to: /..././homo_sapiens.88.gtf.gz
Done.
@@ -77,7 +77,7 @@ The next step is to generate a genome index that is used by `STAR`_ mapper. Let'

$ mkdir hs88 # folder should be empty
$ iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 \
- --annotation homo_sapiens.88.gtf.gz
+ --annotation homo_sapiens.88.gtf.gz --genome_sasparsed 2 --genome_saindexnbases 13

Building genome index with STAR for genome homo_sapiens.88.fa.gz
<timestamp> ..... Started STAR run
@@ -99,6 +99,10 @@ The next step is to generate a genome index that is used by `STAR`_ mapper. Let'
A subfolder ``hs88`` will be created in current working directory. You can specify
alternative relative or absolute paths, e.g., ``indexes/hs88``.

+ .. note::
+ Changing the parameters ``genome_sasparsed`` and ``genome_saindexnbases`` results into
+ lower memory requirements but longer run times.

We are now ready to start mapping iCLIP data to the human genome!

.. _`ensembl`:
4 changes: 2 additions & 2 deletions iCount/cli.py
@@ -117,7 +117,7 @@ def _extract_parameter_data(function):
Every parameter in returned object can have the following entries:
- * name - the name of parameter, preceeded by '--' if it is optional
+ * name - the name of parameter, preceded by '--' if it is optional
* default - the default value (only for optional parameters). Extracted
from function signature.
* type - type of parameter, extracted from function docstring. If not
@@ -391,7 +391,7 @@ def verbose_help(mode):

# all_args command:
def all_args():
- """Print all posssible parameter names and CLI commands where they are used."""
+ """Print all possible parameter names and CLI commands where they are used."""
for param_name, commands in sorted(PARAMETERS.items(), key=lambda x: x[0].lstrip('-')):
if param_name in SHORT_OPTARG_NAMES:
short_name = ' ({})'.format(SHORT_OPTARG_NAMES[param_name])
11 changes: 6 additions & 5 deletions iCount/examples/tutorial.sh
@@ -4,16 +4,17 @@ set -vx
mkdir tutorial_example
cd tutorial_example

- iCount releases
+ iCount releases --source ensembl

- iCount species -r 88
+ iCount species --source ensembl -r 88

- iCount genome homo_sapiens -r 88 --chromosomes 21 MT
+ iCount genome --source ensembl homo_sapiens 88 --chromosomes 21 MT

- iCount annotation homo_sapiens -r 88
+ iCount annotation --source ensembl homo_sapiens 88

mkdir hs88
- iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 --annotation homo_sapiens.88.gtf.gz
+ iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 \
+ --annotation homo_sapiens.88.gtf.gz --genome_sasparsed 2 --genome_saindexnbases 13

# the whole data set [880 MB] is available here:
#wget http://icount.fri.uni-lj.si/data/20101116_LUjh03/\
16 changes: 14 additions & 2 deletions iCount/externals/star.py
@@ -58,7 +58,8 @@ def get_version():
return None


- def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1):
+ def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1,
+ genome_sasparsed=1, genome_saindexnbases=14):
"""
Call STAR to generate genome index, which is used for mapping.
@@ -74,9 +75,18 @@ def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=
Sequence length around annotated junctions to be used by STAR when
constructing splice junction database.
overhang_min : int
- TODO
+ Minimum overhang for unannotated junctions.
threads : int
Number of threads that STAR can use for generating index.
+ genome_sasparsed : int
+ STAR parameter genomeSAsparseD.
+ Suffix array sparsity. Bigger numbers decrease RAM requirements
+ at the cost of mapping speed reduction. Suggested values
+ are 1 (30 GB RAM) or 2 (16 GB RAM).
+ genome_saindexnbases : int
+ STAR parameter genomeSAindexNbases.
+ SA pre-indexing string length, typically between 10 and 15.
+ Longer strings require more memory, but result in faster searches.
Returns
-------
@@ -95,6 +105,8 @@ def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=
args = [
'STAR',
'--runThreadN', '{:d}'.format(threads),
+ '--genomeSAsparseD', '{:d}'.format(genome_sasparsed),
+ '--genomeSAindexNbases', '{:d}'.format(genome_saindexnbases),
'--runMode', 'genomeGenerate',
'--genomeDir', '{:s}'.format(genome_index),
'--genomeFastaFiles', '{:s}'.format(genome_fname2),
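Reviewer note: both new entries in the argument list are formatted with `'{:d}'`, so `build_index` expects integer values for them. The following is a minimal stand-alone sketch of the visible part of that list, not code from this commit. The function name `star_index_args` is hypothetical, and the other `build_index` arguments (annotation, overhang, overhang_min) are left out.

```python
def star_index_args(genome_fasta, genome_index, threads=1,
                    genome_sasparsed=1, genome_saindexnbases=14):
    """Hypothetical helper mirroring the STAR argument list in this hunk.

    The '{:d}' format code raises ValueError for non-integer input
    (for example the string '2'), so bad values fail before STAR is run.
    """
    return [
        'STAR',
        '--runThreadN', '{:d}'.format(threads),
        '--genomeSAsparseD', '{:d}'.format(genome_sasparsed),
        '--genomeSAindexNbases', '{:d}'.format(genome_saindexnbases),
        '--runMode', 'genomeGenerate',
        '--genomeDir', '{:s}'.format(genome_index),
        '--genomeFastaFiles', '{:s}'.format(genome_fasta),
    ]
```

Calling it with `genome_sasparsed=2, genome_saindexnbases=13` produces the same two STAR flags that the updated tutorial command requests.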
