star: Add optional arguments to control memory usage, related to #134

The new arguments genome_sasparsed and genome_saindexnbases let the user control STAR's memory requirements when generating the genome index.
tomazc · Sep 20, 2017 · 14c0421
1 parent 915c976

Showing 5 changed files with 42 additions and 16 deletions.
diff --git a/docs/source/ref_CLI.txt b/docs/source/ref_CLI.txt
@@ -241,7 +241,8 @@ indexstar
 =========
 
 usage: iCount indexstar [-h] [-a] [--overhang] [--overhang_min] [--threads]
-                        [-S] [-F] [-P] [-M]
+                        [--genome_sasparsed] [--genome_saindexnbases] [-S]
+                        [-F] [-P] [-M]
                         genome genome_index
 
 Generate STAR genome index.
@@ -255,8 +256,16 @@ optional arguments:
   -a , --annotation     Annotation that defines splice junctions (default: )
   --overhang            Sequence length around annotated junctions to be used by STAR when
                         constructing splice junction database (default: 100)
-  --overhang_min        TODO (default: 8)
+  --overhang_min        Minimum overhang for unannotated junctions (default: 8)
   --threads             Number of threads that STAR can use for generating index (default: 1)
+  --genome_sasparsed    STAR parameter genomeSAsparseD.
+                        Suffix array sparsity. Bigger numbers decrease RAM requirements
+                        at the cost of mapping speed reduction. Suggested values
+                        are 1 (30 GB RAM) or 2 (16 GB RAM) (default: 1)
+  --genome_saindexnbases 
+                        STAR parameter genomeSAindexNbases.
+                        SA pre-indexing string length, typically between 10 and 15.
+                        Longer strings require more memory, but result in faster searches (default: 14)
   -S , --stdout_log     Threshold value (0-50) for logging to stdout. If 0, logging to stdout is turned OFF.
   -F , --file_log       Threshold value (0-50) for logging to file. If 0, logging to file is turned OFF.
   -P , --file_logpath   Path to log file.

diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
@@ -33,15 +33,15 @@ iCLIP sequencing reads must be mapped to a reference genome. The user can prepar
 Another option is to download a release from `ensembl`_. You can use the command ``releases`` to
 get a list of available releases supported by **iCount**::
 
-    $ iCount releases
+    $ iCount releases --source ensembl
 
     There are 30 releases available: 88,87,86,85,84,83,82,81,80,79,78,77,76,75,74,73,
     72,71,70,69,68,67,66,65,64,63,62,61,60,59
 
 
 You can then use the command ``species`` to get a list of species available in a release::
 
-    $ iCount species -r 88
+    $ iCount species --source ensembl -r 88
 
     There are 87 species available: ailuropoda_melanoleuca,anas_platyrhynchos,
     ancestral_alleles,anolis_carolinensis,astyanax_mexicanus,bos_taurus,
@@ -55,7 +55,7 @@ You can then use the command ``species`` to get a list of species available in a
 
 Let's download the human genome sequence from release 88::
 
-    $ iCount genome homo_sapiens -r 88 --chromosomes 21 MT
+    $ iCount genome --source ensembl homo_sapiens -r 88 --chromosomes 21 MT
 
     Downloading FASTA file into: /..././homo_sapiens.88.chr21_MT.fa.gz
     Fai file saved to : /..././iCount/homo_sapiens.88.chr21_MT.fa.gz.fai
@@ -67,7 +67,7 @@ Let's download the human genome sequence from release 88::
 
 And the annotation of the human genome from release 88::
 
-    $ iCount annotation homo_sapiens -r 88
+    $ iCount annotation --source ensembl homo_sapiens -r 88
 
     Downloading GTF to: /..././homo_sapiens.88.gtf.gz
     Done.
@@ -77,7 +77,7 @@ The next step is to generate a genome index that is used by `STAR`_ mapper. Let'
 
     $ mkdir hs88  # folder should be empty
     $ iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 \
-    --annotation homo_sapiens.88.gtf.gz
+    --annotation homo_sapiens.88.gtf.gz --genome_sasparsed 2 --genome_saindexnbases 13
 
     Building genome index with STAR for genome homo_sapiens.88.fa.gz
     <timestamp> ..... Started STAR run
@@ -99,6 +99,10 @@ The next step is to generate a genome index that is used by `STAR`_ mapper. Let'
     A subfolder ``hs88`` will be created in current working directory. You can specify
     alternative relative or absolute paths, e.g., ``indexes/hs88``.
 
+.. note::
+    Increasing ``genome_sasparsed`` and decreasing ``genome_saindexnbases`` lowers memory
+    requirements at the cost of longer run times.
+
 We are now ready to start mapping iCLIP data to the human genome!
 
 .. _`ensembl`:

diff --git a/iCount/cli.py b/iCount/cli.py
@@ -117,7 +117,7 @@ def _extract_parameter_data(function):
 
     Every parameter in returned object can have the following entries:
 
-        * name - the name of parameter, preceeded by '--' if it is optional
+        * name - the name of parameter, preceded by '--' if it is optional
         * default - the default value (only for optional parameters). Extracted
           from function signature.
         * type - type of parameter, extracted from function docstring. If not
@@ -391,7 +391,7 @@ def verbose_help(mode):
 
     # all_args command:
     def all_args():
-        """Print all posssible parameter names and CLI commands where they are used."""
+        """Print all possible parameter names and CLI commands where they are used."""
         for param_name, commands in sorted(PARAMETERS.items(), key=lambda x: x[0].lstrip('-')):
             if param_name in SHORT_OPTARG_NAMES:
                 short_name = ' ({})'.format(SHORT_OPTARG_NAMES[param_name])

diff --git a/iCount/examples/tutorial.sh b/iCount/examples/tutorial.sh
@@ -4,16 +4,17 @@ set -vx
 mkdir tutorial_example
 cd tutorial_example
 
-iCount releases
+iCount releases --source ensembl
 
-iCount species -r 88
+iCount species --source ensembl -r 88
 
-iCount genome homo_sapiens -r 88 --chromosomes 21 MT
+iCount genome --source ensembl homo_sapiens 88 --chromosomes 21 MT
 
-iCount annotation homo_sapiens -r 88
+iCount annotation --source ensembl homo_sapiens 88
 
 mkdir hs88
-iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 --annotation homo_sapiens.88.gtf.gz
+iCount indexstar homo_sapiens.88.chr21_MT.fa.gz hs88 \
+--annotation homo_sapiens.88.gtf.gz --genome_sasparsed 2 --genome_saindexnbases 13
 
 # the whole data set [880 MB] is available here:
 #wget http://icount.fri.uni-lj.si/data/20101116_LUjh03/\

diff --git a/iCount/externals/star.py b/iCount/externals/star.py
@@ -58,7 +58,8 @@ def get_version():
         return None
 
 
-def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1):
+def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=8, threads=1,
+                genome_sasparsed=1, genome_saindexnbases=14):
     """
     Call STAR to generate genome index, which is used for mapping.
 
@@ -74,9 +75,18 @@ def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=
         Sequence length around annotated junctions to be used by STAR when
         constructing splice junction database.
     overhang_min : int
-        TODO
+        Minimum overhang for unannotated junctions.
     threads : int
         Number of threads that STAR can use for generating index.
+    genome_sasparsed : int
+        STAR parameter genomeSAsparseD.
+        Suffix array sparsity. Bigger numbers decrease RAM requirements
+        at the cost of mapping speed reduction. Suggested values
+        are 1 (30 GB RAM) or 2 (16 GB RAM).
+    genome_saindexnbases : int
+        STAR parameter genomeSAindexNbases.
+        SA pre-indexing string length, typically between 10 and 15.
+        Longer strings require more memory, but result in faster searches.
 
     Returns
     -------
@@ -95,6 +105,8 @@ def build_index(genome, genome_index, annotation='', overhang=100, overhang_min=
     args = [
         'STAR',
         '--runThreadN', '{:d}'.format(threads),
+        '--genomeSAsparseD', '{:d}'.format(genome_sasparsed),
+        '--genomeSAindexNbases', '{:d}'.format(genome_saindexnbases),
         '--runMode', 'genomeGenerate',
         '--genomeDir', '{:s}'.format(genome_index),
         '--genomeFastaFiles', '{:s}'.format(genome_fname2),
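
The argument list above can be illustrated with a minimal, standalone sketch. It is not the code from `iCount/externals/star.py`: the helper name `star_index_args` and its range checks are assumptions made for illustration, and only the flag names and defaults are taken from the diff. The sketch shows how the two new keyword arguments would end up as `--genomeSAsparseD` and `--genomeSAindexNbases` on the STAR command line.

```python
def star_index_args(genome_fasta, genome_index, threads=1,
                    genome_sasparsed=1, genome_saindexnbases=14):
    """Build a STAR argument list for index generation.

    Hypothetical helper, not part of iCount. Flag names and defaults
    follow the diff above; the validation below is an assumption.
    """
    # STAR expects a positive suffix array sparsity; larger values lower RAM use.
    if genome_sasparsed < 1:
        raise ValueError('genome_sasparsed must be a positive integer')
    # Assumed sanity check; the docstring only says "typically between 10 and 15".
    if genome_saindexnbases < 1:
        raise ValueError('genome_saindexnbases must be a positive integer')

    return [
        'STAR',
        '--runThreadN', '{:d}'.format(threads),
        '--genomeSAsparseD', '{:d}'.format(genome_sasparsed),
        '--genomeSAindexNbases', '{:d}'.format(genome_saindexnbases),
        '--runMode', 'genomeGenerate',
        '--genomeDir', '{:s}'.format(genome_index),
        '--genomeFastaFiles', '{:s}'.format(genome_fasta),
    ]


if __name__ == '__main__':
    # Same values as in the tutorial example: lower memory, slower mapping.
    args = star_index_args('homo_sapiens.88.chr21_MT.fa', 'hs88',
                           genome_sasparsed=2, genome_saindexnbases=13)
    print(' '.join(args))
```

Formatting with `{:d}` rejects non-integer values such as strings, so a mistyped value would fail before STAR is invoked.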