Exposed parameters in CLI

genomewalker · Nov 26, 2024 · 00f8524 · 00f8524
1 parent cff7aad
commit 00f8524
Show file tree

Hide file tree

Showing 3 changed files with 133 additions and 59 deletions.
diff --git a/README.md b/README.md
@@ -83,8 +83,8 @@ Full list of options:
 
 ```bash
 $ filterBAM reassign --help
-usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT] [-A FLOAT] [-l INT] [-L INT] [-n INT] [--match-reward INT] [--mismatch-penalty INT]
-                          [--gap-open-penalty INT] [--gap-extension-penalty INT] [--lambda FLOAT] [-k FLOAT] [-o [FILE]] [-m STR] [-M INT] [-N] [--tmp-dir DIR] [--disable-sort]
+usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT] [-A FLOAT] [-l INT] [-L INT] [-n INT] [--match-reward INT] [--mismatch-penalty INT] [--gap-open-penalty INT] [--gap-extension-penalty INT] [--squarem-min-improvement FLOAT] [--squarem-max-step-factor FLOAT]
+                          [-o [FILE]] [-m STR] [-M INT] [-N] [--tmp-dir DIR] [--disable-sort]
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -116,8 +116,10 @@ Re-assign optional arguments:
                         Gap open penalty for alignment score computation (default: 5)
   --gap-extension-penalty INT
                         Gap extension penalty for the alignment score (default: 2)
-  --lambda FLOAT        Lambda parameter for the alignment score (default: 1.33)
-  -k FLOAT              K parameter for the alignment score algorithm (default: 0.621)
+  --squarem-min-improvement FLOAT
+                        Minimum relative improvement for SQUAREM convergence (default: 0.0001)
+  --squarem-max-step-factor FLOAT
+                        Maximum step size multiplier for SQUAREM stability (default: 4.0)
   -o [FILE], --out-bam [FILE]
                         Save a BAM file without multimapping reads (default: None)
   -m STR, --sort-memory STR
@@ -348,83 +350,117 @@ filterBAM lca --bam c55d4e2df1.dedup.filtered.bam --names ./taxonomy/names.dmp -
 
 **--scale**: Scale taxonomic abundance by this factor; suffix K/M recognized 
 
-## Read Reassignment Algorithm
+# Read Reassignment Algorithm in FilterBAM
 
-The algorithm implements a SQUAREM-accelerated Expectation-Maximization (EM) approach to resolve multi-mapping reads by iteratively refining alignment probabilities. This implementation builds on the SQUAREM method for EM acceleration (Varadhan & Roland, 2008). For each alignment, we first compute a global alignment score $S$:
+The algorithm implements a SQUAREM-accelerated Expectation-Maximization (EM) approach to resolve multi-mapping reads by iteratively refining alignment probabilities. This implementation builds on the SQUAREM method for EM acceleration (Varadhan & Roland, 2008).
 
-$$S = r_m M - p_m X - p_o G - p_e E$$
+## Initial Score Calculation
 
-where:
-- $M$ is the number of matches
-- $X$ is the number of mismatches
-- $G$ is the number of gap openings
-- $E$ is the number of gap extensions
+For each alignment, we first compute a global alignment score $S$ that takes into account different aspects of the alignment quality:
+
+$S = r_m M - p_m X - p_o G - p_e E$
+
+Where:
+- $M$ represents the number of matching bases in the alignment
+- $X$ represents the number of mismatches in the alignment
+- $G$ represents the number of gap openings in the alignment
+- $E$ represents the number of gap extensions in the alignment
+
+The penalties and rewards are configurable parameters:
 - $r_m$ is the match reward (default: 1)
 - $p_m$ is the mismatch penalty (default: -2)
 - $p_o$ is the gap open penalty (default: 5)
 - $p_e$ is the gap extension penalty (default: 2)
 
-The scores are normalized through a two-step process. First, we apply a positive shift transformation:
+## Score Normalization
+
+The scores undergo a two-step normalization process:
+
+1. First, a positive shift transformation ensures all scores are positive:
+
+   $S' = S - \min(S) + 1$
+
+2. Then, length normalization accounts for different alignment lengths:
+
+   $S'' = \frac{S'}{L}$
+
+   where $L$ is the alignment length
+
+These normalized scores are used to initialize the probability distribution $P(r_i|g_j)$ of read $r_i$ originating from genome $g_j$:
 
-$$S' = S - \min(S) + 1$$
+$P(r_i|g_j) = \frac{S''\_{ij}}{\sum_{k} S''\_{ik}}$
 
-followed by length normalization:
+## SQUAREM Acceleration Framework
 
-$$S'' = \frac{S'}{L}$$
+The algorithm uses SQUAREM to accelerate convergence of the EM process:
 
-where $L$ is the alignment length.
+### 1. E-step (First Order)
 
-These normalized scores initialize the probability distribution $P(r_i|g_j)$ of read $r_i$ originating from genome $g_j$:
+Calculate first EM evaluation:
 
-$$P(r_i|g_j) = \frac{S''\_{ij}}{\sum_{k} S''\_{ik}}$$
+$q_1 = M(x_k)$
 
-The SQUAREM acceleration framework then proceeds as follows:
+$r = q_1 - x_k$
 
-1. **E-step**: Calculate subject weights
-   
-   First EM evaluation:
-   
-   $$q_1 = M(x_k)$$
+Calculate second EM evaluation:
 
-   $$r = q_1 - x_k$$
+$q_2 = M(x_k + r)$
 
-   Second EM evaluation:
-   
-   $$q_2 = M(x_k + r)$$
-   
-   $$v = q_2 - (x_k + r)$$
+$v = q_2 - (x_k + r)$
 
-3. **M-step**: Compute stabilized step length
+Where:
+- $x_k$ is the current estimate
+- $M()$ is the EM mapping function
+- $r$ represents the first-order difference
+- $v$ represents the second-order difference
 
-   $$\alpha = -\frac{\|r\|}{\|v - r\|}$$
+### 2. M-step (Step Length Calculation)
 
-   $$\alpha = \text{clip}(\alpha, -\alpha_{\text{max}}, -\frac{1}{\alpha_{\text{max}}})$$
+Calculate the stabilized step length:
 
-   where $\alpha_{\text{max}}$ is the maximum step factor (default: 4.0)
+$\alpha = -\frac{\|r\|}{\|v - r\|}$
 
-4. **SQUAREM update**: Calculate new probabilities
+Apply clipping to ensure stability:
 
-   $$x_{k+1} = x_k - 2\alpha r + \alpha^2v$$
+$\alpha = \text{clip}(\alpha, -\alpha_{\text{max}}, -\frac{1}{\alpha_{\text{max}}})$
 
-   If the update produces invalid probabilities, fall back to basic EM step:
+Where:
+- $\alpha_{\text{max}}$ is the maximum step factor (default: 4.0)
+- $\|r\|$ represents the norm of the first-order difference
+- $\|v - r\|$ represents the norm of the difference between second and first-order differences
 
-   $$x_{k+1} = q_1$$
+### 3. SQUAREM Update
 
-6. **Assignment step**: For each read $r_i$, calculate:
+Calculate new probabilities:
 
-   $$P_{\text{max}}(r_i) = \max_j P''(r_i|g_j)$$
-   
-   Retain alignments satisfying:
+$x_{k+1} = x_k - 2\alpha r + \alpha^2v$
 
-   $$P''(r_i|g_j) \geq \beta \cdot P_{\text{max}}(r_i)$$
+If the update produces invalid probabilities (e.g., negative values or values > 1), the algorithm falls back to a basic EM step:
+
+$x_{k+1} = q_1$
+
+### 4. Assignment Step
+
+For each read $r_i$:
+
+1. Calculate maximum probability:
+
+   $P_{\text{max}}(r_i) = \max_j P''(r_i|g_j)$
+
+2. Retain alignments that satisfy:
+
+   $P''(r_i|g_j) \geq \beta \cdot P_{\text{max}}(r_i)$
 
    where $\beta$ is a scaling factor (default 0.9)
 
-The algorithm iterates until convergence, defined by one of these criteria:
-- Relative likelihood improvement < $\epsilon$ (default $10^{-4}$)
-- Maximum iterations reached (if specified)
-- Complete resolution of multi-mapping reads
-- No further alignments can be removed
+## Convergence Criteria
+
+The algorithm iterates until one of these conditions is met:
+
+1. Relative likelihood improvement < $\epsilon$ (default $10^{-4}$)
+2. Maximum iterations reached (if specified)
+3. Complete resolution of multi-mapping reads
+4. No further alignments can be removed
 
 ### Applications and recommendations
 

diff --git a/bam_filter/reassign.py b/bam_filter/reassign.py
@@ -404,9 +404,9 @@ def squarem_resolve_multimaps(
     iters=10,
     mmap_dir=None,
     max_memory=None,
-    min_improvement=1e-4,
     threads=None,
-    max_step_factor=4.0,  # Maximum step size multiplier for stability
+    min_improvement=1e-4,
+    max_step_factor=4.0,
 ):
     """SQUAREM-accelerated multimapping resolution with improved convergence.
 
@@ -933,6 +933,8 @@ def reassign_reads(
     disable_sort=False,
     tmp_dir=None,
     max_memory=None,
+    squarem_min_improvement=1e-4,
+    squarem_max_step_factor=4.0,
 ):
 
     p_threads, s_threads = allocate_threads(threads, 1, 4)
@@ -1208,6 +1210,8 @@ def reassign_reads(
         mmap_dir=tmp_dir.name,
         max_memory=total_memory,
         threads=threads,
+        min_improvement=squarem_min_improvement,
+        max_step_factor=squarem_max_step_factor,
     )
 
     n_reads = len(list(set(no_multimaps["source"])))
@@ -1356,6 +1360,8 @@ def reassign(args):
         gap_extension_penalty=args.gap_extension_penalty,
         disable_sort=args.disable_sort,
         tmp_dir=tmp_dir,
+        squarem_min_improvement=args.squarem_min_improvement,
+        squarem_max_step_factor=args.squarem_max_step_factor,
     )
 
     log.info("Done!")
diff --git a/bam_filter/utils.py b/bam_filter/utils.py
@@ -547,6 +547,8 @@ def check_lca_ranks(val, parser, var):
     "reassign_gap_extension_penalty": 2,
     "rank_lca": "species",
     "lca_summary": None,
+    "squarem_min_improvement": 1e-4,
+    "squarem_max_step_factor": 4.0,
 }
 
 help_msg = {
@@ -606,6 +608,8 @@ def check_lca_ranks(val, parser, var):
     "custom": "Use custom taxdump files",
     "version": "Print program version",
     "max_memory": "Maximum memory to use for the EM algorithm",
+    "squarem_min_improvement": "Minimum relative improvement for SQUAREM convergence",
+    "squarem_max_step_factor": "Maximum step size multiplier for SQUAREM stability",
 }
 
 from difflib import get_close_matches, SequenceMatcher
@@ -655,15 +659,12 @@ def _get_available_arguments(self, parser=None):
         for action in parser._actions:
             if action.option_strings:
                 arguments.extend(action.option_strings)
-        return sorted(arguments)  # Sort for consistent output
-
-    def print_help(self, file=None):
-        pass
+        return sorted(arguments)
 
     def _print_message(self, message, file=None):
-        if message and "error:" in message:
+        if message:
             if file is None:
-                file = sys.stderr
+                file = sys.stderr if "error:" in message else sys.stdout
             file.write(message)
 
     def _clean_error_message(self, message):
@@ -697,8 +698,6 @@ def error(self, message):
                                 if arg.startswith("-"):
                                     flag = arg.split("=")[0]
                                     if flag not in available_args:
-                                        # Group similar arguments together
-                                        parts = flag.lstrip("-").split("-")
                                         suggestion = self._get_close_matches(
                                             flag, available_args, n=3, cutoff=0.65
                                         )
@@ -966,6 +965,39 @@ def get_arguments(argv=None):
         dest="gap_extension_penalty",
         help=help_msg["reassign_gap_extension_penalty"],
     )
+    reassign_optional_args.add_argument(
+        "--squarem-min-improvement",
+        type=lambda x: float(
+            check_values(
+                x,
+                minval=1e-10,
+                maxval=1.0,
+                parser=parser,
+                var="--squarem-min-improvement",
+            )
+        ),
+        default=defaults["squarem_min_improvement"],
+        metavar="FLOAT",
+        dest="squarem_min_improvement",
+        help=help_msg["squarem_min_improvement"],
+    )
+
+    reassign_optional_args.add_argument(
+        "--squarem-max-step-factor",
+        type=lambda x: float(
+            check_values(
+                x,
+                minval=1.0,
+                maxval=10.0,
+                parser=parser,
+                var="--squarem-max-step-factor",
+            )
+        ),
+        default=defaults["squarem_max_step_factor"],
+        metavar="FLOAT",
+        dest="squarem_max_step_factor",
+        help=help_msg["squarem_max_step_factor"],
+    )
     reassign_optional_args.add_argument(
         "-o",
         "--out-bam",