Skip to content

Commit

Permalink
Exposed parameters in CLI
Browse files Browse the repository at this point in the history
  • Loading branch information
genomewalker committed Nov 26, 2024
1 parent cff7aad commit 00f8524
Show file tree
Hide file tree
Showing 3 changed files with 133 additions and 59 deletions.
134 changes: 85 additions & 49 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,8 +83,8 @@ Full list of options:

```bash
$ filterBAM reassign --help
usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT] [-A FLOAT] [-l INT] [-L INT] [-n INT] [--match-reward INT] [--mismatch-penalty INT]
[--gap-open-penalty INT] [--gap-extension-penalty INT] [--lambda FLOAT] [-k FLOAT] [-o [FILE]] [-m STR] [-M INT] [-N] [--tmp-dir DIR] [--disable-sort]
usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT] [-A FLOAT] [-l INT] [-L INT] [-n INT] [--match-reward INT] [--mismatch-penalty INT] [--gap-open-penalty INT] [--gap-extension-penalty INT] [--squarem-min-improvement FLOAT] [--squarem-max-step-factor FLOAT]
[-o [FILE]] [-m STR] [-M INT] [-N] [--tmp-dir DIR] [--disable-sort]

optional arguments:
-h, --help show this help message and exit
Expand Down Expand Up @@ -116,8 +116,10 @@ Re-assign optional arguments:
Gap open penalty for alignment score computation (default: 5)
--gap-extension-penalty INT
Gap extension penalty for the alignment score (default: 2)
--lambda FLOAT Lambda parameter for the alignment score (default: 1.33)
-k FLOAT K parameter for the alignment score algorithm (default: 0.621)
--squarem-min-improvement FLOAT
Minimum relative improvement for SQUAREM convergence (default: 0.0001)
--squarem-max-step-factor FLOAT
Maximum step size multiplier for SQUAREM stability (default: 4.0)
-o [FILE], --out-bam [FILE]
Save a BAM file without multimapping reads (default: None)
-m STR, --sort-memory STR
Expand Down Expand Up @@ -348,83 +350,117 @@ filterBAM lca --bam c55d4e2df1.dedup.filtered.bam --names ./taxonomy/names.dmp -
**--scale**: Scale taxonomic abundance by this factor; suffix K/M recognized
## Read Reassignment Algorithm
# Read Reassignment Algorithm in FilterBAM
The algorithm implements a SQUAREM-accelerated Expectation-Maximization (EM) approach to resolve multi-mapping reads by iteratively refining alignment probabilities. This implementation builds on the SQUAREM method for EM acceleration (Varadhan & Roland, 2008). For each alignment, we first compute a global alignment score $S$:
The algorithm implements a SQUAREM-accelerated Expectation-Maximization (EM) approach to resolve multi-mapping reads by iteratively refining alignment probabilities. This implementation builds on the SQUAREM method for EM acceleration (Varadhan & Roland, 2008).
$$S = r_m M - p_m X - p_o G - p_e E$$
## Initial Score Calculation
where:
- $M$ is the number of matches
- $X$ is the number of mismatches
- $G$ is the number of gap openings
- $E$ is the number of gap extensions
For each alignment, we first compute a global alignment score $S$ that takes into account different aspects of the alignment quality:
$S = r_m M - p_m X - p_o G - p_e E$
Where:
- $M$ represents the number of matching bases in the alignment
- $X$ represents the number of mismatches in the alignment
- $G$ represents the number of gap openings in the alignment
- $E$ represents the number of gap extensions in the alignment
The penalties and rewards are configurable parameters:
- $r_m$ is the match reward (default: 1)
- $p_m$ is the mismatch penalty (default: -2)
- $p_o$ is the gap open penalty (default: 5)
- $p_e$ is the gap extension penalty (default: 2)
The scores are normalized through a two-step process. First, we apply a positive shift transformation:
## Score Normalization
The scores undergo a two-step normalization process:
1. First, a positive shift transformation ensures all scores are positive:
$S' = S - \min(S) + 1$
2. Then, length normalization accounts for different alignment lengths:
$S'' = \frac{S'}{L}$
where $L$ is the alignment length
These normalized scores are used to initialize the probability distribution $P(r_i|g_j)$ of read $r_i$ originating from genome $g_j$:
$$S' = S - \min(S) + 1$$
$P(r_i|g_j) = \frac{S''\_{ij}}{\sum_{k} S''\_{ik}}$
followed by length normalization:
## SQUAREM Acceleration Framework
$$S'' = \frac{S'}{L}$$
The algorithm uses SQUAREM to accelerate convergence of the EM process:
where $L$ is the alignment length.
### 1. E-step (First Order)
These normalized scores initialize the probability distribution $P(r_i|g_j)$ of read $r_i$ originating from genome $g_j$:
Calculate first EM evaluation:
$$P(r_i|g_j) = \frac{S''\_{ij}}{\sum_{k} S''\_{ik}}$$
$q_1 = M(x_k)$
The SQUAREM acceleration framework then proceeds as follows:
$r = q_1 - x_k$
1. **E-step**: Calculate subject weights
First EM evaluation:
$$q_1 = M(x_k)$$
Calculate second EM evaluation:
$$r = q_1 - x_k$$
$q_2 = M(x_k + r)$
Second EM evaluation:
$$q_2 = M(x_k + r)$$
$$v = q_2 - (x_k + r)$$
$v = q_2 - (x_k + r)$
3. **M-step**: Compute stabilized step length
Where:
- $x_k$ is the current estimate
- $M()$ is the EM mapping function
- $r$ represents the first-order difference
- $v$ represents the second-order difference
$$\alpha = -\frac{\|r\|}{\|v - r\|}$$
### 2. M-step (Step Length Calculation)
$$\alpha = \text{clip}(\alpha, -\alpha_{\text{max}}, -\frac{1}{\alpha_{\text{max}}})$$
Calculate the stabilized step length:
where $\alpha_{\text{max}}$ is the maximum step factor (default: 4.0)
$\alpha = -\frac{\|r\|}{\|v - r\|}$
4. **SQUAREM update**: Calculate new probabilities
Apply clipping to ensure stability:
$$x_{k+1} = x_k - 2\alpha r + \alpha^2v$$
$\alpha = \text{clip}(\alpha, -\alpha_{\text{max}}, -\frac{1}{\alpha_{\text{max}}})$
If the update produces invalid probabilities, fall back to basic EM step:
Where:
- $\alpha_{\text{max}}$ is the maximum step factor (default: 4.0)
- $\|r\|$ represents the norm of the first-order difference
- $\|v - r\|$ represents the norm of the difference between second and first-order differences
$$x_{k+1} = q_1$$
### 3. SQUAREM Update
6. **Assignment step**: For each read $r_i$, calculate:
Calculate new probabilities:
$$P_{\text{max}}(r_i) = \max_j P''(r_i|g_j)$$
Retain alignments satisfying:
$x_{k+1} = x_k - 2\alpha r + \alpha^2v$
$$P''(r_i|g_j) \geq \beta \cdot P_{\text{max}}(r_i)$$
If the update produces invalid probabilities (e.g., negative values or values > 1), the algorithm falls back to a basic EM step:
$x_{k+1} = q_1$
### 4. Assignment Step
For each read $r_i$:
1. Calculate maximum probability:
$P_{\text{max}}(r_i) = \max_j P''(r_i|g_j)$
2. Retain alignments that satisfy:
$P''(r_i|g_j) \geq \beta \cdot P_{\text{max}}(r_i)$
where $\beta$ is a scaling factor (default 0.9)
The algorithm iterates until convergence, defined by one of these criteria:
- Relative likelihood improvement < $\epsilon$ (default $10^{-4}$)
- Maximum iterations reached (if specified)
- Complete resolution of multi-mapping reads
- No further alignments can be removed
## Convergence Criteria
The algorithm iterates until one of these conditions is met:
1. Relative likelihood improvement < $\epsilon$ (default $10^{-4}$)
2. Maximum iterations reached (if specified)
3. Complete resolution of multi-mapping reads
4. No further alignments can be removed
### Applications and recommendations
Expand Down
10 changes: 8 additions & 2 deletions bam_filter/reassign.py
Original file line number Diff line number Diff line change
Expand Up @@ -404,9 +404,9 @@ def squarem_resolve_multimaps(
iters=10,
mmap_dir=None,
max_memory=None,
min_improvement=1e-4,
threads=None,
max_step_factor=4.0, # Maximum step size multiplier for stability
min_improvement=1e-4,
max_step_factor=4.0,
):
"""SQUAREM-accelerated multimapping resolution with improved convergence.
Expand Down Expand Up @@ -933,6 +933,8 @@ def reassign_reads(
disable_sort=False,
tmp_dir=None,
max_memory=None,
squarem_min_improvement=1e-4,
squarem_max_step_factor=4.0,
):

p_threads, s_threads = allocate_threads(threads, 1, 4)
Expand Down Expand Up @@ -1208,6 +1210,8 @@ def reassign_reads(
mmap_dir=tmp_dir.name,
max_memory=total_memory,
threads=threads,
min_improvement=squarem_min_improvement,
max_step_factor=squarem_max_step_factor,
)

n_reads = len(list(set(no_multimaps["source"])))
Expand Down Expand Up @@ -1356,6 +1360,8 @@ def reassign(args):
gap_extension_penalty=args.gap_extension_penalty,
disable_sort=args.disable_sort,
tmp_dir=tmp_dir,
squarem_min_improvement=args.squarem_min_improvement,
squarem_max_step_factor=args.squarem_max_step_factor,
)

log.info("Done!")
48 changes: 40 additions & 8 deletions bam_filter/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -547,6 +547,8 @@ def check_lca_ranks(val, parser, var):
"reassign_gap_extension_penalty": 2,
"rank_lca": "species",
"lca_summary": None,
"squarem_min_improvement": 1e-4,
"squarem_max_step_factor": 4.0,
}

help_msg = {
Expand Down Expand Up @@ -606,6 +608,8 @@ def check_lca_ranks(val, parser, var):
"custom": "Use custom taxdump files",
"version": "Print program version",
"max_memory": "Maximum memory to use for the EM algorithm",
"squarem_min_improvement": "Minimum relative improvement for SQUAREM convergence",
"squarem_max_step_factor": "Maximum step size multiplier for SQUAREM stability",
}

from difflib import get_close_matches, SequenceMatcher
Expand Down Expand Up @@ -655,15 +659,12 @@ def _get_available_arguments(self, parser=None):
for action in parser._actions:
if action.option_strings:
arguments.extend(action.option_strings)
return sorted(arguments) # Sort for consistent output

def print_help(self, file=None):
pass
return sorted(arguments)

def _print_message(self, message, file=None):
if message and "error:" in message:
if message:
if file is None:
file = sys.stderr
file = sys.stderr if "error:" in message else sys.stdout
file.write(message)

def _clean_error_message(self, message):
Expand Down Expand Up @@ -697,8 +698,6 @@ def error(self, message):
if arg.startswith("-"):
flag = arg.split("=")[0]
if flag not in available_args:
# Group similar arguments together
parts = flag.lstrip("-").split("-")
suggestion = self._get_close_matches(
flag, available_args, n=3, cutoff=0.65
)
Expand Down Expand Up @@ -966,6 +965,39 @@ def get_arguments(argv=None):
dest="gap_extension_penalty",
help=help_msg["reassign_gap_extension_penalty"],
)
reassign_optional_args.add_argument(
"--squarem-min-improvement",
type=lambda x: float(
check_values(
x,
minval=1e-10,
maxval=1.0,
parser=parser,
var="--squarem-min-improvement",
)
),
default=defaults["squarem_min_improvement"],
metavar="FLOAT",
dest="squarem_min_improvement",
help=help_msg["squarem_min_improvement"],
)

reassign_optional_args.add_argument(
"--squarem-max-step-factor",
type=lambda x: float(
check_values(
x,
minval=1.0,
maxval=10.0,
parser=parser,
var="--squarem-max-step-factor",
)
),
default=defaults["squarem_max_step_factor"],
metavar="FLOAT",
dest="squarem_max_step_factor",
help=help_msg["squarem_max_step_factor"],
)
reassign_optional_args.add_argument(
"-o",
"--out-bam",
Expand Down

0 comments on commit 00f8524

Please sign in to comment.