Merge pull request #23 from genomewalker/bitscore
genomewalker authored Mar 1, 2024
2 parents 95c712a + b7b6e95 commit 76a0e26
Showing 4 changed files with 188 additions and 28 deletions.
48 changes: 30 additions & 18 deletions README.md
@@ -6,18 +6,24 @@


A simple tool to process a BAM file, filter references with uneven coverage, and estimate taxonomic abundances. FilterBAM has three main goals:
1. Reassign reads to the reference they belong to using an E-M algorithm that takes the alignment score into account. The alignment score is calculated with the same equation as the [BLAST bit score](https://www.ncbi.nlm.nih.gov/books/NBK62051/), using the information available in the BAM file, as follows:

$$
\begin{align*}
&\hspace{15pt}\text{Alignment Score} = \frac{\lambda \times S - \log(K)}{\log(2)} \\
&\hspace{15pt}\text{where:} \\
&\hspace{15pt}S = (\text{Number of matches} \times \text{Match reward}) - (\text{Number of mismatches} \times \text{Mismatch penalty}) \\
&\hspace{35pt} - (\text{Number of gaps} \times \text{Gap open penalty}) - (\text{Gap extensions} \times \text{Gap extension penalty})
\end{align*}
$$

* The number of gap opens and gap extensions are obtained from the BAM tags **XO** and **XG**, respectively, if present.
* **Match reward**, **Mismatch penalty**, **Gap open penalty**, and **Gap extension penalty** are the scores for a match, a mismatch, opening a gap, and extending a gap, respectively.
* **lambda** and **K** are parameters that depend on the scoring system (substitution matrix and gap costs) employed. Check [here](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c) if you want to use a different scoring system.

Where:
- `Identity` is the percentage identity of the alignment.
- `Read Length` is the length of the read.
2. Estimate several metrics for each reference in the BAM file and filter those references that do not meet the defined criteria.
3. Perform an LCA analysis using the reads that passed the filtering criteria and estimate the taxonomic abundances of each rank by normalizing the number of reads by the length of the reference. We resolve the most likely reference using a likelihood-based approach: the likelihood is calculated for potential taxonomic paths from the partial taxonomic assignment of each read to its descendants in the taxonomy tree, the paths are ranked by likelihood, and the most probable reference is selected for each partial rank. It can also use the TAD (truncated average depth) read estimates for the LCA analysis, minimizing the effect of uneven coverage across the reference.
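The bit-score equation above can be sketched in a few lines of Python. This is an illustration, not the package's implementation: the function name is hypothetical, the defaults mirror the CLI defaults of `filterBAM reassign`, and the penalties are assumed to be positive magnitudes that are subtracted.

```python
import math

def blast_bit_score(matches, mismatches, gap_opens, gap_extensions,
                    match_reward=1, mismatch_penalty=2, gap_open_penalty=5,
                    gap_extension_penalty=2, lam=1.33, k=0.621):
    # Raw score S: reward matches, subtract mismatch and gap costs
    s = (matches * match_reward
         - mismatches * mismatch_penalty
         - gap_opens * gap_open_penalty
         - gap_extensions * gap_extension_penalty)
    # Normalize into bits with the Karlin-Altschul parameters lambda and K
    return (lam * s - math.log(k)) / math.log(2)

# A 100 bp read with 95 matches, 5 mismatches, and no gaps
print(round(blast_bit_score(95, 5, 0, 0), 1))  # ≈ 163.8
```

Reads with more matches (or fewer gaps) score strictly higher, which is what lets the E-M step prefer the better-supported reference.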

# Installation

@@ -38,24 +44,18 @@ Then we proceed to install using pip:
pip install bam-filter
```

### Install from source to use the development version

Using pip

```bash
pip install git+https://github.com/genomewalker/bam-filter.git
```

By cloning into a dedicated conda environment

```bash
git clone https://github.com/genomewalker/bam-filter.git
cd bam-filter
conda env create -f environment.yml
conda activate bam-filter
@@ -94,7 +94,10 @@ Full list of options:
```bash
$ filterBAM reassign --help
usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT]
[-A FLOAT] [-l INT] [-n INT] [--match-reward INT]
[--mismatch-penalty INT] [--gap-open-penalty INT]
[--gap-extension-penalty INT] [--lambda FLOAT] [-k FLOAT] [-o [FILE]]
[-m STR] [-N] [--tmp-dir DIR]

optional arguments:
-h, --help show this help message and exit
@@ -117,6 +120,15 @@ Re-assign optional arguments:
Minimum read length (default: 30)
-n INT, --min-read-count INT
Minimum read count (default: 3)
--match-reward INT Match reward for the alignment score (default: 1)
--mismatch-penalty INT
Mismatch penalty for the alignment score (default: -2)
--gap-open-penalty INT
Gap open penalty for the alignment score (default: 5)
--gap-extension-penalty INT
Gap extension penalty for the alignment score (default: 2)
--lambda FLOAT Lambda parameter for the alignment score (default: 1.33)
-k FLOAT K parameter for the alignment score algorithm (default: 0.621)
-o [FILE], --out-bam [FILE]
Save a BAM file without multimapping reads (default: None)
-m STR, --sort-memory STR
4 changes: 3 additions & 1 deletion bam_filter/lca.py
@@ -257,7 +257,9 @@ def get_taxonomy_info(refids, taxdb, acc2taxid, nprocs=1, custom=False):
else:
cols = ["accession.version", "taxid"]
name = "accession.version"

dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
dt.options.nthreads = nprocs
filtered_df = dt.fread(acc2taxid, verbose=False, nthreads=nprocs)
filt = dt.Frame(refids, names=[name])
filt["FOO"] = 1
78 changes: 71 additions & 7 deletions bam_filter/reassign.py
@@ -55,7 +55,7 @@ def resolve_multimaps(data, scale=0.9, iters=10):
progress_bar = tqdm.tqdm(
total=9,
desc=f"Iter {current_iter + 1}",
unit=" step",
disable=False, # Replace with your logic or a boolean value
leave=False,
ncols=80,
@@ -422,11 +422,44 @@ def write_reassigned_bam(
os.remove(out_files["bam_reassigned_tmp"])


# def calculate_alignment_score(identity, read_length):
# return (identity / math.log(read_length)) * math.sqrt(read_length)

# Values from:
# https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c
def calculate_alignment_score(
    num_matches,
    num_mismatches,
    num_gaps,
    gap_extensions,
    match_reward,
    mismatch_penalty,
    gap_open_penalty,
    gap_extension_penalty,
    lambda_value,
    K_value,
):
    # Raw alignment score from the match/mismatch/gap counts
    S = (
        (num_matches * match_reward)
        - (num_mismatches * mismatch_penalty)
        - (num_gaps * gap_open_penalty)
        - (gap_extensions * gap_extension_penalty)
    )

    # Approximate bit score (BLAST formula)
    bit_score = (lambda_value * S - math.log(K_value)) / math.log(2)

    return bit_score

def get_bam_data(
parms,
ref_lengths=None,
percid=90,
min_read_length=30,
threads=1,
match_reward=1,
mismatch_penalty=-1,
gap_open_penalty=1,
gap_extension_penalty=2,
lambda_value=1.02,
K_value=0.21
):
bam, references = parms
dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
@@ -464,8 +497,12 @@ def get_bam_data(
continue

pident = (1 - ((aln.get_tag("NM") / query_length))) * 100
num_mismatches = aln.get_tag("NM")
num_matches = query_length - num_mismatches
num_gaps = aln.get_tag("XO") if aln.has_tag("XO") else 0
gap_extensions = aln.get_tag("XG") if aln.has_tag("XG") else 0
if pident >= percid:
var = calculate_alignment_score(
    num_matches=num_matches,
    num_mismatches=num_mismatches,
    num_gaps=num_gaps,
    gap_extensions=gap_extensions,
    match_reward=match_reward,
    mismatch_penalty=mismatch_penalty,
    gap_open_penalty=gap_open_penalty,
    gap_extension_penalty=gap_extension_penalty,
    lambda_value=lambda_value,
    K_value=K_value,
)
aln_data.append(
(
aln.query_name,
Expand Down Expand Up @@ -498,6 +535,12 @@ def get_bam_data(parms, ref_lengths=None, percid=90, min_read_length=30, threads
def reassign_reads(
bam,
out_files,
match_reward,
mismatch_penalty,
gap_open_penalty,
gap_extension_penalty,
lambda_value,
K_value,
reference_lengths=None,
threads=1,
min_read_count=1,
@@ -586,7 +629,14 @@ def reassign_reads(
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
@@ -608,7 +658,15 @@
get_bam_data,
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
@@ -904,5 +962,11 @@ def reassign(args):
sort_memory=args.sort_memory,
sort_by_name=args.sort_by_name,
out_files=out_files,
match_reward=args.match_reward,
mismatch_penalty=args.mismatch_penalty,
gap_open_penalty=args.gap_open_penalty,
gap_extension_penalty=args.gap_extension_penalty,
lambda_value=args.lambda_value,
K_value=args.K_value,
)
log.info("Done!")
86 changes: 84 additions & 2 deletions bam_filter/utils.py
@@ -415,7 +415,12 @@ def check_lca_ranks(val, parser, var):
"tmp_dir": None,
"reassign_iters": 25,
"reassign_scale": 0.9,
"reassign_match_reward": 1,
"reassign_mismatch_penalty": -2,
"reassign_gap_open_penalty": 5,
"reassign_gap_extension_penalty": 2,
"reassign_lambda": 1.33,
"reassign_k": 0.621,
"rank_lca": "species",
"lca_summary": None,
}
@@ -461,7 +466,12 @@ def check_lca_ranks(val, parser, var):
"reassign_method": "Method for the EM algorithm",
"reassign_iters": "Number of iterations for the EM algorithm",
"reassign_scale": "Scale to select the best weighted alignments",
"reassign_match_reward": "Match reward for the alignment score",
"reassign_mismatch_penalty": "Mismatch penalty for the alignment score",
"reassign_gap_open_penalty": "Gap open penalty for alignment score computation",
"reassign_gap_extension_penalty": "Gap extension penalty for the alignment score",
"reassign_lambda": "Lambda parameter for the alignment score",
"reassign_k": "K parameter for the alignment score algorithm",
"lca": "Calculate LCA for each read and estimate abundances",
"names": "Names dmp file from taxonomy",
"nodes": "Nodes dmp file from taxonomy",
@@ -658,6 +668,78 @@ def get_arguments(argv=None):
dest="min_read_count",
help=help_msg["min_read_count"],
)
reassign_optional_args.add_argument(
"--match-reward",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--match-reward"
)
),
default=defaults["reassign_match_reward"],
metavar="INT",
dest="match_reward",
help=help_msg["reassign_match_reward"],
)
reassign_optional_args.add_argument(
"--mismatch-penalty",
type=lambda x: int(
check_values(
x, minval=-np.Inf, maxval=0, parser=parser, var="--mismatch-penalty"
)
),
default=defaults["reassign_mismatch_penalty"],
metavar="INT",
dest="mismatch_penalty",
help=help_msg["reassign_mismatch_penalty"],
)
reassign_optional_args.add_argument(
"--gap-open-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-open-penalty"
)
),
default=defaults["reassign_gap_open_penalty"],
metavar="INT",
dest="gap_open_penalty",
help=help_msg["reassign_gap_open_penalty"],
)
reassign_optional_args.add_argument(
"--gap-extension-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-extension-penalty"
)
),
default=defaults["reassign_gap_extension_penalty"],
metavar="INT",
dest="gap_extension_penalty",
help=help_msg["reassign_gap_extension_penalty"],
)
reassign_optional_args.add_argument(
"--lambda",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--lambda"
)
),
default=defaults["reassign_lambda"],
metavar="FLOAT",
dest="lambda_value",
help=help_msg["reassign_lambda"],
)
reassign_optional_args.add_argument(
"-k",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="-k"
)
),
default=defaults["reassign_k"],
metavar="FLOAT",
dest="K_value",
help=help_msg["reassign_k"],
)
reassign_optional_args.add_argument(
"-o",
"--out-bam",
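The `type=lambda` validators above rely on the project's `check_values` helper; a standalone equivalent can be sketched as an argparse type factory (the helper name here is illustrative, not the package's API). Note that `dest="lambda_value"` is needed because the default destination for `--lambda` would collide with the Python keyword:

```python
import argparse

def bounded_float(minval, maxval, flag):
    # Type factory mirroring the check_values pattern: parse, range-check, return
    def parse(text):
        value = float(text)
        if not (minval <= value <= maxval):
            raise argparse.ArgumentTypeError(
                f"{flag} must be in [{minval}, {maxval}], got {value}")
        return value
    return parse

parser = argparse.ArgumentParser()
parser.add_argument("--lambda", type=bounded_float(0, float("inf"), "--lambda"),
                    default=1.33, dest="lambda_value", metavar="FLOAT")

args = parser.parse_args(["--lambda", "0.8"])
print(args.lambda_value)  # 0.8
```

Raising `ArgumentTypeError` from the `type` callable makes argparse print a usage message naming the offending flag, which is the same user experience the `check_values` wrappers provide.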
