Merge pull request #23 from genomewalker/bitscore
genomewalker authored Mar 1, 2024
2 parents 95c712a + b7b6e95 commit 76a0e26
Showing 4 changed files with 188 additions and 28 deletions.
48 changes: 30 additions & 18 deletions README.md
@@ -6,18 +6,24 @@


A simple tool to process a BAM file, filter references with uneven coverage, and estimate taxonomic abundances. FilterBAM has three main goals:
1. Reassign reads to the reference they belong to using an E-M algorithm that takes the alignment score into account. The alignment score is calculated with the same equation as the [BLAST bit score](https://www.ncbi.nlm.nih.gov/books/NBK62051/), using the information available in the BAM file, as follows:

$$
\begin{align*}
&\hspace{15pt}\text{Alignment Score} = \frac{\lambda \times S - \log(K)}{\log(2)} \\
&\hspace{15pt}\text{where:} \\
&\hspace{15pt}S = (\text{Number of matches} \times \text{Match reward}) - (\text{Number of mismatches} \times \text{Mismatch penalty}) \\
&\hspace{35pt} - (\text{Number of gaps} \times \text{Gap open penalty}) - (\text{Gap extensions} \times \text{Gap extension penalty})
\end{align*}
$$

* The number of gap opens and gap extensions are obtained from the BAM tags **XO** and **XG**, respectively, if present.
* **Match reward**, **Mismatch penalty**, **Gap open penalty**, and **Gap extension penalty** are the scores for a match, a mismatch, opening a gap, and extending a gap, respectively.
* **lambda** and **K** are parameters that depend on the scoring system (substitution matrix and gap costs) employed. Check [here](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c) if you want to use a different scoring system.

Where:
- `Identity` is the percentage identity of the alignment.
- `Read Length` is the length of the read.
2. Estimate several metrics for each reference in the BAM file and filter those references that do not meet the defined criteria.
3. Perform an LCA analysis using the reads that passed the filtering criteria and estimate the taxonomic abundances of each rank by normalizing the number of reads by the length of the reference. We resolve the most likely reference using a likelihood-based approach: the likelihood is calculated for potential taxonomic paths from the partial taxonomic assignment of each read to its descendants in the taxonomy tree, the paths are ranked by likelihood, and the most probable reference is selected for each partial rank. It can also use the TAD (truncated average depth) read estimates for the LCA analysis, minimizing the effect of uneven coverage across the reference.
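The bit-score equation above can be sketched in a few lines of Python. This is an illustration, not the package's implementation: the function name is hypothetical, the defaults mirror the CLI defaults of `filterBAM reassign`, and the penalties are assumed to be positive magnitudes that are subtracted.

```python
import math

def blast_bit_score(matches, mismatches, gap_opens, gap_extensions,
                    match_reward=1, mismatch_penalty=2, gap_open_penalty=5,
                    gap_extension_penalty=2, lam=1.33, k=0.621):
    # Raw score S: reward matches, subtract mismatch and gap costs
    s = (matches * match_reward
         - mismatches * mismatch_penalty
         - gap_opens * gap_open_penalty
         - gap_extensions * gap_extension_penalty)
    # Normalize into bits with the Karlin-Altschul parameters lambda and K
    return (lam * s - math.log(k)) / math.log(2)

# A 100 bp read with 95 matches, 5 mismatches, and no gaps
print(round(blast_bit_score(95, 5, 0, 0), 1))  # ≈ 163.8
```

Reads with more matches (or fewer gaps) score strictly higher, which is what lets the E-M step prefer the better-supported reference.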

# Installation

@@ -38,24 +44,18 @@ Then we proceed to install using pip:
pip install bam-filter
```

### Install from source to use the development version

Using pip

```bash
pip install git+https://github.com/genomewalker/bam-filter.git
```

By cloning into a dedicated conda environment

```bash
git clone https://github.com/genomewalker/bam-filter.git
cd bam-filter
conda env create -f environment.yml
conda activate bam-filter
@@ -94,7 +94,10 @@ Full list of options:
```bash
$ filterBAM reassign --help
usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT]
[-A FLOAT] [-l INT] [-n INT] [--match-reward INT]
[--mismatch-penalty INT] [--gap-open-penalty INT]
[--gap-extension-penalty INT] [--lambda FLOAT] [-k FLOAT] [-o [FILE]]
[-m STR] [-N] [--tmp-dir DIR]

optional arguments:
-h, --help show this help message and exit
@@ -117,6 +120,15 @@ Re-assign optional arguments:
Minimum read length (default: 30)
-n INT, --min-read-count INT
Minimum read count (default: 3)
--match-reward INT Match reward for the alignment score (default: 1)
--mismatch-penalty INT
Mismatch penalty for the alignment score (default: -2)
--gap-open-penalty INT
Gap open penalty for the alignment score (default: 5)
--gap-extension-penalty INT
Gap extension penalty for the alignment score (default: 2)
--lambda FLOAT Lambda parameter for the alignment score (default: 1.33)
-k FLOAT K parameter for the alignment score algorithm (default: 0.621)
-o [FILE], --out-bam [FILE]
Save a BAM file without multimapping reads (default: None)
-m STR, --sort-memory STR
4 changes: 3 additions & 1 deletion bam_filter/lca.py
@@ -257,7 +257,9 @@ def get_taxonomy_info(refids, taxdb, acc2taxid, nprocs=1, custom=False):
else:
cols = ["accession.version", "taxid"]
name = "accession.version"

dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
dt.options.nthreads = nprocs
filtered_df = dt.fread(acc2taxid, verbose=False, nthreads=nprocs)
filt = dt.Frame(refids, names=[name])
filt["FOO"] = 1
78 changes: 71 additions & 7 deletions bam_filter/reassign.py
@@ -55,7 +55,7 @@ def resolve_multimaps(data, scale=0.9, iters=10):
progress_bar = tqdm.tqdm(
total=9,
desc=f"Iter {current_iter + 1}",
unit=" step",
disable=False, # Replace with your logic or a boolean value
leave=False,
ncols=80,
@@ -422,11 +422,44 @@ def write_reassigned_bam(
os.remove(out_files["bam_reassigned_tmp"])


# def calculate_alignment_score(identity, read_length):
# return (identity / math.log(read_length)) * math.sqrt(read_length)

# Values from:
# https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c
def calculate_alignment_score(
    num_matches,
    num_mismatches,
    num_gaps,
    gap_extensions,
    match_reward,
    mismatch_penalty,
    gap_open_penalty,
    gap_extension_penalty,
    lambda_value,
    K_value,
):
    # Raw alignment score from the match/mismatch/gap counts
    S = (
        (num_matches * match_reward)
        - (num_mismatches * mismatch_penalty)
        - (num_gaps * gap_open_penalty)
        - (gap_extensions * gap_extension_penalty)
    )

    # Approximate bit score (BLAST formula)
    bit_score = (lambda_value * S - math.log(K_value)) / math.log(2)

    return bit_score

def get_bam_data(
parms,
ref_lengths=None,
percid=90,
min_read_length=30,
threads=1,
match_reward=1,
mismatch_penalty=-1,
gap_open_penalty=1,
gap_extension_penalty=2,
lambda_value=1.02,
K_value=0.21
):
bam, references = parms
dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
@@ -464,8 +497,12 @@ def get_bam_data(
continue

pident = (1 - ((aln.get_tag("NM") / query_length))) * 100
num_mismatches = aln.get_tag("NM")
num_matches = query_length - num_mismatches
num_gaps = aln.get_tag("XO") if aln.has_tag("XO") else 0
gap_extensions = aln.get_tag("XG") if aln.has_tag("XG") else 0
if pident >= percid:
var = calculate_alignment_score(
    num_matches=num_matches,
    num_mismatches=num_mismatches,
    num_gaps=num_gaps,
    gap_extensions=gap_extensions,
    match_reward=match_reward,
    mismatch_penalty=mismatch_penalty,
    gap_open_penalty=gap_open_penalty,
    gap_extension_penalty=gap_extension_penalty,
    lambda_value=lambda_value,
    K_value=K_value,
)
aln_data.append(
(
aln.query_name,
Expand Down Expand Up @@ -498,6 +535,12 @@ def get_bam_data(parms, ref_lengths=None, percid=90, min_read_length=30, threads
def reassign_reads(
bam,
out_files,
match_reward,
mismatch_penalty,
gap_open_penalty,
gap_extension_penalty,
lambda_value,
K_value,
reference_lengths=None,
threads=1,
min_read_count=1,
@@ -586,7 +629,14 @@ def reassign_reads(
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
@@ -608,7 +658,15 @@
get_bam_data,
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
@@ -904,5 +962,11 @@ def reassign(args):
sort_memory=args.sort_memory,
sort_by_name=args.sort_by_name,
out_files=out_files,
match_reward=args.match_reward,
mismatch_penalty=args.mismatch_penalty,
gap_open_penalty=args.gap_open_penalty,
gap_extension_penalty=args.gap_extension_penalty,
lambda_value=args.lambda_value,
K_value=args.K_value,
)
log.info("Done!")
86 changes: 84 additions & 2 deletions bam_filter/utils.py
@@ -415,7 +415,12 @@ def check_lca_ranks(val, parser, var):
"tmp_dir": None,
"reassign_iters": 25,
"reassign_scale": 0.9,
"reassign_match_reward": 1,
"reassign_mismatch_penalty": -2,
"reassign_gap_open_penalty": 5,
"reassign_gap_extension_penalty": 2,
"reassign_lambda": 1.33,
"reassign_k": 0.621,
"rank_lca": "species",
"lca_summary": None,
}
@@ -461,7 +466,12 @@ def check_lca_ranks(val, parser, var):
"reassign_method": "Method for the EM algorithm",
"reassign_iters": "Number of iterations for the EM algorithm",
"reassign_scale": "Scale to select the best weighted alignments",
"reassign_match_reward": "Match reward for the alignment score",
"reassign_mismatch_penalty": "Mismatch penalty for the alignment score",
"reassign_gap_open_penalty": "Gap open penalty for alignment score computation",
"reassign_gap_extension_penalty": "Gap extension penalty for the alignment score",
"reassign_lambda": "Lambda parameter for the alignment score",
"reassign_k": "K parameter for the alignment score algorithm",
"lca": "Calculate LCA for each read and estimate abundances",
"names": "Names dmp file from taxonomy",
"nodes": "Nodes dmp file from taxonomy",
@@ -658,6 +668,78 @@ def get_arguments(argv=None):
dest="min_read_count",
help=help_msg["min_read_count"],
)
reassign_optional_args.add_argument(
"--match-reward",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--match-reward"
)
),
default=defaults["reassign_match_reward"],
metavar="INT",
dest="match_reward",
help=help_msg["reassign_match_reward"],
)
reassign_optional_args.add_argument(
"--mismatch-penalty",
type=lambda x: int(
check_values(
x, minval=-np.Inf, maxval=0, parser=parser, var="--mismatch-penalty"
)
),
default=defaults["reassign_mismatch_penalty"],
metavar="INT",
dest="mismatch_penalty",
help=help_msg["reassign_mismatch_penalty"],
)
reassign_optional_args.add_argument(
"--gap-open-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-open-penalty"
)
),
default=defaults["reassign_gap_open_penalty"],
metavar="INT",
dest="gap_open_penalty",
help=help_msg["reassign_gap_open_penalty"],
)
reassign_optional_args.add_argument(
"--gap-extension-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-extension-penalty"
)
),
default=defaults["reassign_gap_extension_penalty"],
metavar="INT",
dest="gap_extension_penalty",
help=help_msg["reassign_gap_extension_penalty"],
)
reassign_optional_args.add_argument(
"--lambda",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--lambda"
)
),
default=defaults["reassign_lambda"],
metavar="FLOAT",
dest="lambda_value",
help=help_msg["reassign_lambda"],
)
reassign_optional_args.add_argument(
"-k",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="-k"
)
),
default=defaults["reassign_k"],
metavar="FLOAT",
dest="K_value",
help=help_msg["reassign_k"],
)
reassign_optional_args.add_argument(
"-o",
"--out-bam",
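The `type=lambda` validators above rely on the project's `check_values` helper; a standalone equivalent can be sketched as an argparse type factory (the helper name here is illustrative, not the package's API). Note that `dest="lambda_value"` is needed because the default destination for `--lambda` would collide with the Python keyword:

```python
import argparse

def bounded_float(minval, maxval, flag):
    # Type factory mirroring the check_values pattern: parse, range-check, return
    def parse(text):
        value = float(text)
        if not (minval <= value <= maxval):
            raise argparse.ArgumentTypeError(
                f"{flag} must be in [{minval}, {maxval}], got {value}")
        return value
    return parse

parser = argparse.ArgumentParser()
parser.add_argument("--lambda", type=bounded_float(0, float("inf"), "--lambda"),
                    default=1.33, dest="lambda_value", metavar="FLOAT")

args = parser.parse_args(["--lambda", "0.8"])
print(args.lambda_value)  # 0.8
```

Raising `ArgumentTypeError` from the `type` callable makes argparse print a usage message naming the offending flag, which is the same user experience the `check_values` wrappers provide.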
