Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new alignment score #23

Merged
merged 7 commits into from
Mar 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 30 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,18 +6,24 @@


A simple tool to process a BAM file and filter references with uneven coverages and estimate taxonomic abundances. FilterBAM has three main goals:
- Reassign reads to the reference they belong using an E-M algorithm that takes into account the read length and the alignment score. The alignment score is calculated using the following equation:
1. Reassign reads to the reference they belong using an E-M algorithm that takes into account the alignment score. The alignment score is calculated using the same equation than the [BLAST bit score](https://www.ncbi.nlm.nih.gov/books/NBK62051/) using the information available in the BAM file. The alignment score is calculated as follows:

$$
\begin{align*}
&\hspace{15pt}\text{Alignment Score} = \frac{\lambda \times S - \log(K)}{\log(2)} \\
&\hspace{15pt}\text{where:} \\
&\hspace{15pt}S = (\text{Number of matches} \times \text{Match reward}) - (\text{Number of mismatches} \times \text{Mismatch penalty}) \\
&\hspace{35pt} - (\text{Number of gaps} \times \text{Gap open penalty}) - (\text{Gap extensions} \times \text{Gap extension penalty})
\end{align*}
$$

$$\text{{Alignment Score}} = \left( \frac{{\text{{Identity}}}}{{\log(\text{{Read Length}})}} \right) \times \sqrt{\text{{Read Length}}}$$
* Number of gaps and gap extensions are obtained from the BAM tags **XG** and **XO** if present.
* Where **Match reward** is the score for a match, **Mismatch penalty** is the score for a mismatch, **Gap open penalty** is the score for opening a gap, and **Gap extension penalty** is the score for extending a gap.
* **lambda** and **K** are parameters dependent upon the scoring system (substitution matrix and gap costs) employed. Check [here](https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c) if you want use a different scoring system.

Where:
- `Identity` is the percentage identity of the alignment.
- `Read Length` is the length of the read.
2. Estimate several metrics for each reference in the BAM file and filter those references that do not meet the defined criteria.
3. Perform an LCA analysis using the reads that passed the filtering criteria and estimate the taxonomic abundances of each rank by normalizing the number of reads by the length of the reference. We resolve to the most likely reference using a likelihood-based approach. The likelihood is calculated for potential taxonomic paths from the partial taxonomic assignment of each read to its descendants in the taxonomy tree. The paths are ranked based on their likelihood, and the most probable reference is selected for each partial rank. It also can use the TAD (Truncated Average Depth) estimated reads for the LCA analysis, so it can minimize the effect of uneven coverages across the reference.

This equation incorporates both the percentage identity and the length of the read, with the logarithmic and square root components contributing to the weighting of these factors in the final alignment score. The logarithmic component is used to reduce the impact of the read length in the alignment score, while the square root component is used to increase the impact of the percentage identity.

- Estimate several metrics for each reference in the BAM file and filter those references that do not meet the defined criteria.
- Perform an LCA analysis using the reads that passed the filtering criteria and estimate the taxonomic abundances of each rank by normalizing the number of reads by the length of the reference. We resolve to the most likely reference using a likelihood-based approach. The likelihood is calculated for potential taxonomic paths from the partial taxonomic assignment of each read to its descendants in the taxonomy tree. The paths are ranked based on their likelihood, and the most probable reference is selected for each partial rank. It also can use the TAD (Truncated Average Depth) estimated reads for the LCA analysis, so it can minimize the effect of uneven coverages across the reference.

# Installation

Expand All @@ -38,24 +44,18 @@ Then we proceed to install using pip:
pip install bam-filter
```

### Using conda

```bash
conda install -c conda-forge -c bioconda -c genomewalker bam-filter
```

### Install from source to use the development version

Using pip

```bash
pip install git+ssh://git@github.com/genomewalker/bam-filter.git
pip install git+https://github.com/genomewalker/bam-filter.git
```

By cloning in a dedicated conda environment

```bash
git clone git@github.com:genomewalker/bam-filter.git
git clone https://github.com/genomewalker/bam-filter.git
cd bam-filter
conda env create -f environment.yml
conda activate bam-filter
Expand Down Expand Up @@ -94,7 +94,10 @@ Full list of options:
```bash
$ filterBAM reassign --help
usage: filterBAM reassign [-h] --bam BAM [-p STR] [-r FILE] [-t INT] [-i INT] [-s FLOAT]
[-A FLOAT] [-l INT] [-n INT] [-o [FILE]] [-m STR] [-N] [--tmp-dir DIR]
[-A FLOAT] [-l INT] [-n INT] [--match-reward INT]
[--mismatch-penalty INT] [--gap-open-penalty INT]
[--gap-extension-penalty INT] [--lambda FLOAT] [-k FLOAT] [-o [FILE]]
[-m STR] [-N] [--tmp-dir DIR]

optional arguments:
-h, --help show this help message and exit
Expand All @@ -117,6 +120,15 @@ Re-assign optional arguments:
Minimum read length (default: 30)
-n INT, --min-read-count INT
Minimum read count (default: 3)
--match-reward INT Match reward for the alignment score (default: 1)
--mismatch-penalty INT
Mismatch penalty for the alignment score (default: -2)
--gap-open-penalty INT
Gap open penalty for alignment score computation (default: 5)
--gap-extension-penalty INT
Gap extension penalty for the alignment score (default: 2)
--lambda FLOAT Lambda parameter for the alignment score (default: 1.33)
-k FLOAT K parameter for the alignment score algorithm (default: 0.621)
-o [FILE], --out-bam [FILE]
Save a BAM file without multimapping reads (default: None)
-m STR, --sort-memory STR
Expand Down
4 changes: 3 additions & 1 deletion bam_filter/lca.py
Original file line number Diff line number Diff line change
Expand Up @@ -257,7 +257,9 @@ def get_taxonomy_info(refids, taxdb, acc2taxid, nprocs=1, custom=False):
else:
cols = ["accession.version", "taxid"]
name = "accession.version"

dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
dt.options.nthreads = nprocs
filtered_df = dt.fread(acc2taxid, verbose=False, nthreads=nprocs)
filt = dt.Frame(refids, names=[name])
filt["FOO"] = 1
Expand Down
78 changes: 71 additions & 7 deletions bam_filter/reassign.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ def resolve_multimaps(data, scale=0.9, iters=10):
progress_bar = tqdm.tqdm(
total=9,
desc=f"Iter {current_iter + 1}",
unit="step",
unit=" step",
disable=False, # Replace with your logic or a boolean value
leave=False,
ncols=80,
Expand Down Expand Up @@ -422,11 +422,44 @@ def write_reassigned_bam(
os.remove(out_files["bam_reassigned_tmp"])


def calculate_alignment_score(identity, read_length):
return (identity / math.log(read_length)) * math.sqrt(read_length)


def get_bam_data(parms, ref_lengths=None, percid=90, min_read_length=30, threads=1):
# def calculate_alignment_score(identity, read_length):
# return (identity / math.log(read_length)) * math.sqrt(read_length)

# Values from:
# https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/blast/core/blast_stat.c
def calculate_alignment_score(
num_matches,
num_mismatches,
num_gaps,
gap_extensions,
match_reward,
mismatch_penalty,
gap_open_penalty,
gap_extension_penalty,
lambda_value,
K_value
):
# Calculate the raw alignment score
S = (num_matches * match_reward) - (num_mismatches * mismatch_penalty) - (num_gaps * gap_open_penalty) - (gap_extensions * gap_extension_penalty)

# Calculate the approximate bit score
bit_score = (lambda_value * S - math.log(K_value)) / math.log(2)

return bit_score

def get_bam_data(
parms,
ref_lengths=None,
percid=90,
min_read_length=30,
threads=1,
match_reward=1,
mismatch_penalty=-1,
gap_open_penalty=1,
gap_extension_penalty=2,
lambda_value=1.02,
K_value=0.21
):
bam, references = parms
dt.options.progress.enabled = True
dt.options.progress.clear_on_success = True
Expand Down Expand Up @@ -464,8 +497,12 @@ def get_bam_data(parms, ref_lengths=None, percid=90, min_read_length=30, threads
continue

pident = (1 - ((aln.get_tag("NM") / query_length))) * 100
num_mismatches = aln.get_tag("NM")
num_matches = query_length - num_mismatches
num_gaps = aln.get_tag("XO") if aln.has_tag("XO") else 0
gap_extensions = aln.get_tag("XG") if aln.has_tag("XG") else 0
if pident >= percid:
var = calculate_alignment_score(pident, query_length)
var = calculate_alignment_score(num_matches=num_matches, num_mismatches=num_mismatches, num_gaps=num_gaps, gap_extensions=gap_extensions, match_reward=match_reward, mismatch_penalty=mismatch_penalty, gap_open_penalty=gap_open_penalty, gap_extension_penalty=gap_extension_penalty, lambda_value=lambda_value, K_value=K_value)
aln_data.append(
(
aln.query_name,
Expand Down Expand Up @@ -498,6 +535,12 @@ def get_bam_data(parms, ref_lengths=None, percid=90, min_read_length=30, threads
def reassign_reads(
bam,
out_files,
match_reward,
mismatch_penalty,
gap_open_penalty,
gap_extension_penalty,
lambda_value,
K_value,
reference_lengths=None,
threads=1,
min_read_count=1,
Expand Down Expand Up @@ -586,7 +629,14 @@ def reassign_reads(
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
Expand All @@ -608,7 +658,15 @@ def reassign_reads(
get_bam_data,
ref_lengths=ref_len_dict,
percid=min_read_ani,
min_read_length=min_read_length,
match_reward=match_reward,
mismatch_penalty=mismatch_penalty,
gap_open_penalty=gap_open_penalty,
gap_extension_penalty=gap_extension_penalty,
lambda_value=lambda_value,
K_value=K_value,
threads=4,

),
parms,
chunksize=1,
Expand Down Expand Up @@ -904,5 +962,11 @@ def reassign(args):
sort_memory=args.sort_memory,
sort_by_name=args.sort_by_name,
out_files=out_files,
match_reward=args.match_reward,
mismatch_penalty=args.mismatch_penalty,
gap_open_penalty=args.gap_open_penalty,
gap_extension_penalty=args.gap_extension_penalty,
lambda_value=args.lambda_value,
K_value=args.K_value,
)
log.info("Done!")
86 changes: 84 additions & 2 deletions bam_filter/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -415,7 +415,12 @@ def check_lca_ranks(val, parser, var):
"tmp_dir": None,
"reassign_iters": 25,
"reassign_scale": 0.9,
"reassign_target": "mapq",
"reassign_match_reward": 1,
"reassign_mismatch_penalty": -2,
"reassign_gap_open_penalty": 5,
"reassign_gap_extension_penalty": 2,
"reassign_lambda": 1.33,
"reassign_k": 0.621,
"rank_lca": "species",
"lca_summary": None,
}
Expand Down Expand Up @@ -461,7 +466,12 @@ def check_lca_ranks(val, parser, var):
"reassign_method": "Method for the EM algorithm",
"reassign_iters": "Number of iterations for the EM algorithm",
"reassign_scale": "Scale to select the best weithing alignments",
"reassign_target": "Which target to use for the EM algorith, Only mapq or pident.",
"reassign_match_reward": "Match reward for the alignment score ",
"reassign_mismatch_penalty": "Mismatch penalty for the alignment score ",
"reassign_gap_open_penalty": "Gap open penalty for alignment score computation",
"reassign_gap_extension_penalty": "Gap extension penalty for the alignment score",
"reassign_lambda": "Lambda parameter for the alignment score",
"reassign_k": "K parameter for the alignment score algorithm",
"lca": "Calculate LCA for each read and estimate abundances",
"names": "Names dmp file from taxonomy",
"nodes": "Nodes dmp file from taxonomy",
Expand Down Expand Up @@ -658,6 +668,78 @@ def get_arguments(argv=None):
dest="min_read_count",
help=help_msg["min_read_count"],
)
reassign_optional_args.add_argument(
"--match-reward",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--match-reward"
)
),
default=defaults["reassign_match_reward"],
metavar="INT",
dest="match_reward",
help=help_msg["reassign_match_reward"],
)
reassign_optional_args.add_argument(
"--mismatch-penalty",
type=lambda x: int(
check_values(
x, minval=-np.Inf, maxval=0, parser=parser, var="--mismatch-penalty"
)
),
default=defaults["reassign_mismatch_penalty"],
metavar="INT",
dest="mismatch_penalty",
help=help_msg["reassign_mismatch_penalty"],
)
reassign_optional_args.add_argument(
"--gap-open-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-open-penalty"
)
),
default=defaults["reassign_gap_open_penalty"],
metavar="INT",
dest="gap_open_penalty",
help=help_msg["reassign_gap_open_penalty"],
)
reassign_optional_args.add_argument(
"--gap-extension-penalty",
type=lambda x: int(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--gap-extension-penalty"
)
),
default=defaults["reassign_gap_extension_penalty"],
metavar="INT",
dest="gap_extension_penalty",
help=help_msg["reassign_gap_extension_penalty"],
)
reassign_optional_args.add_argument(
"--lambda",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="--lambda"
)
),
default=defaults["reassign_lambda"],
metavar="FLOAT",
dest="lambda_value",
help=help_msg["reassign_lambda"],
)
reassign_optional_args.add_argument(
"-k",
type=lambda x: float(
check_values(
x, minval=0, maxval=np.Inf, parser=parser, var="-K"
)
),
default=defaults["reassign_k"],
metavar="FLOAT",
dest="K_value",
help=help_msg["reassign_k"],
)
reassign_optional_args.add_argument(
"-o",
"--out-bam",
Expand Down
Loading