Skip to content

K‐mer bounds lookup table

Tibo Vande Moortele edited this page Sep 18, 2024 · 1 revision

When searching for a peptide within the suffix array, we employ a binary search strategy across all the suffixes of the proteins in UniProtKB. In the 2024_04 release of UniProtKB, which contains more than 87 billion amino acids, we need to perform log2(87 * 10^9) ≈ 36,35 = 37 suffix comparisons, on average, per bound. Since we are identifying a range of suffixes that match with our peptide within the suffix array, two binary searches are required to determine the minimum and maximum bounds, resulting in a total of 74 comparisons per peptide. These comparisons are computationally costly and should be minimized. To reduce the number of searches, we precompute and store the initial bounds for all possible peptides of length k and shorter. This allows for a constant-time lookup, followed by a much smaller binary search step for each query.

For a k-mer table with factor k, there exist a total of 20k different k-mers. This can be seen as dividing the space of 87 billion amino acids into 20k compartments, where, on average, each compartment has size 2^37 / 20^k. By taking the logarithm log2(2^37 / 20^k) = 37 - k * log2(20), we can calculate how many text comparisons we still need to perform after consulting the k-mer bounds table. For a k of size 5, we only need 15 comparisons per bound instead of 37, resulting in significantly less work per search.

Clone this wiki locally