Added birthday paradox corrections
LucaCappelletti94 committed Sep 2, 2024
1 parent d2806f0 commit 819034d
Showing 15 changed files with 2,216 additions and 1,495 deletions.
1 change: 1 addition & 0 deletions hash_list_correction/.gitignore
@@ -0,0 +1 @@
switch_hash_correction_*.csv
10 changes: 10 additions & 0 deletions hash_list_correction/Cargo.toml
@@ -4,3 +4,13 @@ version = "0.1.0"
edition = "2021"

[dependencies]
hyperloglog-rs = {path = "../", features = ["all_precisions", "std"]}
syn = "2.0"
quote = "1.0"
proc-macro2 = "1.0"
prettyplease = "0.2"
indicatif = { version = "0.17.8", features = ["rayon"] }
rayon = "1.10.0"
twox-hash = "1.6.3"
test_utils = {path="../test_utils"}
serde = { version = "1.0.208", features = ["derive"] }
62 changes: 62 additions & 0 deletions hash_list_correction/README.md
@@ -0,0 +1,62 @@
# Hash List correction
The Hash List low-cardinality correction approach for HyperLogLog counters is itself subject to bias, primarily caused by the so-called birthday paradox: the probability that two elements in a set of size n share the same hash value is higher than one might expect. This is a well-known issue in computer science and cryptography, and it is one reason why hash functions are designed to be collision-resistant. Although this bias is hard to derive theoretically because of the several techniques we employ, we can measure it empirically.
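
To get a feel for the size of the effect, here is a minimal sketch of the textbook birthday-paradox model. It assumes an ideal hash that maps each element uniformly to one of `2^bits` values, which is a simplification and not necessarily how the Hash List stores its hashes. The function name `expected_distinct_hashes` is hypothetical and not part of this crate.

```rust
/// Expected number of distinct hash values observed after hashing `n`
/// distinct elements uniformly at random into `2^bits` possible values.
///
/// Textbook formula: m * (1 - (1 - 1/m)^n), with m = 2^bits.
/// `ln_1p` and `exp_m1` keep the result accurate when 1/m is tiny.
/// Assumption: an ideal uniform hash; real hash lists may differ.
fn expected_distinct_hashes(n: u64, bits: u32) -> f64 {
    let m = 2f64.powi(bits as i32);
    // (1 - 1/m)^n = exp(n * ln(1 - 1/m))
    let log_no_hit = (n as f64) * (-1.0 / m).ln_1p();
    // m * (1 - exp(x)) = -m * (exp(x) - 1)
    -m * log_no_hit.exp_m1()
}
```

Under this model, the expected undercount at a true cardinality `n` is `n as f64 - expected_distinct_hashes(n, bits)`, and it grows as `n` approaches `2^bits`.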

Once the error is measured, we can subtract it from the cardinality estimate to obtain a more accurate result. This is the purpose of the `hash_list_correction` module, which provides a simple interface to measure the bias and correct the cardinality estimate.
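
The measure-then-subtract idea can be sketched as follows. This is an illustration under stated assumptions, not the code in `switch_hash`: the helper names `mean_bias` and `correct` are hypothetical, the bias is taken as the mean signed error (estimate minus true cardinality) over repeated trials, and a single scalar bias is applied rather than a per-cardinality table.

```rust
/// Mean signed error over repeated trials.
/// Each sample is a pair `(estimated_cardinality, true_cardinality)`.
/// Hypothetical helper for illustration only.
fn mean_bias(samples: &[(f64, f64)]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let total: f64 = samples
        .iter()
        .map(|(estimate, truth)| estimate - truth)
        .sum();
    total / samples.len() as f64
}

/// Subtract a previously measured bias from a raw estimate,
/// clamping at zero because a cardinality cannot be negative.
fn correct(estimate: f64, bias: f64) -> f64 {
    (estimate - bias).max(0.0)
}
```

With this sign convention, an estimator that undercounts has a negative bias, so subtracting the bias raises the corrected estimate.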

## Usage
As with any Rust binary crate, run:

```bash
RUSTFLAGS='-C target-cpu=native' cargo run --release
```

## Switch Hash Results
The results regarding the switch hash are as follows:

| precision | bits | maximal_mean_relative_error | peak_estimated_cardinality | bias | error_reduction |
|-----------|------|-----------------------------|-----------------------------|------|-----------------|
| 4 | 4 | 0.0691152 | 7.45 | 1.00 | 56.22 |
| 4 | 5 | 0.0001191 | 8.00 | 1.00 | 156.17 |
| 4 | 6 | 0.0001077 | 8.00 | 1.00 | 109.75 |
| 5 | 4 | 0.0001238 | 8.00 | 1.00 | 105.03 |
| 5 | 5 | 0.0001798 | 12.00 | 1.00 | 57.33 |
| 5 | 6 | 0.0001709 | 12.00 | 1.00 | 70.69 |
| 6 | 4 | 0.0002604 | 16.00 | 1.00 | 191.23 |
| 6 | 5 | 0.0002986 | 19.99 | 1.00 | 162.92 |
| 6 | 6 | 0.0003535 | 23.99 | 1.00 | 221.90 |
| 7 | 4 | 0.0004990 | 31.98 | 1.00 | 826.10 |
| 7 | 5 | 0.0005987 | 39.98 | 1.00 | 751.54 |
| 7 | 6 | 0.0007155 | 47.97 | 1.00 | 668.20 |
| 8 | 4 | 0.0005582 | 63.96 | 41.00 | 26.07 |
| 8 | 5 | 0.0006774 | 79.95 | 51.00 | 24.80 |
| 8 | 6 | 0.0008063 | 95.92 | 62.00 | 26.04 |
| 9 | 4 | 0.0010797 | 127.86 | 83.00 | 25.15 |
| 9 | 5 | 0.0013529 | 159.78 | 103.00 | 25.56 |
| 9 | 6 | 0.0016084 | 191.69 | 125.00 | 25.72 |
| 10 | 4 | 0.0021341 | 255.45 | 165.00 | 26.05 |
| 10 | 5 | 0.0026572 | 319.15 | 207.00 | 24.27 |
| 10 | 6 | 0.0349638 | 379.62 | 251.00 | 17.03 |
| 11 | 4 | 0.0041498 | 509.88 | 331.99 | 21.21 |
| 11 | 5 | 0.0293490 | 637.79 | 414.99 | 17.81 |
| 11 | 6 | 0.0000331 | 511.98 | 1.00 | 207.34 |
| 12 | 4 | 0.0235120 | 1021.98 | 664.97 | 20.97 |
| 12 | 5 | 0.0000586 | 852.95 | 1.00 | 203.13 |
| 12 | 6 | 0.0000633 | 1023.94 | 1.00 | 283.56 |
| 13 | 4 | 0.0000954 | 1364.87 | 1.00 | 292.92 |
| 13 | 5 | 0.0001082 | 1705.82 | 1.00 | 302.12 |
| 13 | 6 | 0.0001232 | 2047.75 | 1.00 | 352.58 |
| 14 | 4 | 0.0000813 | 2729.78 | 2017.00 | 22.68 |
| 14 | 5 | 0.0000917 | 3412.69 | 2523.00 | 22.42 |
| 14 | 6 | 0.0001078 | 4095.56 | 3028.00 | 22.78 |
| 15 | 4 | 0.0001507 | 5460.18 | 4034.00 | 23.08 |
| 15 | 5 | 0.0001801 | 6824.77 | 5040.99 | 22.02 |
| 15 | 6 | 0.0002143 | 8190.24 | 6049.99 | 21.92 |
| 16 | 4 | 0.0002903 | 10918.83 | 8058.98 | 22.23 |
| 16 | 5 | 0.0003558 | 13649.14 | 10082.97 | 21.80 |
| 16 | 6 | 0.0004274 | 16380.00 | 12106.96 | 21.89 |
| 17 | 4 | 0.0005662 | 21838.63 | 16133.93 | 21.76 |
| 17 | 5 | 0.0007094 | 27306.61 | 20172.88 | 21.90 |
| 17 | 6 | 0.0008523 | 32772.45 | 24199.83 | 22.47 |
| 18 | 4 | 0.0011257 | 43711.11 | 32258.70 | 22.13 |
| 18 | 5 | 0.0014110 | 54661.67 | 40339.54 | 22.14 |
| 18 | 6 | 0.0198708 | 66823.58 | 48436.36 | 35.60 |
12 changes: 11 additions & 1 deletion hash_list_correction/src/main.rs
@@ -1,5 +1,15 @@
//! Rust script to identify the optimal correction
#![deny(unsafe_code)]
#![deny(unused_macro_rules)]
#![deny(missing_docs)]
extern crate prettyplease;
extern crate proc_macro2;
extern crate quote;
extern crate syn;

mod switch_hash;
use switch_hash::compute_switch_hash_correction;

fn main() {
    println!("Hello, world!");
    compute_switch_hash_correction();
}