Calculate exact null distribution #85

Closed
wants to merge 1 commit into from
32 changes: 31 additions & 1 deletion src/copairs/compute.py
@@ -10,6 +10,7 @@
from tqdm.autonotebook import tqdm
from scipy.spatial.distance import cdist
from scipy.spatial.distance import _METRICS_NAMES as SCIPY_METRICS_NAMES
from scipy.stats import hypergeom


def parallel_map(par_func: Callable[[int], None], items: np.ndarray) -> None:
@@ -531,7 +532,7 @@ def get_null_dists(
    # Function to generate null distributions for each configuration
    def par_func(i):
        num_pos, total = confs[i]
-       null_dists[i] = null_dist_cached(num_pos, total, seeds[i], null_size, cache_dir)
+       null_dists[i] = get_random_ap(total, num_pos)
Collaborator

@alxndrkalinin, Feb 23, 2025

I'm a bit confused by this change: we need a whole distribution here, whereas get_random_ap returns a single score.
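
To make the mismatch concrete, here is a minimal sketch of what a sampled null distribution looks like: an array of `null_size` AP scores, one per random ranking, rather than a single number. The helper names `average_precision` and `sample_null_dist` are hypothetical and are not the copairs implementation behind `null_dist_cached`.

```python
import numpy as np


def average_precision(rel: np.ndarray) -> float:
    """AP of a ranked binary relevance vector (1 = positive). Hypothetical helper."""
    hits = np.flatnonzero(rel) + 1                 # 1-based ranks of the positives
    precisions = np.arange(1, hits.size + 1) / hits  # precision at each positive
    return float(precisions.mean())


def sample_null_dist(num_pos: int, total: int, null_size: int, seed: int) -> np.ndarray:
    """Draw null_size AP scores from random rankings. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    rel = np.zeros(total, dtype=int)
    rel[:num_pos] = 1
    return np.array([average_precision(rng.permutation(rel)) for _ in range(null_size)])


null = sample_null_dist(num_pos=3, total=20, null_size=1000, seed=0)
print(null.shape)  # one score per random ranking
```

A function that returns only the mean of this distribution cannot be dropped into `null_dists[i]`, because the downstream p-value compares the observed score against every null score.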

Contributor Author

My bad, it already calculates the exact expected random average precision over all M-choose-n arrangements, so the p-value should probably be calculated in a different way, not as "the proportion of null scores >= observed score".
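
One possible way to get a p-value without sampling, for small inputs only, is to enumerate every placement of the positives and use the resulting exact null distribution. This is a sketch under that assumption, not what this PR implements; `exact_null_aps` and `exact_p_value` are hypothetical names.

```python
from itertools import combinations

import numpy as np


def exact_null_aps(M: int, n: int) -> np.ndarray:
    """AP for every placement of n positives among M ranks. Hypothetical helper."""
    k = np.arange(1, n + 1)  # number of positives seen at each hit
    # combinations() yields sorted 1-based rank tuples, so k / ranks is the
    # precision at each positive.
    return np.array(
        [np.mean(k / np.array(ranks)) for ranks in combinations(range(1, M + 1), n)]
    )


def exact_p_value(observed_ap: float, M: int, n: int) -> float:
    """Fraction of all placements whose AP is at least the observed AP."""
    aps = exact_null_aps(M, n)
    # Small tolerance so ties are not lost to floating-point rounding.
    return float(np.mean(aps >= observed_ap - 1e-12))


aps = exact_null_aps(4, 2)
print(len(aps), aps.mean())
print(exact_p_value(0.75, 4, 2))
```

The number of placements is M choose n, so enumeration is only practical for small configurations; larger ones would still need sampling or another exact method.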


    # Parallelize the generation of null distributions
    parallel_map(par_func, np.arange(num_confs))
@@ -645,3 +646,32 @@ def to_cutoffs(counts: np.ndarray) -> np.ndarray:
    cutoffs[1:] = counts.cumsum()[:-1]

    return cutoffs

def get_random_ap(M: int, n: int) -> float:
    """
    Calculate the expected average precision (AP) of a random ranking.

    Parameters
    ----------
    M : int
        The total number of items.
    n : int
        The number of positive items among the M items.

    Returns
    -------
    expected_ap : float
        The expected AP over all placements of n positives among M ranks.

    Notes
    -----
    The number of positives in the top N ranks follows a hypergeometric
    distribution. Given k positives in the top N, rank N holds a positive with
    probability k / N and the precision there is k / N, hence the squared term.
    """
    # k[i, j] = i + 1 (positives in the top N), N[i, j] = j + 1 (rank cutoff)
    k, N = np.indices((n, M)) + 1
    p_at_N = k / N
    # P(exactly k positives among the top N ranks)
    pmf = hypergeom.pmf(k, M, n, N)
    expected_ap = (pmf * p_at_N * p_at_N).sum() / n

    return expected_ap
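
As a sanity check on the hypergeometric sum above, the sketch below recomputes the same expectation with exact rational arithmetic and compares it with a closed form obtained from the hypergeometric moments, E[X(X-1)] + E[X]. It uses only the standard library; the function names are hypothetical and it assumes 1 <= n <= M.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k: int, M: int, n: int, N: int) -> Fraction:
    """P(k positives among the top N ranks) with n positives in M items."""
    if k < 0 or k > n or N - k < 0 or N - k > M - n:
        return Fraction(0)
    return Fraction(comb(n, k) * comb(M - n, N - k), comb(M, N))


def expected_random_ap_exact(M: int, n: int) -> Fraction:
    """Double sum over rank cutoffs N and hit counts k, as in the function above."""
    total = sum(
        hypergeom_pmf(k, M, n, N) * Fraction(k, N) ** 2
        for N in range(1, M + 1)
        for k in range(1, n + 1)
    )
    return total / n


def expected_random_ap_closed_form(M: int, n: int) -> Fraction:
    """(1 / M) * sum_N [1 + (N - 1)(n - 1) / (M - 1)] / N."""
    if M == 1:
        return Fraction(1)
    return sum(
        (1 + Fraction((N - 1) * (n - 1), M - 1)) / N for N in range(1, M + 1)
    ) / M


print(expected_random_ap_exact(4, 2), expected_random_ap_closed_form(4, 2))
```

Both routes give a single expected value, which is useful as a baseline but does not by itself provide the tail probabilities a p-value needs.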