Update docs #99

Merged: 9 commits, Dec 1, 2024
README.md: 4 additions & 5 deletions
@@ -16,6 +16,7 @@
- Confidence intervals for both absolute and percentage change.
- Sample ratio mismatch check.
- Power analysis.
+- Multiple hypothesis testing (family-wise error rate and false discovery rate).

**tea-tasting** calculates statistics directly within data backends such as BigQuery, ClickHouse, PostgreSQL, Snowflake, Spark, and 20+ other backends supported by [Ibis](https://ibis-project.org/). This approach eliminates the need to import granular data into a Python environment, though Pandas DataFrames are also supported.

@@ -51,18 +52,16 @@ print(result)
#> revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
```

-Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/). Additionally, see the guides on [data backends](https://tea-tasting.e10v.me/data-backends/), [power analysis](https://tea-tasting.e10v.me/power-analysis/), and [custom metrics](https://tea-tasting.e10v.me/custom-metrics/).
+Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/). Additionally, see the guides on [data backends](https://tea-tasting.e10v.me/data-backends/), [power analysis](https://tea-tasting.e10v.me/power-analysis/), [multiple hypothesis testing](https://tea-tasting.e10v.me/multiple-testing/), and [custom metrics](https://tea-tasting.e10v.me/custom-metrics/).

## Roadmap

-- Multiple hypotheses testing:
-    - Family-wise error rate: Holm–Bonferroni method.
-    - False discovery rate: Benjamini–Hochberg procedure.
- Support more dataframes with [Narwhals](https://github.com/narwhals-dev/narwhals).
- A/A tests and simulations.
- More statistical tests:
    - Asymptotic and exact tests for frequency data.
    - Mann–Whitney U test.
-- Sequential testing: always valid p-value with mSPRT.
+- Sequential testing.

## Package name

docs/index.md: 3 additions & 2 deletions
@@ -16,7 +16,7 @@
- Confidence intervals for both absolute and percentage change.
- Sample ratio mismatch check.
- Power analysis.
-- Multiple hypotheses testing (family-wise error rate and false discovery rate).
+- Multiple hypothesis testing (family-wise error rate and false discovery rate).

**tea-tasting** calculates statistics directly within data backends such as BigQuery, ClickHouse, PostgreSQL, Snowflake, Spark, and 20+ other backends supported by [Ibis](https://ibis-project.org/). This approach eliminates the need to import granular data into a Python environment, though Pandas DataFrames are also supported.

@@ -52,10 +52,11 @@ print(result)
#> revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
```

-Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/). Additionally, see the guides on [data backends](https://tea-tasting.e10v.me/data-backends/), [power analysis](https://tea-tasting.e10v.me/power-analysis/), and [custom metrics](https://tea-tasting.e10v.me/custom-metrics/).
+Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/). Additionally, see the guides on [data backends](https://tea-tasting.e10v.me/data-backends/), [power analysis](https://tea-tasting.e10v.me/power-analysis/), [multiple hypothesis testing](https://tea-tasting.e10v.me/multiple-testing/), and [custom metrics](https://tea-tasting.e10v.me/custom-metrics/).

## Roadmap

- Support more dataframes with [Narwhals](https://github.com/narwhals-dev/narwhals).
- A/A tests and simulations.
- More statistical tests:
    - Asymptotic and exact tests for frequency data.
docs/multiple-testing.md: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
# Multiple testing

## Multiple hypothesis testing problem

The [multiple hypothesis testing problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) arises when there is more than one success metric or more than one treatment variant in an A/B test.

**tea-tasting** provides the following methods for multiple testing correction (a minimal sketch of these procedures follows the list):

- [False discovery rate](https://en.wikipedia.org/wiki/False_discovery_rate) (FDR) controlling procedures:
    - Benjamini-Yekutieli procedure, assuming arbitrary dependence between hypotheses.
    - Benjamini-Hochberg procedure, assuming non-negative correlation between hypotheses.
- [Family-wise error rate](https://en.wikipedia.org/wiki/Family-wise_error_rate) (FWER) controlling procedures:
    - Holm's step-down procedure, assuming arbitrary dependence between hypotheses.
    - Hochberg's step-up procedure, assuming non-negative correlation between hypotheses.
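
For intuition, here is a minimal, self-contained sketch of what these four procedures do to a set of p-values. It uses `statsmodels` purely for illustration; **tea-tasting** has its own implementation and does not depend on `statsmodels`:

```python
# Illustration only: apply the four procedures to four p-values using
# statsmodels. tea-tasting implements these procedures itself and does
# not require statsmodels.
from statsmodels.stats.multitest import multipletests

pvalues = [0.118, 0.0212, 0.0213, 0.00218]
for method in ("fdr_by", "fdr_bh", "holm", "simes-hochberg"):
    reject, pvalues_adj, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(method, [round(p, 4) for p in pvalues_adj])
```

The adjusted p-values from the FDR procedures agree, up to rounding, with the `pvalue_adj` columns in the examples below.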

As an example, let's consider an experiment with three variants, a control and two treatments:

```python
import pandas as pd
import tea_tasting as tt


data = pd.concat((
    tt.make_users_data(seed=42, orders_uplift=0.10, revenue_uplift=0.15),
    tt.make_users_data(seed=21, orders_uplift=0.15, revenue_uplift=0.20)
        .query("variant==1")
        .assign(variant=2),
))
print(data)
#> user variant sessions orders revenue
#> 0 0 1 2 1 9.582790
#> 1 1 0 2 1 6.434079
#> 2 2 1 2 1 8.304958
#> 3 3 1 2 1 16.652705
#> 4 4 0 1 1 7.136917
#> ... ... ... ... ... ...
#> 3989 3989 2 4 4 34.931448
#> 3991 3991 2 1 0 0.000000
#> 3992 3992 2 3 3 27.964647
#> 3994 3994 2 2 1 17.217892
#> 3998 3998 2 3 0 0.000000
#>
#> [6046 rows x 5 columns]
```

Let's calculate the experiment results:

```python
experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

results = experiment.analyze(data, control=0, all_variants=True)
print(results)
#> variants metric control treatment rel_effect_size rel_effect_size_ci pvalue
#> (0, 1) sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
#> (0, 1) orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
#> (0, 1) orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
#> (0, 1) revenue_per_user 5.24 5.99 14% [2.1%, 28%] 0.0212
#> (0, 2) sessions_per_user 2.00 2.02 0.98% [-2.1%, 4.1%] 0.532
#> (0, 2) orders_per_session 0.266 0.295 11% [1.2%, 22%] 0.0273
#> (0, 2) orders_per_user 0.530 0.594 12% [1.7%, 23%] 0.0213
#> (0, 2) revenue_per_user 5.24 6.25 19% [6.6%, 33%] 0.00218
```

Suppose that only two metrics, `orders_per_user` and `revenue_per_user`, are considered success metrics, while the other two, `sessions_per_user` and `orders_per_session`, are second-order diagnostic metrics.

```python
metrics = {"orders_per_user", "revenue_per_user"}
```

With two treatment variants and two success metrics, there are four hypotheses in total, which increases the probability of false positives (also called "false discoveries"). It's recommended to adjust the p-values or the significance level alpha in this case.
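
As a back-of-the-envelope illustration (assuming independent tests, which is not guaranteed in practice), the probability of at least one false positive across four tests, each at a significance level of 0.05, is:

```python
# Probability of at least one false positive across four independent
# tests at alpha = 0.05. Illustrative arithmetic, not a tea-tasting API.
alpha, m = 0.05, 4
print(round(1 - (1 - alpha) ** m, 4))
#> 0.1855
```

Let's explore the correction methods provided by **tea-tasting**.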

## False discovery rate

False discovery rate (FDR) is the expected value of the proportion of false discoveries among the discoveries (rejections of the null hypothesis). To control for FDR, use the [`adjust_fdr`](api/multiplicity.md#tea_tasting.multiplicity.adjust_fdr) method:

```python
adjusted_results_fdr = tt.adjust_fdr(results, metrics)
print(adjusted_results_fdr)
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> (0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.245
#> (0, 1) revenue_per_user 5.24 5.99 14% 0.0212 0.0592
#> (0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0592
#> (0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.0182
```

The method adjusts p-values and saves them as `pvalue_adj`. Compare these values to the desired significance level alpha to determine if the null hypotheses can be rejected.

The method also adjusts the significance level alpha and saves it as `alpha_adj`. Compare the unadjusted p-values (`pvalue`) to `alpha_adj` to determine whether the null hypotheses can be rejected:

```python
print(adjusted_results_fdr.to_string(keys=(
    "comparison",
    "metric",
    "control",
    "treatment",
    "rel_effect_size",
    "pvalue",
    "alpha_adj",
)))
#> comparison metric control treatment rel_effect_size pvalue alpha_adj
#> (0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.0240
#> (0, 1) revenue_per_user 5.24 5.99 14% 0.0212 0.0120
#> (0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0180
#> (0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00600
```
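
Both decision rules lead to the same conclusions. A quick check with the values printed above (plain Python, not a **tea-tasting** API):

```python
# The two decision rules agree on every hypothesis at alpha = 0.05,
# using the rounded values printed above.
alpha = 0.05
pvalue = [0.118, 0.0212, 0.0213, 0.00218]
pvalue_adj = [0.245, 0.0592, 0.0592, 0.0182]
alpha_adj = [0.0240, 0.0120, 0.0180, 0.00600]
for p, p_adj, a_adj in zip(pvalue, pvalue_adj, alpha_adj):
    assert (p_adj <= alpha) == (p <= a_adj)
```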

By default, **tea-tasting** assumes arbitrary dependence between hypotheses and performs the Benjamini-Yekutieli procedure. To perform the Benjamini-Hochberg procedure, assuming non-negative correlation between hypotheses, set the `arbitrary_dependence` parameter to `False`:

```python
print(tt.adjust_fdr(results, metrics, arbitrary_dependence=False))
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> (0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
#> (0, 1) revenue_per_user 5.24 5.99 14% 0.0212 0.0284
#> (0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0284
#> (0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00873
```

## Family-wise error rate

Family-wise error rate (FWER) is the probability of making at least one type I error. To control for FWER, use the [`adjust_fwer`](api/multiplicity.md#tea_tasting.multiplicity.adjust_fwer) method:

```python
print(tt.adjust_fwer(results, metrics))
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> (0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
#> (0, 1) revenue_per_user 5.24 5.99 14% 0.0212 0.0635
#> (0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0635
#> (0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00873
```

By default, **tea-tasting** assumes arbitrary dependence between hypotheses and performs Holm's step-down procedure with Bonferroni correction. To perform Hochberg's step-up procedure, assuming non-negative correlation between hypotheses, set the `arbitrary_dependence` parameter to `False`. In this case, you can also use the slightly more powerful Šidák correction instead of the Bonferroni correction:

```python
print(tt.adjust_fwer(
    results,
    metrics,
    arbitrary_dependence=False,
    method="sidak",
))
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> (0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
#> (0, 1) revenue_per_user 5.24 5.99 14% 0.0212 0.0422
#> (0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0422
#> (0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00870
```
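
The gain from Šidák over Bonferroni is real but small. For a single step over m comparisons, Bonferroni tests each hypothesis at alpha / m, while Šidák uses 1 - (1 - alpha)^(1 / m), a slightly larger threshold (plain arithmetic, not a **tea-tasting** API):

```python
# Per-comparison thresholds for m = 4 hypotheses at family-wise alpha = 0.05.
alpha, m = 0.05, 4
print(alpha / m)  # Bonferroni
#> 0.0125
print(round(1 - (1 - alpha) ** (1 / m), 6))  # Šidák
#> 0.012741
```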

## Other inputs

In the examples above, the methods `adjust_fdr` and `adjust_fwer` received results from a *single experiment* with *more than two variants*. They can also accept the results from *multiple experiments* with *two variants* in each:

```python
data1 = tt.make_users_data(seed=42, orders_uplift=0.10, revenue_uplift=0.15)
data2 = tt.make_users_data(seed=21, orders_uplift=0.15, revenue_uplift=0.20)

result1 = experiment.analyze(data1)
result2 = experiment.analyze(data2)

print(tt.adjust_fdr(
    {"Experiment 1": result1, "Experiment 2": result2},
    metrics,
))
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> Experiment 1 orders_per_user 0.530 0.573 8.0% 0.118 0.245
#> Experiment 1 revenue_per_user 5.24 5.99 14% 0.0212 0.0588
#> Experiment 2 orders_per_user 0.514 0.594 16% 0.00427 0.0178
#> Experiment 2 revenue_per_user 5.10 6.25 22% 6.27e-04 0.00523
```

The methods `adjust_fdr` and `adjust_fwer` can also accept the result of *a single experiment with two variants*:

```python
print(tt.adjust_fwer(result2, metrics))
#> comparison metric control treatment rel_effect_size pvalue pvalue_adj
#> - orders_per_user 0.514 0.594 16% 0.00427 0.00427
#> - revenue_per_user 5.10 6.25 22% 6.27e-04 0.00125
```
docs/user-guide.md: 60 additions & 3 deletions
@@ -336,11 +336,68 @@ experiment.metrics["orders_per_user"]

In **tea-tasting**, it's possible to analyze experiments with more than two variants. However, the variants will be compared in pairs through two-sample statistical tests.

Example usage:

```python
data = pd.concat((
    tt.make_users_data(seed=42),
    tt.make_users_data(seed=21).query("variant==1").assign(variant=2),
))

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

results = experiment.analyze(data, control=0, all_variants=True)
print(results)
#> variants metric control treatment rel_effect_size rel_effect_size_ci pvalue
#> (0, 1) sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
#> (0, 1) orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
#> (0, 1) orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
#> (0, 1) revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
#> (0, 2) sessions_per_user 2.00 2.02 0.98% [-2.1%, 4.1%] 0.532
#> (0, 2) orders_per_session 0.266 0.273 2.8% [-6.6%, 13%] 0.575
#> (0, 2) orders_per_user 0.530 0.550 3.8% [-6.0%, 15%] 0.465
#> (0, 2) revenue_per_user 5.24 5.41 3.1% [-8.1%, 16%] 0.599
```

How variant pairs are determined:

+- Specified control variant: If a specific variant is set as `control`, as in the example above, it is then compared against each of the other variants.
- Default control variant: When the `control` parameter of the `analyze` method is set to `None`, **tea-tasting** automatically compares each variant pair. The variant with the lowest ID in each pair is a control.
-- Specified control variant: If a specific variant is set as `control`, it is then compared against each of the other variants.

-The result of the analysis is a dictionary of `ExperimentResult` objects with tuples (control, treatment) as keys.
+Example usage without specifying a control variant:

```python
results = experiment.analyze(data, all_variants=True)
print(results)
#> variants metric control treatment rel_effect_size rel_effect_size_ci pvalue
#> (0, 1) sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
#> (0, 1) orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
#> (0, 1) orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
#> (0, 1) revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
#> (0, 2) sessions_per_user 2.00 2.02 0.98% [-2.1%, 4.1%] 0.532
#> (0, 2) orders_per_session 0.266 0.273 2.8% [-6.6%, 13%] 0.575
#> (0, 2) orders_per_user 0.530 0.550 3.8% [-6.0%, 15%] 0.465
#> (0, 2) revenue_per_user 5.24 5.41 3.1% [-8.1%, 16%] 0.599
#> (1, 2) sessions_per_user 1.98 2.02 1.7% [-1.4%, 4.8%] 0.294
#> (1, 2) orders_per_session 0.289 0.273 -5.5% [-14%, 3.6%] 0.225
#> (1, 2) orders_per_user 0.573 0.550 -4.0% [-13%, 5.7%] 0.407
#> (1, 2) revenue_per_user 5.73 5.41 -5.7% [-16%, 5.8%] 0.319
```

The result of the analysis is a mapping of `ExperimentResult` objects with tuples (control, treatment) as keys. You can view the result for a selected pair of variants:

```python
print(results[0, 1])
#> metric control treatment rel_effect_size rel_effect_size_ci pvalue
#> sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
#> orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
#> orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
#> revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
```

-Keep in mind that **tea-tasting** does not adjust for multiple comparisons. When dealing with multiple variant pairs, additional steps may be necessary to account for this, depending on your analysis needs.
+By default, **tea-tasting** does not adjust for multiple hypothesis testing. However, it provides several methods for multiple testing correction. For more details, see the [guide on multiple hypothesis testing](multiple-testing.md).
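
For example, the pairwise results above can be passed directly to a correction method such as `adjust_fdr` (output omitted here; the multiple testing guide walks through the interpretation):

```python
# Control the false discovery rate across all variant pairs for the two
# success metrics.
adjusted = tt.adjust_fdr(results, {"orders_per_user", "revenue_per_user"})
print(adjusted)
```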
mkdocs.yml: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ nav:
- User guide: user-guide.md
- Data backends: data-backends.md
- Power analysis: power-analysis.md
+- Multiple testing: multiple-testing.md
- Custom metrics: custom-metrics.md
- API reference:
    - API reference: api/index.md
src/tea_tasting/__init__.py: 1 addition & 1 deletion
@@ -10,7 +10,7 @@

- `tea_tasting.metrics`: Built-in metrics.
- `tea_tasting.experiment`: Experiment and experiment result.
-- `tea_tasting.multiplicity`: Multiple hypotheses testing.
+- `tea_tasting.multiplicity`: Multiple hypothesis testing.
- `tea_tasting.datasets`: Example datasets.
- `tea_tasting.config`: Global configuration.
- `tea_tasting.aggr`: Module for working with aggregated statistics.