Test differences date by date #206
Comments
Dear Alexis @alexisrosuel, thanks a lot for the suggestion. What you're describing sounds to me like an instance of the 'early stopping' problem, which is one of the multiple hypothesis testing issues: the more often you look at your p-value, the higher the probability of seeing spurious significance by chance. ExpAn supports early stopping in a highly experimental mode and tries to mitigate the risk of spurious early stopping by applying a stricter p-value threshold when there is less data than expected. But it always consumes all the data present in the dataframe. Let me know if I understood your question correctly. Best,
Hi Grisha, in fact the idea behind this chart (and the whole Airbnb Medium article) is the opposite. They wanted to point out that the p-value can fluctuate through time, go below the significance threshold, and then stay there forever, or not. The chart shows this: if you stop the experiment represented here around day 10, you commit a type 1 error. But if you let the experiment run for a few more days, you see that the p-value in fact "converges" around its true value. To recap, this does not provide an early stopping criterion. It helps to monitor whether the p-value still behaves erratically (so we can't stop the experiment at this moment), or whether it hasn't changed for a "long time" (to be defined). For me the ideal criterion is:
What do you think of it?
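A minimal sketch of what such a stability monitor could look like; the window length, tolerance, and function name are illustrative assumptions, not anything defined by ExpAn or the Airbnb article:

```python
import numpy as np

def pvalue_has_stabilized(daily_pvalues, window=7, tolerance=0.005):
    """Heuristic monitor: True once the last `window` daily p-values all lie
    within `tolerance` of each other, i.e. the p-value has stopped moving
    "for a long time". This only flags the end of the erratic phase; it is
    not by itself a statistically valid stopping rule."""
    recent = np.asarray(daily_pvalues[-window:], dtype=float)
    return len(recent) == window and recent.max() - recent.min() <= tolerance
```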
Please pardon my poor expression: what I meant in my first reply is exactly what you're talking about.
Our early stopping logic counteracts effects like this by tightening the alpha threshold at the beginning of the experiment (where you've got less data), so it's not 0.05 but much stricter for small quantities of data in the first days.
Oh, indeed, I see your point now too :) Yes, ExpAn uses a kind of "dynamic p-value threshold", so we could plot this value day by day, along with the observed p-value?
Yes the "dynamic threshold" is based on information fraction, which is ratio of current sample size and estimated sample size for the experiment. Here is the method we use: https://github.com/zalando/expan/blob/master/expan/core/early_stopping.py#L24-L36 |
Whether the analysis is day by day or over other periods depends on how your code calls ExpAn.
Context
It is very useful when running an A/B test to see the evolution of the difference / p-values / credible intervals / etc. through time. For instance, if I start an experiment on 2018-04-01 and finish it on 2018-04-30, I would like to know what the state was (in terms of p-value, etc.) each day. It helps to visualize whether the test has "converged" or not. (Source: https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7)
Proposition
Would it be possible to apply the statistical analysis sequentially, date by date? It could apply the analysis to the sequence

```python
[df[df.date <= dt.datetime(2018, 4, 1) + dt.timedelta(days=i)] for i in range(30)]
```

and then report the same JSON, but with a date level at the top. (Maybe there is a much cleaner architecture than this!) Thanks
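As a rough sketch of this proposition, assuming a dataframe with hypothetical 'date', 'variant', and 'kpi' columns, and with a Welch t-test standing in for the real ExpAn analysis call:

```python
import datetime as dt
import pandas as pd
from scipy import stats

def pvalues_by_date(df: pd.DataFrame, start: dt.datetime, days: int) -> pd.Series:
    """Re-run the analysis on the cumulative data available at each date.
    Assumes columns 'date', 'variant' ('control'/'treatment'), and a numeric
    'kpi'; a Welch t-test stands in for the real ExpAn analysis."""
    results = {}
    for i in range(days):
        cutoff = start + dt.timedelta(days=i)
        snapshot = df[df.date <= cutoff]  # all data observed up to this date
        control = snapshot.loc[snapshot.variant == "control", "kpi"]
        treatment = snapshot.loc[snapshot.variant == "treatment", "kpi"]
        if len(control) > 1 and len(treatment) > 1:
            results[cutoff] = stats.ttest_ind(control, treatment, equal_var=False).pvalue
    return pd.Series(results, name="pvalue")

# e.g. pvalues_by_date(df, dt.datetime(2018, 4, 1), 30) yields one p-value
# per day, ready to plot against the dynamic threshold discussed above.
```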