Test differences date by date #206
Comments
Dear Alexis @alexisrosuel, thanks a lot for the suggestion. What you're describing sounds to me like an instance of the 'early stopping' problem, which is one of the multiple hypothesis testing issues: the more often you look at your p-value, the higher the probability of seeing spurious significance by chance. ExpAn supports early stopping in a highly experimental mode and tries to mitigate the risk of spurious early stopping by applying a stricter p-value threshold when there is less data than expected. But it always consumes all the data present in the dataframe. Let me know if I understood your question correctly. Best,
Hi Grisha, in fact the idea behind this chart (and the whole Airbnb Medium article) is the opposite. They wanted to point out that the p-value can fluctuate through time, go below the significance threshold, and then stay there forever, or not. The chart shows this: if you stop the experiment represented here around day 10, you commit a type 1 error. But if you let the experiment run for a few more days, you see that the p-value in fact "converges" around its true value. To recap, this does not provide an early stopping criterion. It helps to monitor whether the p-value still behaves erratically (so we can't stop the experiment at this moment), or whether it hasn't changed for a "long time" (to be defined). For me the ideal criterion is:
What do you think of it?
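A minimal sketch of what such a stability monitor could look like; the window length, tolerance, and function name are illustrative assumptions, not anything defined by ExpAn or the Airbnb article:

```python
import numpy as np

def pvalue_has_stabilized(daily_pvalues, window=7, tolerance=0.005):
    """Heuristic monitor: True once the last `window` daily p-values all lie
    within `tolerance` of each other, i.e. the p-value has stopped moving
    "for a long time". This only flags the end of the erratic phase; it is
    not by itself a statistically valid stopping rule."""
    recent = np.asarray(daily_pvalues[-window:], dtype=float)
    return len(recent) == window and recent.max() - recent.min() <= tolerance
```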
Please pardon my poor expression: what I meant in my first reply is exactly what you're talking about.
Our early stopping logic counteracts effects like this by tightening the alpha threshold at the beginning of the experiment (where you've got less data), so it's not 0.05 but much stricter for small quantities of data in the first days.
Oh, indeed, I see your point now too :) Yes, ExpAn uses a kind of "dynamic p-value threshold", so we could plot this value day by day, along with the observed p-value?
Yes the "dynamic threshold" is based on information fraction, which is ratio of current sample size and estimated sample size for the experiment. Here is the method we use: https://github.com/zalando/expan/blob/master/expan/core/early_stopping.py#L24-L36 |
Whether the analysis is day by day or over other periods depends on how your code calls ExpAn.
Context
It is very useful when running an A/B test to see the evolution of the difference / p-values / credible intervals / etc. through time. For instance, if I start an experiment on 2018-04-01 and finish it on 2018-04-30, I would like to know what the state was (in terms of p-value, etc.) each day. It helps to visualize whether the test has "converged" or not. (Source: https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7)
Proposition
Would it be possible to apply the statistical analysis sequentially, date by date? It could apply the analysis to the sequence

```python
[df[df.date <= dt.datetime(2018, 4, 1) + dt.timedelta(days=i)] for i in range(30)]
```

and then report the same JSON, but with a date level at the top. (Maybe there is a much cleaner architecture than this!) Thanks
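As a rough sketch of this proposition, assuming a dataframe with hypothetical 'date', 'variant', and 'kpi' columns, and with a Welch t-test standing in for the real ExpAn analysis call:

```python
import datetime as dt
import pandas as pd
from scipy import stats

def pvalues_by_date(df: pd.DataFrame, start: dt.datetime, days: int) -> pd.Series:
    """Re-run the analysis on the cumulative data available at each date.
    Assumes columns 'date', 'variant' ('control'/'treatment'), and a numeric
    'kpi'; a Welch t-test stands in for the real ExpAn analysis."""
    results = {}
    for i in range(days):
        cutoff = start + dt.timedelta(days=i)
        snapshot = df[df.date <= cutoff]  # all data observed up to this date
        control = snapshot.loc[snapshot.variant == "control", "kpi"]
        treatment = snapshot.loc[snapshot.variant == "treatment", "kpi"]
        if len(control) > 1 and len(treatment) > 1:
            results[cutoff] = stats.ttest_ind(control, treatment, equal_var=False).pvalue
    return pd.Series(results, name="pvalue")

# e.g. pvalues_by_date(df, dt.datetime(2018, 4, 1), 30) yields one p-value
# per day, ready to plot against the dynamic threshold discussed above.
```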