shouldn't compare just the means when deciding if a difference is significant #9
Comments
Sample patch: https://gist.github.com/4352787
So you're saying that it's enough for either distribution to be 95% likely to produce a value that's the mean of the other distribution? Hmm, I'm not a statistician, but I'm not quite satisfied with that test. Maybe I should find a statistician and see what they say. If you want to make a pull request adding this test as an option in addition to the mean test, I'd accept that.
Input from a statistician would certainly help. The current test in bench is statistically correct: it tests whether the mean of the difference is 0, i.e. whether the mean changed in a statistically significant way. Here is an example:

mean: 1.3614 ms, 95% CI: (1.3607 ms, 1.3617 ms)

The difference between the means is 22.5 us, which bench considers significant. It might make sense to use the standard error of the difference instead of the standard error of the mean of the difference in the t-test, i.e. to test whether the difference is 0, not whether the mean of the difference is 0, but I'm not sure about that.

To keep things simple, let's calculate a prediction interval for the difference between the two measurements. If 0 is part of that interval, conclude that the difference is not significant.
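The proposed test can be sketched in a few lines of Python (a minimal sketch, not the bench implementation; the function name and the toy data are hypothetical). The idea: form the paired differences, then build a 95% prediction interval as mean ± 1.96 × stdev; if 0 lies inside, the difference is within noise.

```python
import random
import statistics

def difference_prediction_interval(xs, ys, z=1.96):
    """95% prediction interval for the paired difference x - y.

    If 0 lies inside the interval, the difference between the two
    measurement series is within noise, i.e. not significant.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    m = statistics.mean(diffs)
    s = statistics.stdev(diffs)
    return (m - z * s, m + z * s)

# Toy data: two runs of the "same" benchmark, ~1.37 ms with Gaussian noise.
random.seed(0)
a = [1.37 + random.gauss(0, 0.04) for _ in range(1000)]
b = [1.37 + random.gauss(0, 0.04) for _ in range(1000)]

lo, hi = difference_prediction_interval(a, b)
print(lo <= 0.0 <= hi)  # 0 inside the interval -> difference not significant
```

Note that with 1000 samples the standard error of the mean is tiny, so a t-test on the mean of the difference can flag even a sub-percent shift as "significant" while the prediction interval, which scales with the full standard deviation, still comfortably contains 0.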
Pull request here: #10. There are some other minor build fixes here that you might wish to pull separately (and then regenerate with oasis 0.3.0).
I wonder why these build fixes weren't accepted; the version in
Failure to grab your build fixes from your repo. I've given you access to
I added the same function twice and ran bench. Most of the time, bench told me there was a difference between the two functions of around 0.7-1%.
I dumped and loaded the values into GNU R, and I got similar results with t.test (i.e. that the mean of the difference is not 0).
But if I compute the 95% prediction intervals (as mean ± 1.96 × stdev), I find that they overlap quite a lot, even though the confidence intervals of the means don't:
[1.29918, 1.45882]
[1.25772, 1.48428]
I think comparing the 95% prediction intervals of the measurements is better than testing whether the mean of the difference is 0, because if the difference is within measurement error/noise, it is probably not practically significant.
Example output from bench (increased precision to %.4f):
Measuring: System Clock
Warming up
Estimating clock resolution (1.1093 us)
Estimating cost of timer call (69.7196 ns)
Benchmarking: rand
Ran 1 iterations in 1.6859 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.6859 s
N: 1000 Inter-quartile width: 9.0599 us, Full range: (1.3508 ms, 1.7518 ms)
Outliers: 14 (1.4%) Low Mild, 28 (2.8%) High Mild, 43 (4.3%) High Severe,
mean: 1.3790 ms, 95% CI: (1.3757 ms, 1.3810 ms)
std.dev.: 40.7229 us, 95% CI: (29.2276 us, 46.3408 us)
Benchmarking: rand
Ran 1 iterations in 1.4322 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.4322 s
N: 1000 Inter-quartile width: 8.8215 us, Full range: (1.3420 ms, 2.6969 ms)
Outliers: 26 (2.6%) High Mild, 46 (4.6%) High Severe,
mean: 1.3710 ms, 95% CI: (1.3671 ms, 1.3739 ms)
std.dev.: 57.7947 us, 95% CI: (31.3308 us, 83.0438 us)
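As a sanity check, the prediction intervals quoted earlier can be recomputed from the means and standard deviations in this output (a minimal Python sketch; the function name is my own, working in milliseconds):

```python
# 95% prediction interval as mean +- 1.96 * stdev, rounded like the
# intervals quoted above.
def prediction_interval(mean_ms, stdev_ms, z=1.96):
    return (round(mean_ms - z * stdev_ms, 5), round(mean_ms + z * stdev_ms, 5))

run1 = prediction_interval(1.3790, 0.0407229)  # mean 1.3790 ms, sd 40.7229 us
run2 = prediction_interval(1.3710, 0.0577947)  # mean 1.3710 ms, sd 57.7947 us
print(run1)  # (1.29918, 1.45882)
print(run2)  # (1.25772, 1.48428)

# The intervals overlap heavily, so by this test the two (identical)
# functions are correctly judged indistinguishable.
overlap = run1[0] < run2[1] and run2[0] < run1[1]
print(overlap)  # True
```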