
shouldn't compare just the means when deciding if a difference is significant #9

Open
edwintorok opened this issue Dec 21, 2012 · 6 comments


@edwintorok
Collaborator

I added the same function twice and ran bench. Most of the time bench told me there was a difference between the two functions of around 0.7-1%.
I dumped the values and loaded them into GNU R, where t.test gave similar results (i.e. that the mean of the difference is not 0).

But if I compute the 95% prediction intervals (as mean ± 1.96 × stdev), they overlap quite a lot, even though the confidence intervals of the means don't:
[1.29918, 1.45882]
[1.25772, 1.48428]

I think that comparing the 95% prediction intervals of the measurements is better than testing whether the mean of the difference is 0, because if the difference is within measurement error/noise it is probably not practically significant.
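
As a concrete illustration, here is a minimal OCaml sketch of that overlap test. The stats record and the function names are assumptions made for this example, not bench's actual types or API:

(* A minimal sketch of the overlap test; the stats record is an
   assumption for illustration, not bench's actual API. *)
type stats = { mean : float; stdev : float }

(* Approximate 95% prediction interval: mean ± 1.96 × stdev. *)
let prediction_interval s =
  (s.mean -. 1.96 *. s.stdev, s.mean +. 1.96 *. s.stdev)

(* Two intervals overlap iff each lower bound lies below the other's
   upper bound. *)
let overlaps (lo1, hi1) (lo2, hi2) = lo1 <= hi2 && lo2 <= hi1

(* Treat the difference as practically significant only when the
   prediction intervals are disjoint. *)
let practically_significant a b =
  not (overlaps (prediction_interval a) (prediction_interval b))

Feeding in the means and standard deviations from the output below (1.3790 ms / 40.7229 us and 1.3710 ms / 57.7947 us, with the standard deviations converted to ms) reproduces the two intervals quoted above, and practically_significant returns false.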

Example output from bench (increased precision to %.4f):

Measuring: System Clock
Warming up
Estimating clock resolution (1.1093 us)
Estimating cost of timer call (69.7196 ns)
Benchmarking: rand
Ran 1 iterations in 1.6859 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.6859 s
N: 1000 Inter-quartile width:9.0599 us, Full range: (1.3508 ms,1.7518 ms)
Outliers: 14 (1.4%) Low Mild, 28 (2.8%) High Mild, 43 (4.3%) High Severe,
mean: 1.3790 ms, 95% CI: (1.3757 ms, 1.3810 ms)
std.dev.: 40.7229 us, 95% CI: (29.2276 us, 46.3408 us)

Benchmarking: rand
Ran 1 iterations in 1.4322 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.4322 s
N: 1000 Inter-quartile width:8.8215 us, Full range: (1.3420 ms,2.6969 ms)
Outliers: 26 (2.6%) High Mild, 46 (4.6%) High Severe,
mean: 1.3710 ms, 95% CI: (1.3671 ms, 1.3739 ms)
std.dev.: 57.7947 us, 95% CI: (31.3308 us, 83.0438 us)

@edwintorok
Collaborator Author

Sample patch: https://gist.github.com/4352787
Alternatively, there could be two summarize functions: one that cares only about the mean, and one that cares about prediction intervals.

@thelema
Owner

thelema commented Dec 22, 2012

So you're saying that it's enough for either distribution to be 95% likely to produce a value that's the mean of the other distribution? Hmm, I'm not a statistician, but I'm not quite satisfied with that test. Maybe I should find a statistician and see what they say. If you want to make a pull request adding this test as an option in addition to the mean test, I'd accept that.

@edwintorok
Collaborator Author

Input from a statistician would certainly help.

The current test in bench is statistically correct: it tests whether the mean of the difference is 0, i.e. whether the mean changed in a statistically significant way.
However, when comparing the running times of the two functions I'm not only interested in whether the mean changed, but also in whether that change is meaningful given the measurement error/noise.

Here is an example:
mean: 1.3839 ms, 95% CI: (1.3825 ms, 1.3872 ms)
std.dev.: 33.3815 us, 95% CI: (21.0165 us, 39.5632 us)

mean: 1.3614 ms, 95% CI: (1.3607 ms, 1.3617 ms)
std.dev.: 8.0198 us, 95% CI: (5.8246 us, 9.7001 us)

The difference between the means is 22.5 us, which bench considers significant.
But the measurement errors (1.96 × stdev) are ±65 us and ±16 us respectively, so the first error alone is larger than the difference in means.
If I compute the standard deviation of the difference, sqrt(s1^2 + s2^2), I get about 34 us; 22.5 us is still within that error.

It might make sense to use the standard error of the difference instead of the standard error of the mean of the difference in the t-test, i.e. to test whether the difference is 0 rather than whether the mean of the difference is 0, but I'm not sure about that.

To keep things simple, let's calculate a prediction interval for the difference between the two measurements; if 0 falls inside that interval, conclude that the difference is not significant.
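
A minimal OCaml sketch of that difference-based test, with the same assumed stats record as in the earlier sketch (again, not bench's actual API):

(* Same assumed record as in the earlier sketch. *)
type stats = { mean : float; stdev : float }

(* Standard deviation of the difference of two independent
   measurements: sqrt(s1^2 + s2^2). *)
let stdev_of_difference a b =
  sqrt ((a.stdev ** 2.) +. (b.stdev ** 2.))

(* 95% prediction interval for the difference; the change counts as
   significant only if 0 lies outside that interval, i.e. only if
   |mean_a - mean_b| > 1.96 × the stdev of the difference. *)
let difference_significant a b =
  abs_float (a.mean -. b.mean) > 1.96 *. stdev_of_difference a b

With the numbers from the example above (a difference of 22.5 us and a combined standard deviation of about 34 us), 1.96 × 34 us ≈ 67 us, so difference_significant returns false.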

@edwintorok
Collaborator Author

Pull request here: #10

There are some other minor build fixes here that you might wish to pull separately (and then regenerate with oasis 0.3.0):
https://github.com/edwintorok/bench/commits/master

@superbobry

I wonder why these build fixes weren't accepted; the version in master fails to build precisely because the executables are missing a dependency on bench.

@thelema
Owner

thelema commented May 9, 2013

My failure to grab your build fixes from your repo. I've given you access to the bench repo; feel free to push your fixes.

