
shouldn't compare just the means when deciding if a difference is significant #9

Open
edwintorok opened this issue Dec 21, 2012 · 6 comments


@edwintorok
Collaborator

I added the same function twice and ran bench. Most of the time bench told me there was a difference between the two functions of around 0.7-1%.
I dumped the values and loaded them into GNU R, where t.test gave similar results (i.e. that the mean of the difference is not 0).

But if I compute the 95% prediction intervals (as mean ± 1.96 × stdev), they overlap quite a lot, even though the confidence intervals of the means don't:
[1.29918, 1.45882]
[1.25772, 1.48428]

I think that comparing the 95% prediction intervals of the measurements is better than testing whether the mean of the difference is 0, because if the difference is within measurement error/noise it is probably not practically significant.
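
As a concrete illustration, here is a minimal OCaml sketch of that overlap test. The stats record and the function names are assumptions made for this example, not bench's actual types or API:

(* A minimal sketch of the overlap test; the stats record is an
   assumption for illustration, not bench's actual API. *)
type stats = { mean : float; stdev : float }

(* Approximate 95% prediction interval: mean ± 1.96 × stdev. *)
let prediction_interval s =
  (s.mean -. 1.96 *. s.stdev, s.mean +. 1.96 *. s.stdev)

(* Two intervals overlap iff each lower bound lies below the other's
   upper bound. *)
let overlaps (lo1, hi1) (lo2, hi2) = lo1 <= hi2 && lo2 <= hi1

(* Treat the difference as practically significant only when the
   prediction intervals are disjoint. *)
let practically_significant a b =
  not (overlaps (prediction_interval a) (prediction_interval b))

Feeding in the means and standard deviations from the output below (1.3790 ms / 40.7229 us and 1.3710 ms / 57.7947 us, with the standard deviations converted to ms) reproduces the two intervals quoted above, and practically_significant returns false.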

Example output from bench (increased precision to %.4f):

Measuring: System Clock
Warming up
Estimating clock resolution (1.1093 us)
Estimating cost of timer call (69.7196 ns)
Benchmarking: rand
Ran 1 iterations in 1.6859 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.6859 s
N: 1000 Inter-quartile width:9.0599 us, Full range: (1.3508 ms,1.7518 ms)
Outliers: 14 (1.4%) Low Mild, 28 (2.8%) High Mild, 43 (4.3%) High Severe,
mean: 1.3790 ms, 95% CI: (1.3757 ms, 1.3810 ms)
std.dev.: 40.7229 us, 95% CI: (29.2276 us, 46.3408 us)

Benchmarking: rand
Ran 1 iterations in 1.4322 ms
Collecting 1000 samples, 1 iterations each, estimated time: 1.4322 s
N: 1000 Inter-quartile width:8.8215 us, Full range: (1.3420 ms,2.6969 ms)
Outliers: 26 (2.6%) High Mild, 46 (4.6%) High Severe,
mean: 1.3710 ms, 95% CI: (1.3671 ms, 1.3739 ms)
std.dev.: 57.7947 us, 95% CI: (31.3308 us, 83.0438 us)

@edwintorok
Collaborator Author

Sample patch: https://gist.github.com/4352787
Alternatively, there could be two summarize functions: one that cares only about the mean, and one that cares about prediction intervals.

@thelema
Owner

thelema commented Dec 22, 2012

So you're saying that it's enough for either distribution to be 95% likely to produce a value that's the mean of the other distribution? Hmm, I'm not a statistician, but I'm not quite satisfied with that test. Maybe I should find a statistician and see what they say. If you want to make a pull request adding this test as an option in addition to the mean test, I'd accept that.

@edwintorok
Collaborator Author

Input from a statistician would certainly help.

The current test in bench is statistically correct: it tests whether the mean of the difference is 0, i.e. whether the mean changed in a statistically significant way.
However, when comparing the running times of the two functions I'm not only interested in whether the mean changed, but also in whether that change is meaningful given the measurement error/noise.

Here is an example:
mean: 1.3839 ms, 95% CI: (1.3825 ms, 1.3872 ms)
std.dev.: 33.3815 us, 95% CI: (21.0165 us, 39.5632 us)

mean: 1.3614 ms, 95% CI: (1.3607 ms, 1.3617 ms)
std.dev.: 8.0198 us, 95% CI: (5.8246 us, 9.7001 us)

The difference between the means is 22.5 us, which bench considers significant.
But the measurement errors (1.96 × stdev) are ±65 us and ±16 us respectively, so the first error alone is larger than the difference in means.
If I compute the standard deviation of the difference, sqrt(s1^2 + s2^2), I get about 34 us; 22.5 us is still within that error.

It might make sense to use the standard error of the difference instead of the standard error of the mean of the difference in the t-test, i.e. to test whether the difference is 0 rather than whether the mean of the difference is 0, but I'm not sure about that.

To keep things simple, let's calculate a prediction interval for the difference between the two measurements; if 0 falls inside that interval, conclude that the difference is not significant.
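
A minimal OCaml sketch of that difference-based test, with the same assumed stats record as in the earlier sketch (again, not bench's actual API):

(* Same assumed record as in the earlier sketch. *)
type stats = { mean : float; stdev : float }

(* Standard deviation of the difference of two independent
   measurements: sqrt(s1^2 + s2^2). *)
let stdev_of_difference a b =
  sqrt ((a.stdev ** 2.) +. (b.stdev ** 2.))

(* 95% prediction interval for the difference; the change counts as
   significant only if 0 lies outside that interval, i.e. only if
   |mean_a - mean_b| > 1.96 × the stdev of the difference. *)
let difference_significant a b =
  abs_float (a.mean -. b.mean) > 1.96 *. stdev_of_difference a b

With the numbers from the example above (a difference of 22.5 us and a combined standard deviation of about 34 us), 1.96 × 34 us ≈ 67 us, so difference_significant returns false.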

@edwintorok
Collaborator Author

Pull request here: #10

There are some other minor build fixes here that you might wish to pull separately (and then regenerate with oasis 0.3.0):
https://github.com/edwintorok/bench/commits/master

@superbobry

I wonder why these build fixes weren't accepted; the version in master fails to build precisely because the executables are missing a dependency on bench.

@thelema
Owner

thelema commented May 9, 2013

My failure to grab your build fixes from your repo. I've given you access to the bench repo; feel free to push your fixes.

