[Much useful information about confidence interval analysis omitted]
> Be careful out there, statistics is dangerous.
Indeed.
A big problem with the use of confidence intervals in analyzing the
differences in kernel performance via the benchmarks posted is that
we would be computing many confidence intervals simultaneously.
Without any adjustment, the nominal confidence levels given by the
standard statistical methods will be much greater than the actual
(simultaneous) confidence levels.
Here's why. The confidence level (95%) may be thought of in the following
way: If we repeatedly compute a large number of independent 95% confidence
intervals, then 95% of them will contain the true parameter of interest
(the mean difference on the benchmark between the two kernels in this case)
and 5% will not.
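As a quick illustration (my own sketch, not part of the original
discussion; the sample size, standard deviation, and trial count are
arbitrary), here is a small Python simulation that builds many 95%
confidence intervals for a known mean and counts how often they cover it:

```python
import random
import math

random.seed(0)
TRUE_MEAN = 10.0   # the parameter the intervals are trying to capture
N = 30             # observations per interval
TRIALS = 2000      # number of independent intervals
Z = 1.96           # normal critical value for a 95% interval

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(N)]
    mean = sum(sample) / N
    var = sum((x - mean) ** 2 for x in sample) / (N - 1)
    half = Z * math.sqrt(var / N)   # half-width of the interval
    if mean - half <= TRUE_MEAN <= mean + half:
        covered += 1

print(covered / TRIALS)   # should come out close to 0.95
```

Roughly 95% of the simulated intervals contain the true mean, and the
remaining ~5% miss it, exactly as the definition says.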
Now suppose that we compute the benchmarks on two kernels, compute 95%
confidence intervals, and flag any confidence intervals that do not
contain the number 0 (since this suggests that the true mean difference
is nonzero). The previous paragraph tells us that EVEN IF THE KERNELS
ARE EXACTLY THE SAME, we'd expect that 5% of the intervals would be
flagged (incorrectly).
So what can we do? There are two options.
1. Use a procedure for computing the confidence intervals that takes into
account the fact that we're computing lots of intervals. There are many
methods for this; a simple example is given in Moore and McCabe,
"Introduction to the practice of statistics", 1989, p. 744.
2. Don't try to perform formal statistical inference on the benchmarks.
Instead, flag those with large differences (or, better, large standardized
differences, where we standardize by dividing by the estimated
standard error), and then try to figure out whether something
changed in the kernel that could cause such a difference.
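For option 1, the simplest adjustment is the Bonferroni method: if you
compute k intervals simultaneously, build each one at level 1 - 0.05/k
instead of 0.95. The sketch below (my own illustration; k = 20 is an
arbitrary choice) shows how much wider the critical value becomes:

```python
from statistics import NormalDist

def z_crit(alpha):
    """Two-sided normal critical value for a (1 - alpha) interval."""
    return NormalDist().inv_cdf(1 - alpha / 2)

k = 20        # number of benchmarks compared simultaneously
alpha = 0.05  # desired overall (familywise) error rate

per_test = z_crit(alpha)       # ~1.96, the usual 95% value
adjusted = z_crit(alpha / k)   # roughly 3.02 for k = 20

print(per_test, adjusted)
```

Each individual interval gets noticeably wider, which is the price paid
so that the chance of *any* spurious flag across all k benchmarks stays
near 5%.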
I would also suggest that this sort of discussion of the analysis of
the benchmarks (in general) be moved off the kernel mailing list for
now.
Vince Melfi
Dept. of Statistics and Probability
Michigan State University
vince@stt.msu.edu