Re: [patch 9/9] Scheduler profiling - Use conditional calls

From: Mathieu Desnoyers
Date: Fri Jun 01 2007 - 12:33:44 EST


* Andrew Morton (akpm@xxxxxxxxxxxxxxxxxxxx) wrote:
> On Fri, 1 Jun 2007 11:54:13 -0400 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxx> wrote:
>
> > * Andrew Morton (akpm@xxxxxxxxxxxxxxxxxxxx) wrote:
> > > On Wed, 30 May 2007 10:00:34 -0400
> > > Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxx> wrote:
> > >
> > > > @@ -2990,7 +2991,8 @@
> > > > print_irqtrace_events(prev);
> > > > dump_stack();
> > > > }
> > > > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > > > + cond_call(profile_on,
> > > > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
> > > >
> > >
> > > That's looking pretty neat. Do you have any before-and-after performance
> > > figures for i386 and for a non-optimised architecture?
> >
> > Sure, here is the result of a small test comparing:
> > 1 - Branch depending on a cache miss (has to fetch from memory, caused by a
> > 128-byte stride). This test is likely to resemble the side effect
> > the original profile_hit code was causing, under the
> > assumption that the kernel is already using L1 and L2 caches at
> > their full capacity and that a supplementary data load would cause
> > cache thrashing.
> > 2 - Branch depending on L1 cache hit. Just for comparison.
> > 3 - Branch depending on a load immediate in the instruction stream.
> >
> > It has been compiled with gcc -O2. Tests done on a 3GHz P4.
> >
> > In the first test series, the branch is not taken:
> >
> > number of tests : 1000
> > number of branches per test : 81920
> > memory hit cycles per iteration (mean) : 48.252
> > L1 cache hit cycles per iteration (mean) : 16.1693
> > instruction stream based test, cycles per iteration (mean) : 16.0432
> >
> >
> > In the second test series, the branch is taken and an integer is
> > incremented within the block:
> >
> > number of tests : 1000
> > number of branches per test : 81920
> > memory hit cycles per iteration (mean) : 48.2691
> > L1 cache hit cycles per iteration (mean) : 16.396
> > instruction stream based test, cycles per iteration (mean) : 16.0441
> >
> > Therefore, the memory fetch based test seems to be 200% slower than the
> > load immediate based test.
>
> Confused. From what did you calculate that 200%?
>
> > (I am adding these results to the documentation)
>
> Good, thanks.


(48.2691 - 16.0441) / 16.0441 = 2.01

In other words, the test runs about 200% slower when the branch condition
is fetched from main memory than when it comes from a load immediate.

Put another way, the speedup of the load immediate over the memory fetch
is about 3:

48.2691 / 16.0441 = 3.01

Is there a preferred way to present these results in the documentation?

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68