Re: [PATCH 0/4] jump label patches

From: Roland McGrath
Date: Tue Oct 06 2009 - 01:40:52 EST


I am, of course, fully in favor of this hack. This version raises a new
concern for me vs what we had discussed before. I don't know what the
conclusion about this should be, but I think it should be aired.

In the previous plan, we had approximately:

    asm goto ("1:" P6_NOP5
              ".pushsection __jump_table\n"
              _ASM_PTR "1b, %l[do_trace]\n"
              ".popsection" : : : : do_trace);
    if (0) { do_trace: ... tracing_path(); ... }
    ... hot_path(); ...

That is, the straight-line code path is a 5-byte nop. To enable the
"static if" at runtime, we replace that with a "jmp .Ldo_trace".
So, disabled:

    0x1:   nopl
    0x6:   hot path
           ...
    0x100: ret            # or jmp somewhere else, whatever
           ...
    0x234: tracing path   # never reached
           ...
    0x250: jmp 0x6

and enabled:

    0x1:   jmp 0x234
    0x6:   hot path
           ...
    0x100: ret
           ...
    0x234: tracing path
           ...
    0x250: jmp 0x6
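
To make the nop-to-jmp replacement concrete, here is a minimal sketch of how the five replacement bytes could be computed. It assumes the x86 "jmp rel32" encoding (opcode 0xE9 followed by a little-endian 32-bit displacement measured from the end of the instruction). The helper name encode_jmp_rel32 is hypothetical and only illustrates the arithmetic; actually writing the bytes over live kernel text safely is a separate problem that this does not address.

```c
#include <assert.h>
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_SIZE   5

/* The 5-byte nop ("nopl 0x0(%rax,%rax,1)") that occupies the site when disabled. */
static const uint8_t nop5[JMP_REL32_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Hypothetical helper: fill buf with the bytes of "jmp target_addr" as they
 * would appear if the instruction were located at site_addr.
 */
static void encode_jmp_rel32(uint8_t buf[JMP_REL32_SIZE],
                             uint64_t site_addr, uint64_t target_addr)
{
    /* The displacement is relative to the address of the next instruction. */
    int64_t disp = (int64_t)target_addr - (int64_t)(site_addr + JMP_REL32_SIZE);
    uint32_t rel;

    /* The target must be reachable with a signed 32-bit displacement. */
    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    rel = (uint32_t)(int32_t)disp;

    buf[0] = JMP_REL32_OPCODE;
    buf[1] = (uint8_t)(rel & 0xff);          /* little-endian displacement */
    buf[2] = (uint8_t)((rel >> 8) & 0xff);
    buf[3] = (uint8_t)((rel >> 16) & 0xff);
    buf[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

With the addresses from the listing, the site at 0x1 jumping to 0x234 gets a displacement of 0x234 - 0x6 = 0x22e, and the jump back from 0x250 to 0x6 gets a negative displacement.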


In your new plan, instead we now have approximately:

    asm goto ("1: jmp %l[dont_trace]\n"
              ".pushsection __jump_table\n"
              _ASM_PTR "1b, %l[dont_trace]\n"
              ".popsection" : : : : dont_trace);
    ... tracing path ...
    dont_trace:
    ... hot_path(); ...

That is, we've inverted the sense of the control flow: the straight-line
code path is the tracing path, and in default "disabled" state we jump
around the tracing path to get to the hot path.
So, disabled:

    0x1:   jmp 0x1f
    0x3:   tracing path   # never reached
           ...
    0x1f:  hot path
           ...
    0x119: ret

and enabled:

    0x1:   jmp 0x3
    0x3:   tracing path
           ...
    0x1f:  hot path
           ...
    0x119: ret
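
The inverted sense can be sketched as a choice of jump displacement at the patch site. This assumes the 2-byte short form "jmp rel8" (opcode 0xEB), which is only inferred from the addresses in the listing (0x1 followed by 0x3); a real tracing path may well be too long for rel8 and need the 5-byte form. The struct jump_site layout and the helper name are hypothetical, loosely modeled on the two pointers that the asm pushes into __jump_table.

```c
#include <assert.h>
#include <stdint.h>

#define JMP_REL8_OPCODE 0xEB
#define JMP_REL8_SIZE   2

/* Hypothetical mirror of one __jump_table entry: patch site and jump target. */
struct jump_site {
    uint64_t code;    /* address of the jmp instruction (label "1:") */
    uint64_t target;  /* address of dont_trace, i.e. the hot path */
};

/*
 * Hypothetical helper: compute the two bytes for the site.
 * Disabled: jump over the tracing path to js->target.
 * Enabled:  displacement 0, so execution falls into the tracing path.
 */
static void encode_site(uint8_t buf[JMP_REL8_SIZE],
                        const struct jump_site *js, int enabled)
{
    int64_t disp = enabled
        ? 0
        : (int64_t)js->target - (int64_t)(js->code + JMP_REL8_SIZE);

    /* A short jump only reaches targets within a signed 8-bit displacement. */
    assert(disp >= INT8_MIN && disp <= INT8_MAX);

    buf[0] = JMP_REL8_OPCODE;
    buf[1] = (uint8_t)(int8_t)disp;
}
```

For the listing's site at 0x1 with the hot path at 0x1f, the disabled displacement is 0x1f - 0x3 = 0x1c. Either way a jmp is executed on every pass, which is exactly the cost in question.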


As I understand it, the point of the exercise is to optimize the "disabled"
case to as close as possible to what we'd get with no tracing path compiled
in at all. In the first example (with "nopl"), it's easy to see why we
presume the added cost is close to epsilon: the execution cost of the 5-byte
nop, plus the indirect effect of those 5 bytes polluting the I-cache. We
only really know when we measure, but that seems likely to be minimally
obtrusive.

In the second example (with "jmp around"), I really wonder what the actual
overhead is. There's the cost of the jmp itself, plus whatever extra jumps
do to branch prediction or pipelines (of which I know not much), plus the
entire tracing path sitting right there adjacent, using up I-cache space
that would otherwise be keeping more of the hot path hot. I'm sure others on
the list have more insight than I do into what specific performance impacts
we can expect from one code sequence or the other on various chips.

Of course, a first important point is what the actual compiled code
sequences look like. I'm hoping Richard (who implemented the compiler
feature for us) can help us with making sure our expectations jibe with the
code we'll really get. There's no benefit in optimizing our asm not to
introduce a jump into the hot path if the compiler actually generates the
tracing path first and gives the hot path a "jmp" around it anyway.

The code example above assumes that "if (0)" is enough for the compiler to
put that code fork (where the "do_trace:" label is) somewhere out of the
straight-line path rather than jumping around it. On the "belt and
suspenders" theory of being thoroughly explicit to the compiler about what
we intend, I'd go for:

    if (__builtin_expect(0,0)) do_trace: __attribute__((cold)) { ... }

But we need Richard et al to tell us what actually makes a difference to
the compiler's optimizer, and will reliably continue to do so in the future.
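
For anyone who wants to look at what the compiler does with these hints, here is a small standalone sketch. It is not the jump label mechanism: an ordinary runtime flag (the hypothetical tracing_enabled) stands in for the patched instruction, and the cold attribute is applied to a function rather than to a label. All names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime flag standing in for the patched jump site. */
static bool tracing_enabled;
static int trace_calls;

/* Marked cold and noinline so the compiler is told to keep it off the hot path. */
__attribute__((cold, noinline))
static void tracing_path(int value)
{
    /* This path should only ever run while tracing is switched on. */
    assert(tracing_enabled);
    (void)value;
    trace_calls++;
}

static int hot_path(int value)
{
    /* Tell the compiler the tracing branch is expected not to be taken. */
    if (__builtin_expect(tracing_enabled, 0))
        tracing_path(value);

    return value * 2;
}
```

Compiling this with optimization and reading the disassembly of hot_path() shows whether the call to tracing_path() was placed out of line on a given compiler version. It says nothing about what future versions will do, which is the question for Richard.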


Thanks,
Roland