Re: [RFC PATCH 00/10] x86: undwarf unwinder

From: Andy Lutomirski
Date: Thu Jun 01 2017 - 09:50:35 EST


On Thu, Jun 1, 2017 at 5:47 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Thu, Jun 01, 2017 at 02:17:21PM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 01, 2017 at 06:58:20AM -0500, Josh Poimboeuf wrote:
>> > > Being able to generate more optimal code in the hottest code paths of the kernel
>> > > is the _real_, primary upstream kernel benefit of a different debuginfo method -
>> > > which has to be weighed against the pain of introducing a new unwinder. But this
>> > > submission does not talk about that aspect at all, which should be fixed I think.
>> >
>> > Actually I devoted an entire one-sentence paragraph to performance in
>> > the documentation:
>> >
>> > The simpler debuginfo format also enables the unwinder to be relatively
>> > fast, which is important for perf and lockdep.
>> >
>> > But I'll try to highlight that a little more.
>>
>> That's relative to a DWARF unwinder.
>
> Yes.
>
>> It doesn't appear to be possible to get anywhere near a frame-pointer
>> unwinder due to having to do this log(n) lookup for every single
>> frame.
>
> Hm, is there something faster, yet not substantially bigger? Hash?
> Trie?

You have, roughly, a set of (key_start, value) pairs where, for any
given key, you want to find the (key_start, value) with the largest
key_start that doesn't exceed key. Binary search gives you log_2(n)
queries, but its locality of reference sucks. Here are three
suggestions for improving it:

1. Change the data layout. Instead of having an array of undwarf
entries, have two parallel arrays, one with the ip addresses and one
with everything else. This has no effect on the amount of space used,
but it makes the part used during search more compact.
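
Concretely, #1 might look something like this rough sketch (totally
untested; all the names and fields below are made up rather than taken
from your series):

#include <linux/types.h>

/*
 * Split layout: the sorted start addresses get their own array so the
 * binary search only pulls that array into cache; the rest of each
 * entry is only read once the index is known.
 */
struct undwarf_rest {
        /* whatever else the entry carries: cfa reg/offset, flags, ... */
        s16     cfa_offset;
        u8      cfa_reg;
        u8      flags;
};

extern const unsigned long undwarf_ip[];        /* sorted start addresses */
extern const struct undwarf_rest undwarf_rest[];
extern const unsigned int undwarf_nr;

/* Index of the last entry whose start address is <= ip, or -1 if none. */
static int undwarf_lookup(unsigned long ip)
{
        unsigned int lo = 0, hi = undwarf_nr;

        while (lo < hi) {
                unsigned int mid = lo + (hi - lo) / 2;

                if (undwarf_ip[mid] <= ip)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return (int)lo - 1;
}

(Storing 32-bit offsets from some base instead of full addresses would
roughly halve the searched array, too.)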

2. Your key space is fairly small and your table entries should be
reasonably uniformly distributed. Let the first IP you have unwind
data for be IP0. For some suitable block size B, make an array mapping
each block number (IP - IP0) / B to the index of the unwind entry
covering that block's start address. Then, to look up an IP, you'd read
that array at (IP - IP0) / B and (IP - IP0) / B + 1 and binary search
between the two resulting indices. With
constant B, this gives you O(1) performance instead of O(log(n)).
With B = 1, it's very fast, but the table is huge. With B = 64k or
so, maybe you'd get a nice tradeoff of speedup vs size. (With
modules, you'd presumably first search an rbtree to find which
instance of this data structure you're using and then do the lookup.)
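
A rough sketch of #2, building on the same made-up names, with the ip
table switched to 32-bit offsets from IP0 and B chosen as a power of
two so the divide is just a shift:

#include <linux/types.h>

#define UNDWARF_BLOCK_SHIFT     16              /* B = 64k */

extern unsigned long undwarf_ip0;               /* IP0 */
extern const u32 undwarf_ip[];                  /* sorted offsets from IP0 */

/*
 * One entry per 64k block of text, plus a trailing sentinel:
 * undwarf_fast[b] is the index of the last undwarf_ip[] value that is
 * <= (b << UNDWARF_BLOCK_SHIFT), i.e. the entry covering the block's
 * start address.  Built once at link or boot time.
 */
extern const u32 undwarf_fast[];

static int undwarf_lookup_fast(unsigned long ip)
{
        /* caller has already checked that ip falls within the covered text */
        u32 off = ip - undwarf_ip0;
        u32 block = off >> UNDWARF_BLOCK_SHIFT;
        unsigned int lo = undwarf_fast[block];
        unsigned int hi = undwarf_fast[block + 1] + 1;  /* exclusive bound */

        /* binary search confined to one block's worth of entries */
        while (lo < hi) {
                unsigned int mid = lo + (hi - lo) / 2;

                if (undwarf_ip[mid] <= off)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo - 1;
}

At B = 64k the fast table costs 4 bytes per 64k of text, which is noise
next to the entries themselves.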

3. Use a B-tree. B-trees are simple if you don't need to deal with
insertion and deletion. Presumably you'd size the nodes so each
internal node is exactly 64 or 128 bytes for good cache performance.
This is still O(log(n)) and it uses more comparisons
than binary search, but you touch many fewer cache lines.
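
If you did want to try #3, the lookup side is only a few lines. A rough
sketch with a statically built tree, explicit child indices, and
128-byte nodes (hypothetical names again):

#include <linux/types.h>

#define UNODE_KEYS      10      /* 10 keys + 10 entries + 11 children + count = 128 bytes */

struct undwarf_bnode {
        u32     key[UNODE_KEYS];        /* sorted ip offsets */
        s32     entry[UNODE_KEYS];      /* undwarf entry index for key[i] */
        s32     child[UNODE_KEYS + 1];  /* node indices, -1 = no child */
        u32     nkeys;
};

extern const struct undwarf_bnode undwarf_btree[];      /* node 0 is the root */

/* Entry index for the largest key <= off, or -1 if off precedes all keys. */
static int undwarf_btree_lookup(u32 off)
{
        int best = -1;
        int n = 0;

        while (n >= 0) {
                const struct undwarf_bnode *node = &undwarf_btree[n];
                unsigned int i = 0;

                /* linear scan is fine: the whole node is two cache lines */
                while (i < node->nkeys && node->key[i] <= off)
                        i++;

                if (i > 0)
                        best = node->entry[i - 1];      /* best predecessor so far */

                /* keys in child[i] lie between key[i-1] and key[i] */
                n = node->child[i];
        }
        return best;
}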

I expect that, if you do #1 and #2, you'd get excellent performance at
very little cost to the complexity of the code.