Re: [tip:x86/asm] x86: Pack function addresses tightly as well

From: Ingo Molnar
Date: Sun May 17 2015 - 03:10:05 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> > Put another way: I suspect this is more likely to hurt, and less
> > likely to help than the others.
>
> Yeah, indeed.
>
> So my thinking was that it would help, because:
>
> - There's often locality of reference between functions: we often
> have a handful of hot functions that are sitting next to each
> other and they could thus be packed closer to each other this way,
> creating a smaller net I$ footprint.
>
> - We have a handful of 'clusters' of small and often hot functions,
> especially in the locking code:
>
> ffffffff81893080 T _raw_spin_unlock_irqrestore
> ffffffff81893090 T _raw_read_unlock_irqrestore
> ffffffff818930a0 T _raw_write_unlock_irqrestore
> ffffffff818930b0 T _raw_spin_trylock_bh
> ffffffff81893110 T _raw_spin_unlock_bh
> ffffffff81893130 T _raw_read_unlock_bh
> ffffffff81893150 T _raw_write_unlock_bh
> ffffffff81893170 T _raw_read_trylock
> ffffffff818931a0 T _raw_write_trylock
> ffffffff818931d0 T _raw_read_lock_irqsave
> ffffffff81893200 T _raw_write_lock_irqsave
> ffffffff81893230 T _raw_spin_lock_bh
> ffffffff81893270 T _raw_spin_lock_irqsave
> ffffffff818932c0 T _raw_write_lock
> ffffffff818932e0 T _raw_write_lock_irq
> ffffffff81893310 T _raw_write_lock_bh
> ffffffff81893340 T _raw_spin_trylock
> ffffffff81893380 T _raw_read_lock
> ffffffff818933a0 T _raw_read_lock_irq
> ffffffff818933c0 T _raw_read_lock_bh
> ffffffff818933f0 T _raw_spin_lock
> ffffffff81893430 T _raw_spin_lock_irq
> ffffffff81893450
>
> That's 976 bytes total (0x81893450 - 0x81893080 = 0x3d0) with
> 16-byte alignment.
>
> With function packing, they compress into:
>
> ffffffff817f2458 T _raw_spin_unlock_irqrestore
> ffffffff817f2463 T _raw_read_unlock_irqrestore
> ffffffff817f2472 T _raw_write_unlock_irqrestore
> ffffffff817f247d T _raw_read_unlock_bh
> ffffffff817f2498 T _raw_write_unlock_bh
> ffffffff817f24af T _raw_spin_unlock_bh
> ffffffff817f24c6 T _raw_read_trylock
> ffffffff817f24ef T _raw_write_trylock
> ffffffff817f250e T _raw_spin_lock_bh
> ffffffff817f2536 T _raw_read_lock_irqsave
> ffffffff817f255e T _raw_write_lock_irqsave
> ffffffff817f2588 T _raw_spin_lock_irqsave
> ffffffff817f25be T _raw_spin_trylock_bh
> ffffffff817f25f6 T _raw_spin_trylock
> ffffffff817f2615 T _raw_spin_lock
> ffffffff817f2632 T _raw_spin_lock_irq
> ffffffff817f2650 T _raw_write_lock
> ffffffff817f266b T _raw_write_lock_irq
> ffffffff817f2687 T _raw_write_lock_bh
> ffffffff817f26ad T _raw_read_lock
> ffffffff817f26c6 T _raw_read_lock_bh
> ffffffff817f26ea T _raw_read_lock_irq
> ffffffff817f2704
>
> That's 684 bytes (0x817f2704 - 0x817f2458 = 0x2ac) - a stark
> difference that will show up as a smaller I$ footprint even if
> usage is sparse.
>
> OTOH their ordering is far from ideal: for example the rarely used
> 'trylock' variants are mixed into the middle, and the way rwlock
> and spinlock ops are interleaved isn't very pretty either.
>
> So we could reduce alignment for just the locking APIs, via per-.o
> cflags in the Makefile, if packing otherwise hurts the common
> case.
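
Btw., such a per-.o override could be as simple as something like
this in kernel/locking/Makefile (just a sketch, assuming the hot
_raw_* APIs all live in spinlock.o and that the alignment comes from
-falign-functions - not a tested patch):

  # Keep the locking fast paths densely packed, overriding the
  # kernel-wide function alignment for this one object file:
  CFLAGS_spinlock.o += -falign-functions=1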

Another datapoint: despite the widespread stereotype of bloated,
stack-guzzling kernel functions, most kernel functions are pretty
small and fit into one or two cachelines.

Here's a histogram of x86-64 defconfig function sizes (in bytes, cut
off at around 1200 bytes):

600 ++----------------+------------------+-----------------+-----------------+------------------+----------------++
+ + + + kernel function sizes histogram A +
| |
| |
| |
| |
500 +AA ++
| |
A |
| |
| |
| |
400 ++ ++
| |
|A |
| |
| |
300 +AAA ++
| A |
|A |
|A |
| |
| AAA |
200 +AAAA ++
| AAAAA |
| AAAAAA |
| A AAAA A |
|A A AAAAAAA |
| AAAAA |
100 +A AAAA ++
| AAAAAAA |
| AAAAAAAAAA |
| AAAAAAAAAAAA |
|A AAAAAAAAAAAA AAAA |
+ + AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA A A A + A + +
0 AA----------------+------------------AA-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------++
0 200 400 600 800 1000 1200


Zoomed to just the first 200 bytes:

600 ++--------------------------+--------------------------+---------------------------+-------------------------++
+ + + kernel function sizes histogram A +
| |
| |
| |
| |
500 ++ A A ++
| |
A |
| |
| |
| |
400 ++ ++
| |
| A |
| |
| |
300 ++ A A A ++
| A |
| A |
| A |
| |
| A A A A |
200 ++ A A A AAA A ++
| A A AA A A A A |
| A A A AA AAAA A A A A |
| A A A AA AAAA AAAAAA A AA |
| A A A A A A A A A AAAA A A AAA AAA A |
| AA A A AA AAAA AAAAA AA |
100 ++ A A A A AA AAAAA AA A ++
| A AA AA AA AAA AAAAA AA A A |
| A AA AAAA AAAAAAAAA AAAAAA A A|
| A AAAAA AA AAA
| AA A |
+ + + + +
0 +AAAA-----------------------+--------------------------+---------------------------+-------------------------++
0 50 100 150 200
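
( In case anyone wants to reproduce these: a pipeline along the
  lines of the one below should produce a similar plot. This is a
  sketch - the gawk-only strtonum(), the 8-byte bin width and the
  'dumb' terminal size are my choices, not necessarily what
  generated the plots above: )

  # dump the size (in bytes) of every text symbol in vmlinux:
  nm --print-size --size-sort vmlinux | \
          awk '$3 ~ /^[tT]$/ { print strtonum("0x" $2) }' > sizes.txt

  # render an ASCII histogram with 8-byte bins:
  gnuplot -e "set terminal dumb 115 30; \
              bin(x, w) = w*floor(x/w); \
              plot [0:1200] 'sizes.txt' using (bin(\$1, 8)):(1.0) \
                   smooth freq with points"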


The median function size is around one (64-byte) cacheline, ~80% of
functions fit into two cachelines, and there's a big peak of very
small functions that make up something like 20% of all functions -
so packing makes sense unless the set of functions actually being
executed is more fragmented than my rough estimate of around 50%.

Which is still a significant hurdle.

Plus there are a truckload of other details, like how exactly the flow
of execution goes within the function (which part is hot). For example,
locking APIs tend to return early after a relatively straight path of
execution, making the effective hot function length even smaller than
the median.
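
To illustrate that last point, here's a (user-space, illustrative)
sketch of the typical shape - not the kernel's actual implementation:
the hot part is the initial compare-and-swap plus the early return,
so only the first few bytes of the function normally occupy I$:

  #include <stdatomic.h>

  static void my_lock_slowpath(atomic_int *lock)
  {
          /* Cold tail: only executed under contention. */
          while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
                  ;
  }

  static void my_spin_lock(atomic_int *lock)
  {
          int expected = 0;

          /* Hot path: one CAS and an early return - this is the
           * part that dominates the effective I$ footprint. */
          if (atomic_compare_exchange_strong_explicit(lock, &expected, 1,
                          memory_order_acquire, memory_order_relaxed))
                  return;

          my_lock_slowpath(lock);
  }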

All in all: there's no clear winner, so there's no way around actually
measuring it ... ;-)
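
( Something like the perf invocation below would be a starting point
  for that - the exact icache event names vary by CPU, so treat this
  as a sketch: )

  # compare a packed vs. an aligned kernel under the same workload:
  perf stat -e instructions,L1-icache-load-misses -a sleep 10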

Thanks,

Ingo