[PATCH] x86: Align jump targets to 1 byte boundaries

From: Ingo Molnar
Date: Fri Apr 10 2015 - 08:09:19 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> So restructure the loop a bit, to get much tighter code:
>
> 0000000000000030 <mutex_spin_on_owner.isra.5>:
>   30:  55                      push   %rbp
>   31:  65 48 8b 14 25 00 00    mov    %gs:0x0,%rdx
>   38:  00 00
>   3a:  48 89 e5                mov    %rsp,%rbp
>   3d:  48 39 37                cmp    %rsi,(%rdi)
>   40:  75 1e                   jne    60 <mutex_spin_on_owner.isra.5+0x30>
>   42:  8b 46 28                mov    0x28(%rsi),%eax
>   45:  85 c0                   test   %eax,%eax
>   47:  74 0d                   je     56 <mutex_spin_on_owner.isra.5+0x26>
>   49:  f3 90                   pause
>   4b:  48 8b 82 10 c0 ff ff    mov    -0x3ff0(%rdx),%rax
>   52:  a8 08                   test   $0x8,%al
>   54:  74 e7                   je     3d <mutex_spin_on_owner.isra.5+0xd>
>   56:  31 c0                   xor    %eax,%eax
>   58:  5d                      pop    %rbp
>   59:  c3                      retq
>   5a:  66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>   60:  b8 01 00 00 00          mov    $0x1,%eax
>   65:  5d                      pop    %rbp
>   66:  c3                      retq
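
In C terms the loop above is roughly the following - a quick sketch read
off the disassembly, with made-up stand-in types and helpers so that it
compiles on its own; it is not the real kernel code (which also needs
READ_ONCE()/barriers, omitted here):

  #include <stdbool.h>

  /* Stand-in types/helpers, only to make the sketch self-contained: */
  struct task_struct { int on_cpu; };
  struct mutex { struct task_struct *owner; };

  static inline void cpu_relax(void)    { __builtin_ia32_pause(); }
  static inline bool need_resched(void) { return false; /* placeholder */ }

  bool spin_on_owner(struct mutex *lock, struct task_struct *owner)
  {
          while (lock->owner == owner) {  /* cmp %rsi,(%rdi); jne -> return true */
                  if (!owner->on_cpu)     /* mov 0x28(%rsi),%eax; test; je ...   */
                          return false;
                  cpu_relax();            /* pause                               */
                  if (need_resched())     /* the thread-flag test via %rdx       */
                          return false;
          }
          return true;                    /* the 16-byte aligned tail at 0x60    */
  }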

Btw., totally off topic, the following NOP caught my attention:

>   5a:  66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added to get 16-byte
alignment for one of the jump targets.

I realize that x86 CPU manufacturers recommend 16-byte jump target
alignments (it's in the Intel optimization manual), but the cost of
that is very significant:

      text     data     bss      dec   filename
  12566391  1617840  1089536  15273767  vmlinux.align.16-byte
  12224951  1617840  1089536  14932327  vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no extra alignment at all)
we get an almost 3% reduction in kernel text size (!) - and probably a
similar reduction in I$ footprint.
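
The knob behind this is GCC's -falign-* family (-falign-jumps,
-falign-loops, -falign-functions); the padding is easy to reproduce on
a toy function - purely illustrative, names made up, and the exact
layout depends on the GCC version:

  /* align_demo.c:
   *
   *   gcc -O2 -falign-jumps=16 -c align_demo.c && objdump -d align_demo.o
   *   gcc -O2 -falign-jumps=1  -c align_demo.c && objdump -d align_demo.o
   *
   * With =16 GCC pads the rarely-taken tail out to a 16-byte boundary
   * with multi-byte NOPs (like the nopw above); with =1 the padding
   * goes away.
   */
  int sum_or_fail(const int *p, int n)
  {
          int i, sum = 0;

          for (i = 0; i < n; i++) {
                  if (p[i] < 0)           /* rarely-taken forward branch...    */
                          return -1;      /* ...to a jump-target-only tail     */
                  sum += p[i];
          }
          return sum;
  }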

So I'm wondering: is the 16-byte jump target alignment recommendation
really worth this price? A kernel built with the patch below boots fine
and I've not measured any noticeable slowdown, but I've not tried hard.

Now, the usual justification for jump target alignment is the
following: instruction fetch and prefetch work on aligned chunks
(16-byte fetch windows, cacheline-sized prefetches), so if a forward
jump target is aligned, fetching after the jump starts cleanly at the
target instead of wasting part of the window/cacheline on bytes before
it.

But I think that argument is flawed for typical optimized kernel code
flows: forward jumps often go to 'cold' (rarely executed) pieces of
code, and aligning cold code to cache lines buys little precisely
because that code runs so rarely, while it causes collateral damage:

- the alignment padding 'spreads out' the cache footprint: it pushes
  follow-up hot code further out;

- and it hurts even 'cold' code that immediately follows 'hot' code
  (as in the case above), which could otherwise have benefited from
  the partial cacheline left over at the end of the hot code.

What do you guys think about this? I think we should seriously
consider relaxing our alignment defaults.

Thanks,

Ingo

==================================>