Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Denys Vlasenko
Date: Wed May 20 2015 - 08:22:55 EST

Next message: tip-bot for Jiri Olsa: "[tip:perf/core] perf tools: Fix dwarf-aux.c compilation on i386"
Previous message: tip-bot for Arnaldo Carvalho de Melo: "[tip:perf/core] perf cgroup: Use atomic.h for refcounting"
In reply to: Linus Torvalds: "Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions"
Next in thread: Ingo Molnar: "Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions"

On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>>
>> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
>> default kernel's I$ miss rate at 16 bytes alignment.
>
> Ok, these numbers looks reasonable (which is, of course, defined as
> "meets Linus' expectations"), so I like it.
>
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.
>
> From your numbers the 64-byte alignment definitely makes sense in
> general, but I really think it would be much nicer if we could get
> something like "align functions to their power-of-two size rounded up,
> up to a maximum of 64 bytes"

Well, that would be a bit hard to implement for gcc, at least in its traditional
mode where it emits assembly source, not machine code.

However, not all is lost.

I was thinking about Ingo's AMD results:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed

AMD is almost perfect. Having no alignment at all still works
very well. Almost perfect. Where does the "almost" come from?

I bet it comes from the small fraction of functions which were unlucky
enough to have their first instruction split by a 64-byte boundary.
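
To make that corner case concrete, here is a minimal sketch in Python
(used only for illustration). The helper name first_insn_split and the
CACHE_LINE constant are my own; the 64-byte line size is an assumption
taken from the measurements quoted above.

```python
CACHE_LINE = 64  # bytes; assumed I$ line size


def first_insn_split(func_addr, insn_len, line=CACHE_LINE):
    """Return True if the first instruction straddles a cache-line boundary."""
    first_line = func_addr // line                   # line holding the first byte
    last_line = (func_addr + insn_len - 1) // line   # line holding the last byte
    return first_line != last_line
```

For example, a 5-byte instruction at offset 62 within a line occupies
bytes 62..66 and is split, while the same instruction at offset 59
(bytes 59..63) is not.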

If we could avoid just this corner case, that would help a lot.

And GNU as has a means to do that!
See https://sourceware.org/binutils/docs/as/P2align.html

.p2align N1,FILL,N3

"The third expression is also absolute, and is also optional.
If it is present, it is the maximum number of bytes that should
be skipped by this alignment directive."

So what we need is to put something like ".p2align 6,,7"
(the first operand is log2 of the alignment, so 6 means 64 bytes)
before every function.
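
A rough model of what the three-operand form does, as I read the GNU as
documentation linked above: the first operand is the base-2 log of the
alignment (6 for 64 bytes), and if reaching the boundary would take more
than the given maximum number of padding bytes, no padding is emitted at
all. The function name p2align below is just a label for this sketch,
not an API of any tool.

```python
def p2align(addr, log2_align, max_skip=None):
    """Address that follows a '.p2align log2_align,,max_skip' placed at addr."""
    align = 1 << log2_align
    pad = -addr % align          # bytes needed to reach the next boundary
    if max_skip is not None and pad > max_skip:
        return addr              # too much padding required: do nothing
    return addr + pad
```

With log2_align=6 and max_skip=7, a function that would start 7 bytes
before a 64-byte boundary is pushed to the boundary, while one starting
8 bytes before it stays put. Either way, at least 8 bytes of the
function's start end up in a single 64-byte line.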

(
Why 7?

defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
65 48 c7 05 44 3c 00 7f 00 00 00 00
movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least the first seven bytes do not cross
a 64-byte boundary would cover >98% of all functions.
)
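
The ">98%" figure can be rechecked from the counts listed above. This is
only arithmetic on those numbers; the names TOTAL, AT_LEAST and
covered_fraction are made up for the sketch.

```python
TOTAL = 42141  # functions in defconfig vmlinux, from the list above

# AT_LEAST[n] = number of functions whose first insn is n or more bytes long
AT_LEAST = {5: 6923, 6: 5841, 7: 5095, 8: 786, 9: 548, 10: 375, 11: 73, 12: 1}


def covered_fraction(protected_bytes):
    """Fraction of functions whose first insn fits in protected_bytes bytes."""
    if protected_bytes >= 12:
        return 1.0  # the longest first insn seen is 12 bytes
    if protected_bytes + 1 not in AT_LEAST:
        raise ValueError("no count listed for that length")
    too_long = AT_LEAST[protected_bytes + 1]
    return 1 - too_long / TOTAL
```

Protecting seven bytes leaves out only the 786 functions whose first
instruction is 8 or more bytes long, which is a bit under 2% of 42141.
Protecting only six bytes would leave out 5095 functions, about 12%.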

gcc can't do that right now. With -falign-functions=N,
it emits ".p2align log2(next_power_of_2(N)),,N-1"

We need to make it just a tiny bit smarter.
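
To spell out the difference, a small sketch of the two directives.
gcc_align_directive models the behaviour described above (alignment
rounded up to a power of two, expressed as its base-2 log, with a
maximum skip of N-1). wanted_directive is purely hypothetical: it shows
the kind of output being asked for, not anything gcc has an option for
today. Both function names are invented for this example.

```python
def gcc_align_directive(n):
    """Directive described above for -falign-functions=n (n >= 2)."""
    log2 = (n - 1).bit_length()  # log2 of n rounded up to a power of two
    return f".p2align {log2},,{n - 1}"


def wanted_directive(line_bytes, max_skip):
    """Hypothetical: align to line_bytes, but pad at most max_skip bytes."""
    log2 = (line_bytes - 1).bit_length()
    return f".p2align {log2},,{max_skip}"
```

So -falign-functions=64 corresponds to ".p2align 6,,63", which may pad
up to 63 bytes, whereas the request here is for ".p2align 6,,7", where
the maximum skip is decoupled from the alignment.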

> We'd need toolchain help to do saner alignment.

Yep.
I'm going to create a gcc BZ with a feature request,
unless you disagree with my musings above.

--
vda

