Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Denys Vlasenko
Date: Wed May 20 2015 - 07:31:11 EST


On 05/19/2015 11:38 PM, Ingo Molnar wrote:
> Here's the result from the Intel system:
>
> linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
> linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
> linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
> linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
> linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)
>
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
> default kernel's I$ miss rate at 16 bytes alignment.
>
> The 128/256/512 bytes numbers show an increasing number of cache
> misses: probably due to the artificially reduced associativity of the
> cache.
>
> Surprisingly there's a rather marked improvement in elapsed time as
> well:
>
> linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
> linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
> linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
> linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
> linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
> linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
> linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
> linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
> linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )
>
> the workload got 2.5% faster - which is pretty nice! This result is 5+
> standard deviations above the noise of the measurement.
>
> Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
> performance is: markedly higher cache miss rate despite a 'smaller'
> kernel, and the workload is 16.3% slower (!).
>
> Part of the -Os picture is that the -Os kernel is executing many more
> instructions:
>
> linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
> linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
> linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
> linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
> linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
> linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
> linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
> linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
> linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )
>
> 21.2% more instructions executed ... that cannot go well.
>
> So this should be a reminder that it's the effective I$ footprint and
> the number of instructions executed that matter to performance, not
> kernel size alone. With current GCC, -Os should only be used on
> embedded systems where one is willing to make the kernel 10%+ slower,
> in exchange for a 20% smaller kernel.

Can you post your .config for the test?
If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test,
consider re-testing with it turned off.
You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
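
For reference, here is a minimal sketch of what that option changes (a toy
macro "my_inline" of my own, not the kernel's actual compiler.h definitions):
with CONFIG_OPTIMIZE_INLINING=y, plain 'inline' is only a hint, and at -Os
GCC may decide not to inline even trivial helpers, which inflates the
executed instruction count.

/*
 * Hedged sketch, not the kernel's real compiler.h: 'my_inline' stands in
 * for what the kernel's 'inline' effectively becomes in each configuration.
 */
#ifdef CONFIG_OPTIMIZE_INLINING
# define my_inline inline                                /* hint only   */
#else
# define my_inline inline __attribute__((always_inline)) /* must inline */
#endif

static my_inline int add_one(int x)
{
	return x + 1;
}

int caller(int x)
{
	/*
	 * With the hint-only variant, building with -Os can leave this as a
	 * real call (compare "gcc -Os -S" output for both configurations);
	 * with always_inline it is folded into the caller.
	 */
	return add_one(x);
}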


> The AMD system, with a starkly different x86 microarchitecture, is
> showing similar characteristics:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
>
> 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> is much steeper than on Intel systems: cache associativity is likely
> lower on this AMD CPU.
>
> Interestingly the 1 byte alignment result is still pretty good on AMD
> systems - and I used the exact same kernel image on both systems, so
> the layout of the functions is exactly the same.
>
> Elapsed time is noisier, but shows a similar trend:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
>
> 64 bytes alignment is the sweet spot here as well: it's 3.7% faster
> than the default 16-byte alignment.

On AMD, 64 bytes wins too, yes, but by a *very* small margin.
The 8-byte and 1-byte alignments have basically the same timings,
and both take only about 0.64% longer to run:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed

I wouldn't say it's the same as on Intel. There, the difference between
64-byte alignment and no alignment at all is about five times larger than
on AMD, roughly +3% (a quick arithmetic check follows the numbers):

linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed
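
To spell out the arithmetic behind the "about five times" claim, a
back-of-the-envelope check using the numbers quoted above (a throwaway
program of mine, nothing more):

/* Throwaway check of the 64-byte vs 1-byte deltas quoted above. */
#include <stdio.h>

int main(void)
{
	double intel_64 = 7.154816369, intel_1 = 7.367139908;
	double amd_64   = 1.928409143, amd_1   = 1.940744001;
	double intel_pct = (intel_1 - intel_64) / intel_64 * 100.0;
	double amd_pct   = (amd_1   - amd_64)   / amd_64   * 100.0;

	/* Prints roughly: Intel +2.97%, AMD +0.64%, ratio 4.6x */
	printf("Intel +%.2f%%, AMD +%.2f%%, ratio %.1fx\n",
	       intel_pct, amd_pct, intel_pct / amd_pct);
	return 0;
}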

> So based on those measurements, I think we should do the exact
> opposite of my original patch that reduced alignment to 1 bytes, and
> increase kernel function address alignment from 16 bytes to the
> natural cache line size (64 bytes on modern CPUs).

> + #
> + # Allocate a separate cacheline for every function,
> + # for optimal instruction cache packing:
> + #
> + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
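
BTW, a quick (entirely unofficial) way to see how a given kernel image ended
up packed is to count how many text symbols in /proc/kallsyms start on a
64-byte boundary, e.g. with a throwaway program like this (run it with
kernel.kptr_restrict=0 so the addresses are not zeroed out):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/kallsyms", "r");
	char line[512];
	unsigned long total = 0, aligned = 0;

	if (!f) {
		perror("/proc/kallsyms");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		unsigned long long addr;
		char type;
		char name[256];

		if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
			continue;
		if (type != 't' && type != 'T')	/* text symbols only */
			continue;
		total++;
		if ((addr & 63) == 0)		/* 64-byte aligned?  */
			aligned++;
	}
	fclose(f);

	printf("%lu of %lu text symbols are 64-byte aligned\n",
	       aligned, total);
	return 0;
}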

How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead?

This avoids the pathological case where a function starting just a few bytes
after a 64-byte boundary gets aligned all the way up to the next one, wasting
~60 bytes of padding.
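
As I read the GCC documentation, a non-power-of-two value like 33 still
rounds the alignment up to the next power of two (64), but padding is only
inserted when at most 32 bytes are needed, so a function that lands a few
bytes past a boundary stays where it is. A toy file to see the difference
(my own example, not part of the patch):

/*
 * Toy demo, not part of the patch: compile twice and compare the
 * function start offsets, e.g.
 *
 *   gcc -O2 -falign-functions=64 -c align_demo.c && objdump -t align_demo.o
 *   gcc -O2 -falign-functions=33 -c align_demo.c && objdump -t align_demo.o
 *
 * With =64 every function is padded out to a 64-byte boundary; with =33
 * (i.e. 64/2 + 1) the boundary is only taken when it costs at most 32
 * padding bytes.
 */
int helper_a(int x)
{
	return x * 3 + 1;
}

int helper_b(int x)
{
	return x * 5 - 2;
}
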
--
vda
