[RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Ingo Molnar
Date: Tue May 19 2015 - 17:38:38 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> This function packing argument fails:
>
> - for large functions that are physically fragmented
>
> - if less than half of all functions in a hot workload are
> packed together. This might be the common case in fact.
>
> - even if functions are technically 'packed' next to each other,
> this only works for small functions: larger functions typically
> are hotter near their heads, with unlikely codepaths being in
> their tails.
>
> > Size matters, but size matters mainly from an I$ standpoint, not
> > from some absolute 'big is bad" issue.
>
> Absolutely.
>
> > [...] Also, even when size matters, performance matters too. I do
> > want performance numbers. Is this measurable?
>
> Will try to measure this. I'm somewhat sceptical that I'll be able
> to measure any signal: alignment effects are very hard to measure on
> x86, especially on any realistic workload.

So I spent the better part of today trying to measure all this. As
expected it's very hard to measure such alignment effects: any real
workload that actually misses the cache tends to be rather noisy.

After a couple of fruitless attempts I wrote the test program below.

It tries to create a more or less meaningful macro-workload that
executes a lot of diverse kernel functions, while still having a
relatively low data footprint:

/*
 * Copyright (C) 2015, Ingo Molnar <mingo@xxxxxxxxxx>
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdlib.h>
#include <locale.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BUG_ON(cond) assert(!(cond))

#define DEFAULT_LOOPS 100000

int main(int argc, char **argv)
{
        const char *file_name = "./tmp-test-creat.txt";
        struct stat stat_buf;
        long loops;
        long i;
        int ret;

        setlocale(LC_ALL, "");

        if (argc > 1)
                loops = atol(argv[1]);
        else
                loops = DEFAULT_LOOPS;

        printf("# VFS-mix: creat()+stat()+mmap()+munmap()+close()+unlink()ing a file in $(pwd) %'ld times ... ", loops);
        fflush(stdout);

        for (i = 0; i < loops; i++) {
                int fd;

                fd = creat(file_name, 00700);
                BUG_ON(fd < 0);

                ret = lstat(file_name, &stat_buf);
                BUG_ON(ret != 0);

                ret = lseek(fd, 4095, SEEK_SET);
                BUG_ON(ret != 4095);

                close(fd);

                /* O_CREAT needs an explicit mode argument: */
                fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC, 00700);
                BUG_ON(fd < 0);

                {
                        char c = 1;

                        ret = write(fd, &c, 1);
                        BUG_ON(ret != 1);
                }

                {
                        char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

                        BUG_ON(mmap_buf == (void *)-1L);

                        mmap_buf[0] = 1;

                        ret = munmap(mmap_buf, 4096);
                        BUG_ON(ret != 0);
                }

                close(fd);

                ret = unlink(file_name);
                BUG_ON(ret != 0);
        }

        printf("done.\n");

        return 0;
}
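
(The test builds with plain gcc - e.g., assuming the source is saved
as test-vfs.c to match the ./test-vfs binary used below:

  gcc -O2 -Wall -o test-vfs test-vfs.c
  ./test-vfs 100000

The loop-count argument is optional and defaults to 100,000 iterations.)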

This is a pretty silly test in itself that re-creates the same file
again and again, does some VFS and MM ops on it, then deletes it.
But the nice thing about it is that in every iteration, if run on a
journalling filesystem such as ext4, it touches:

- the VFS
- various filesystem low level functions
- various IO/block layer functions
- low level IO driver
- memory allocators (slub and page allocator)
- IRQ handling
- page fault handling
- mmap() paths
- syscall entry/exit path
- scheduling
- the RCU subsystem
- the workqueue subsystem
- the perf events subsystem

That's a ton of activity: running this on ext4 executes around 1,500
unique kernel functions (!) and around 118,000 instructions per
iteration - with a relatively small data footprint.

The profile is pretty flat, with almost 1,000 functions above the
0.01% overhead threshold.

This test workload has a non-trivial I$ footprint: with ~1,500
functions and a median kernel function size of around 150 bytes,
roughly 220k of text gets executed - well beyond L1 instruction cache
sizes.
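
(The unique-function count can be reproduced with something along
these lines - a rough sketch, not necessarily the exact commands
behind the numbers above:

  perf record -a -C 0 -- taskset 1 ./test-vfs 10000
  perf report --stdio --sort symbol | grep -c '\[k\]'

i.e. count the distinct kernel symbols that show up in the profile
at all.)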

So this is a test that has a realistic instruction cache footprint,
but micro-benchmark-like data footprint properties - which reduces
noise while still exercising the I$ effect we seek.

To further reduce measurement noise, I ran this on a system with SSD
disks, so the IRQ and block IO workload is almost synchronous and the
CPU is fully utilized.

Then I bound the workload to CPU#0 with SCHED_FIFO priority, and ran
it the following way:

  taskset 1 chrt -f 99 perf stat -a -C 0 --pre sync --repeat 10 \
          -e L1-icache-load-misses \
          -e instructions \
          -e context-switches \
          ./test-vfs

Then I built 11 kernels, each with different function alignment
settings:

fomalhaut:~/linux> size linux-*/vmlinux | sort -n
text data bss dec filename
12150606 2565544 1634304 16350454 linux-____CC_OPTIMIZE_FOR_SIZE=y/vmlinux
13534242 2579560 1634304 17748106 linux-falign-functions=__1-bytes/vmlinux
13554530 2579560 1634304 17768394 linux-falign-functions=__2-bytes/vmlinux
13590946 2579560 1634304 17804810 linux-falign-functions=__4-bytes/vmlinux
13658786 2579560 1634304 17872650 linux-falign-functions=__8-bytes/vmlinux
13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
14075330 2579560 1634304 18289194 linux-falign-functions=_32-bytes/vmlinux
14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux
15980994 2579560 1634304 20194858 linux-falign-functions=128-bytes/vmlinux
19038018 2591848 1634304 23264170 linux-falign-functions=256-bytes/vmlinux
26391106 2591848 1634304 30617258 linux-falign-functions=512-bytes/vmlinux

(I've added the -Os kernel as a comparison.)
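
(For reference, such a series of kernels can be produced by
overriding the alignment flag per build - a rough sketch; the output
directory names are just illustrative and each one is assumed to
already contain the shared .config:

  for align in 1 2 4 8 16 32 64 128 256 512; do
          make -j64 O=obj-falign-$align KCFLAGS="-falign-functions=$align" bzImage
  done

KCFLAGS is added after the arch flags, so its -falign-functions
setting should be the one that wins - worth double-checking with
'make V=1'.)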

A single run results in output like:

 Performance counter stats for 'system wide' (10 runs):

        649,398,972      L1-icache-load-misses        ( +- 0.09% )  (100.00%)
     11,875,234,916      instructions                 ( +- 0.00% )
            300,038      context-switches             ( +- 0.00% )

        7.198533444 seconds time elapsed              ( +- 0.28% )

The 'instructions' and 'context-switches' metrics can be used to
cross-check that the run was undisturbed and that it executed exactly
the same workload as other runs.

'time elapsed' is rather noisy, while 'L1-icache-load-misses' is the
most interesting metric here: it's a proxy for the effective I$
footprint of the workload - if the footprint is larger, the number of
misses goes up; if it's tighter, the number of misses goes down.

I ran this on two systems:

... an Intel system:

vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping : 7

cache size : 25600 KB
cache_alignment : 64

... and an AMD system:

vendor_id : AuthenticAMD
cpu family : 21
model : 1
model name : AMD Opteron(tm) Processor 6278
stepping : 2

cache size : 2048 KB
cache_alignment : 64

The CPU frequencies were set to the max on both systems, to reduce
power throttling noise and run-to-run skew.
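
(E.g. via something like:

  cpupower frequency-set -g performance

on each system - the exact method depends on the cpufreq driver in
use.)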

The AMD system did not have SSDs, so there I used tmpfs: this reduced
the cache footprint to about 30% of that of the Intel system. The
alignment effect was still borderline measurable.

I ran the test (each a --repeat 10 measurement) 10 times on the AMD
system, i.e. 100 runs, and 3 times on the Intel system, i.e. 30 runs -
the Intel runs took longer because the workload touches ext4 on the
SSD.

I picked the best result from the tests, to reduce noise from workload
perturbations and from data cache layout effects. (If I take the
average values then the effect is still visible, but noisier.)
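
(The sorted per-kernel tables below are essentially just the saved
perf output grepped and sorted, roughly:

  grep L1-icache-load-misses linux-*/res.txt | tr -d , | sort -t: -k2 -n

modulo keeping the thousands separators for readability.)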

Here's the result from the Intel system:

linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)

The optimal I$ miss count is at 64 bytes alignment: the vanilla
kernel's 16 bytes alignment produces about 9% more I$ misses.

The 128/256/512 bytes numbers show increasing cache miss counts:
probably because aligning every function start to a large power of
two makes function entry points compete for the same small subset of
cache sets, which artificially reduces the effective associativity of
the cache.

Surprisingly there's a rather marked improvement in elapsed time as
well:

linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )

the workload got 2.5% faster - which is pretty nice! This result is 5+
standard deviations above the noise of the measurement.

Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
performance is: markedly higher cache miss rate despite a 'smaller'
kernel, and the workload is 16.3% slower (!).

Part of the -Os picture is that the -Os kernel executes many more
instructions:

linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )

Around 21% more instructions executed ... that cannot go well.

So this should be a reminder that it's the effective I$ footprint and
the number of instructions executed that matter to performance, not
kernel size alone. With current GCC, -Os should only be used on
embedded systems where one is willing to make the kernel 10%+ slower,
in exchange for a 20% smaller kernel.

The AMD system, with a starkly different x86 microarchitecture, is
showing similar characteristics:

linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)

64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
is much steeper than on Intel systems: cache associativity is likely
lower on this AMD CPU.

Interestingly the 1 byte alignment result is still pretty good on AMD
systems - and I used the exact same kernel image on both systems, so
the layout of the functions is exactly the same.

Elapsed time is noisier, but shows a similar trend:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )

64 bytes alignment is the sweet spot here as well: it's 3.7% faster
than the default 16 bytes alignment.

So based on these measurements, I think we should do the exact
opposite of my original patch (which reduced the alignment to 1
byte), and increase kernel function address alignment from 16 bytes
to the natural cache line size (64 bytes on modern CPUs).

The cost is a 6.2% larger kernel image:

13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux

But this is basically just a RAM cost: it does not increase the
effective cache footprint (in fact it decreases it), so it's likely a
win on everything but RAM-starved embedded systems.
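
(A quick sanity check of a built image is to look at the text symbol
addresses in vmlinux, roughly:

  nm vmlinux | awk '$2 ~ /^[tT]$/ { total++; if ($1 !~ /(00|40|80|c0)$/) off++ }
      END { printf "%d of %d text symbols are not 64-byte aligned\n", off, total }'

Asm entry points and other special symbols are not covered by
-falign-functions, so some residue of unaligned symbols is expected.)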

In the future we could perhaps still pack small functions of certain
subsystems (such as hot locking functions).

The patch below implements this in the build system, using
X86_L1_CACHE_SIZE to derive the function alignment; for the time
being it is limited to 64-bit kernels.

Not-Yet-Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
 arch/x86/Kconfig.cpu | 17 +++++++++++++++++
 arch/x86/Makefile    |  6 ++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6983314c8b37..9eacb85efda9 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -304,6 +304,23 @@ config X86_L1_CACHE_SHIFT
 	default "4" if MELAN || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
 
+config X86_L1_CACHE_SIZE
+	int
+	default "16" if X86_L1_CACHE_SHIFT=4
+	default "32" if X86_L1_CACHE_SHIFT=5
+	default "64" if X86_L1_CACHE_SHIFT=6
+	default "128" if X86_L1_CACHE_SHIFT=7
+
+#
+# We optimize function alignment on 64-bit kernels,
+# to pack the instruction cache optimally:
+#
+config X86_FUNCTION_ALIGNMENT
+	int
+	default "16" if !64BIT
+	default X86_L1_CACHE_SIZE if 64BIT && !CC_OPTIMIZE_FOR_SIZE
+	default "1" if 64BIT && CC_OPTIMIZE_FOR_SIZE
+
 config X86_PPRO_FENCE
 	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || MGEODEGX1
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57996ee840dd..45ebf0bbe833 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,12 @@ else
         # Pack loops tightly as well:
         KBUILD_CFLAGS += -falign-loops=1
 
+        #
+        # Allocate a separate cacheline for every function,
+        # for optimal instruction cache packing:
+        #
+        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
--