Re: [PATCH] x86, vmlinux.lds: Add debug option to force all data sections aligned

From: Denys Vlasenko
Date: Wed Feb 16 2022 - 08:35:19 EST


On 2/16/22 9:28 AM, Feng Tang wrote:
0day has reported many strange performance changes (regressions or
improvements) in which there was no obvious relation between the culprit
commit and the benchmark at first glance, which leads people to suspect
that the test itself is wrong.

Upon further checking, many of these cases are caused by changes to the
alignment of kernel text or data. Since the kernel's text and data are all
linked together, a change in one area can affect the alignment of other areas.

To help quickly identify whether a strange performance change is caused
by _data_ alignment, add a debug option that forces the data sections from
all .o files to be aligned on THREAD_SIZE, so that a change in one area
won't affect other modules' data alignment.

We have used this option to check some strange kernel changes [1][2][3],
and those performance changes disappeared after enabling it, which proved
they were data-alignment related. Besides these publicly reported cases,
0day has recently found other similar cases, and this option has been
actively used by 0Day for analyzing strange performance changes.
...
+ .data : AT(ADDR(.data) - LOAD_OFFSET)
+#ifdef CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED
+ /* Use the biggest alignment of below sections */
+ SUBALIGN(THREAD_SIZE)
+#endif

"Align every input section to 4096 bytes" ?

This is way, way, WAY too much. The added padding will be very wasteful.

Performance differences are likely to be caused by cacheline alignment.
Factoring in an odd hardware prefetcher grabbing an additional
cacheline after every accessed one, I'd say alignment to 128 bytes
(on x86) should suffice for almost any scenario. Even 64 bytes
would almost always work fine.
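
To put a number on it, here is a tiny userspace sketch (not part of the
patch, names made up) of what 128-byte alignment buys at the object level:
each object then owns a full cacheline pair, so even a next-line prefetch
stays inside the object's own padding.

#include <stdio.h>

/* 64 bytes is one x86 cacheline; 128 covers the accessed line plus the
   one a next-line prefetcher may pull in. */
struct hot_data {
	unsigned long a;
	unsigned long b;
} __attribute__((aligned(128)));

static struct hot_data x, y;	/* each lands in its own 128-byte slot */

int main(void)
{
	printf("x at %p, y at %p\n", (void *)&x, (void *)&y);
	return 0;
}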

The hardware prefetcher grabbing an additional cacheline was seen
adversely affecting locking performance in a structure - developers
thought the two locks were not in the same cacheline, but because of
this "optimization" they effectively were, and thus they bounced
between CPUs.

(1) A linker script can't help with this, since it was a struct layout
issue, not a section alignment issue (see the layout sketch below).
(2) This "optimization" (the unconditional fetch of the next cacheline)
might be bad enough to warrant detecting and disabling it on boot
(see the MSR sketch below).
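
For (1), the kind of layout fix meant above looks roughly like this -
a userspace sketch with made-up names, using pthread spinlocks as a
stand-in for kernel ones (in the kernel one would reach for
____cacheline_aligned_in_smp, or explicit 128-byte alignment if the
next-line prefetch is a concern):

#include <pthread.h>

/* Problem layout: lock_a and lock_b share a cacheline (or sit in two
   adjacent lines that the prefetcher drags in together), so they
   bounce between CPUs. */
struct bad_layout {
	pthread_spinlock_t lock_a;
	pthread_spinlock_t lock_b;
};

/* Fix at the struct level: force each lock into its own 128-byte
   region, so neither a shared line nor a prefetched neighbour line
   couples them. */
struct good_layout {
	pthread_spinlock_t lock_a __attribute__((aligned(128)));
	pthread_spinlock_t lock_b __attribute__((aligned(128)));
};

int main(void)
{
	struct good_layout g;

	pthread_spin_init(&g.lock_a, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&g.lock_b, PTHREAD_PROCESS_PRIVATE);
	return 0;
}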
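
And for (2), until the kernel does that, one can already experiment from
userspace. The following is only a sketch and rests on an assumption not
taken from this thread, which should be double-checked against Intel's
prefetcher-control documentation: that MSR 0x1a4 bit 1 controls the L2
adjacent-cacheline prefetcher on the affected cores. It needs root and
the msr driver, and would have to be repeated per CPU; only CPU 0 is shown.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed: 0x1a4 is Intel's MISC_FEATURE_CONTROL MSR and bit 1
   disables the L2 adjacent cacheline prefetcher.  Verify against the
   SDM before relying on this. */
#define PREFETCH_CTL_MSR	0x1a4
#define ADJ_LINE_DISABLE	(1ULL << 1)

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDWR);	/* needs CONFIG_X86_MSR and root */

	if (fd < 0 || pread(fd, &val, sizeof(val), PREFETCH_CTL_MSR) != sizeof(val)) {
		perror("rdmsr");
		return 1;
	}
	val |= ADJ_LINE_DISABLE;	/* setting the bit disables the prefetcher */
	if (pwrite(fd, &val, sizeof(val), PREFETCH_CTL_MSR) != sizeof(val)) {
		perror("wrmsr");
		return 1;
	}
	close(fd);
	return 0;
}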