[PATCH v9 0/3] mm: Randomize free memory

From: Dan Williams
Date: Wed Jan 30 2019 - 00:14:50 EST


Changes since v8 [1]:
* Rework shuffle call sites from 3 locations to 2, i.e. one for the
initial memory online path, and one for the hotplug memory online path.
This simplification results in an incremental diffstat of "7 files
changed, 31 insertions(+), 82 deletions(-)". The consolidation of the
initial shuffle in page_alloc_init_late() leads to a beneficial increase
in the number of shuffles performed in a qemu-VM test. (Michal)

* Drop the CONFIG_SHUFFLE_PAGE_ORDER configuration option. If it turns out
that there is a use case to make the shuffle-order dynamic, that can be
addressed in a follow-on update, but no such case is known at present.
(Michal)

* Replace lkml.org links with lkml.kernel.org, where possible.
Unfortunately lkml.kernel.org failed to capture Mel's feedback, so the
lkml.org link remains for that one. (Michal)

* Fix definition of pfn_present() in the !sparsemem case. (Michal)

* Collect Michal's ack on patch 2, and open code rmv_page_order() in its
only caller.

[1]: https://lkml.kernel.org/r/154767945660.1983228.12167020940431682725.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

---

Hi Andrew,

As you can see the series is improved thanks to Michal's review. Please
await his ack, but I believe this version addresses all pending
feedback.

Still based on v5.0-rc1 for my tests, but it applies to and builds
cleanly on current linux-next.

---

Quote Patch 1:

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has previously been exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [2].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [3], and I copy it here:

It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software

and this abandoned patch series proposed that for the kernel:
https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@xxxxxxxxx

Dan's patch series doesn't attempt to ensure buffers won't conflict, but
also reduces the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel). That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
measurement to the Node Health Check that follows each job;
nodes are rebooted whenever their measured memory bandwidth
falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.
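
To make the "peaks and valleys" argument concrete, here is a minimal
userspace sketch, not kernel code and not the implementation in this
series. It assumes a direct-mapped cache in which a page frame number
maps to slot pfn % CACHE_PAGES. The sizes, the xorshift32 generator, and
every function name below are hypothetical and chosen only for
illustration. It compares how many distinct cache slots an allocation
covers when the free list hands out aliasing pages back to back versus
when the free list has been put through a Fisher-Yates shuffle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes only: far memory is 8x the near-memory cache. */
#define CACHE_PAGES 256u
#define MEM_PAGES   (8u * CACHE_PAGES)

/* Deterministic xorshift32 generator so that runs are repeatable. */
static uint32_t prng_state = 2463534242u;

static uint32_t prng_next(void)
{
	prng_state ^= prng_state << 13;
	prng_state ^= prng_state >> 17;
	prng_state ^= prng_state << 5;
	return prng_state;
}

/* Fisher-Yates shuffle of a modeled free list of page frame numbers. */
static void shuffle_free_list(uint32_t *pfns, uint32_t n)
{
	for (uint32_t i = n - 1; i > 0; i--) {
		uint32_t j = prng_next() % (i + 1);
		uint32_t tmp = pfns[i];

		pfns[i] = pfns[j];
		pfns[j] = tmp;
	}
}

/* Count the distinct direct-mapped cache slots hit by the first n pfns. */
static uint32_t slots_used(const uint32_t *pfns, uint32_t n)
{
	unsigned char seen[CACHE_PAGES];
	uint32_t used = 0;

	memset(seen, 0, sizeof(seen));
	for (uint32_t i = 0; i < n; i++) {
		uint32_t slot = pfns[i] % CACHE_PAGES;

		if (!seen[slot]) {
			seen[slot] = 1;
			used++;
		}
	}
	return used;
}

/*
 * Adversarial free list order: all pages that alias to the same cache
 * slot are handed out consecutively. pfns must hold MEM_PAGES entries.
 */
static void fill_aliased(uint32_t *pfns)
{
	uint32_t k = 0;

	for (uint32_t slot = 0; slot < CACHE_PAGES; slot++)
		for (uint32_t pfn = slot; pfn < MEM_PAGES; pfn += CACHE_PAGES)
			pfns[k++] = pfn;
	assert(k == MEM_PAGES);
}
```

With these sizes, a 64-page allocation taken from the list produced by
fill_aliased() lands in only 8 slots, because each slot is shared by
MEM_PAGES / CACHE_PAGES = 8 page frames. After shuffle_free_list() the
same 64-page allocation should spread over many more slots. Once an
allocation exceeds CACHE_PAGES, no ordering avoids conflicts, which is
the oversubscribed case described above.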

See patch 1 for more details.

[2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[3]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

---

Dan Williams (3):
      mm: Shuffle initial free memory to improve memory-side-cache utilization
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists


 include/linux/list.h     |   17 ++++
 include/linux/mm.h       |    3 -
 include/linux/mm_types.h |    3 +
 include/linux/mmzone.h   |   62 ++++++++++++++
 include/linux/shuffle.h  |   57 +++++++++++++
 init/Kconfig             |   23 +++++
 mm/Makefile              |    7 +-
 mm/compaction.c          |    4 -
 mm/memblock.c            |    1
 mm/memory_hotplug.c      |    3 +
 mm/page_alloc.c          |   85 +++++++++----------
 mm/shuffle.c             |  204 ++++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 419 insertions(+), 50 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c