[RFC PATCH v2 0/8] lru_lock scalability and SMP list functions

From: Daniel Jordan
Date: Mon Sep 10 2018 - 20:43:16 EST


Hi,

This is a work-in-progress of what I presented at LSF/MM this year[0] to
greatly reduce contention on lru_lock, allowing it to scale on large systems.

This is completely different from the lru_lock series posted last January[1].

I'm hoping for feedback on the overall design and general direction as I do
more real-world performance testing and polish the code. Is this a workable
approach?

Thanks,
Daniel

---

Summary: lru_lock can be one of the hottest locks in the kernel on big
systems. It guards too much state, so introduce new SMP-safe list functions to
allow multiple threads to operate on the LRUs at once. The SMP list functions
are provided in a standalone API that can be used in other parts of the kernel.
When lru_lock and zone->lock are both fixed, the kernel can do up to 73.8% more
page faults per second on a 44-core machine.

---

On large systems, lru_lock can become heavily contended in memory-intensive
workloads such as decision support, applications that manage their memory
manually by allocating and freeing pages directly from the kernel, and
workloads with short-lived processes that force many munmap and exit
operations. lru_lock also inhibits scalability in many of the MM paths that
could be parallelized, such as freeing pages during exit/munmap and inode
eviction.

The problem is that lru_lock is too big of a hammer. It guards all the LRUs in
a pgdat's lruvec, needlessly serializing add-to-front, add-to-tail, and delete
operations that are done on disjoint parts of an LRU, or even completely
different LRUs.
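
For reference, here is roughly the state involved, condensed from 4.17's
include/linux/mmzone.h with most fields elided (a paraphrase for
illustration, not the verbatim source):

enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

struct lruvec {
        struct list_head lists[NR_LRU_LISTS];
        /* ... */
};

typedef struct pglist_data {
        /* ... */
        spinlock_t lru_lock;    /* one lock serializes all the lists above */
        struct lruvec lruvec;
        /* ... */
} pg_data_t;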

This RFC series, developed in collaboration with Yossi Lev and Dave Dice,
offers a two-part solution to this problem.

First, three new list functions are introduced to allow multiple threads to
operate on the same linked list simultaneously under certain conditions, which
are spelled out in more detail in code comments and changelogs. The functions
are smp_list_del, smp_list_splice, and smp_list_add; they do the same things
as their non-SMP-safe counterparts. These primitives may be used elsewhere in the
kernel as the need arises; for example, in the page allocator free lists to
scale zone->lock[2], or in file system LRUs[3].
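
For concreteness, the interface looks something like the sketch below; take
the exact signatures and preconditions from patches 4 and 7 rather than from
here, as this is a paraphrase:

/*
 * Sketch of the SMP list API (paraphrased; the series adds these in
 * include/linux/list.h and lib/list.c).  Each function has the same
 * effect as its non-SMP counterpart--list_del, list_splice,
 * list_add--but may be called concurrently by multiple threads on the
 * same list, subject to the preconditions documented with the
 * implementations.
 */
void smp_list_del(struct list_head *entry, struct list_head *head);
void smp_list_splice(struct list_head *list, struct list_head *head);
void smp_list_add(struct list_head *new, struct list_head *head);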

Second, lru_lock is converted from a spinlock to an rwlock. The idea is to
repurpose the rwlock as a two-mode lock: callers take it in shared (i.e.
read) mode for code using the SMP list functions, and in exclusive (i.e.
write) mode for existing code that expects exclusive access to the LRUs.
Multiple threads are allowed in under the read lock, of course, and they use
the SMP list functions to synchronize amongst themselves.

The rwlock is scaffolding to facilitate the transition from big-hammer lru_lock
as it exists today to just using the list locking primitives and getting rid of
lru_lock entirely. Such an approach allows incremental conversion of lru_lock
writers until everything uses the SMP list functions and takes the lock in
shared mode, at which point lru_lock can just go away.
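
As a sketch, the two modes would look something like this (illustrative
only; the list head and page variables are placeholders for the real call
sites):

/* Converted path: many threads may hold the lock at once. */
read_lock(&pgdat->lru_lock);
smp_list_del(&page->lru, lru_head);     /* synchronizes with other readers */
read_unlock(&pgdat->lru_lock);

/* Unconverted path: unchanged semantics, exclusive access to the LRUs. */
write_lock(&pgdat->lru_lock);
list_del(&page->lru);
write_unlock(&pgdat->lru_lock);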

This RFC series is incomplete. More, and more realistic, performance
numbers are needed; for now, I show only will-it-scale/page_fault1.
Also, there are extensions I'd like to make to the locking scheme to
handle certain lru_lock paths--in particular, those where multiple
threads may delete the same node from an LRU; the SMP list functions
currently handle only concurrent removal of _adjacent_ nodes. Finally,
the diffstat should become more favorable after I remove some of the
code duplication in patch 6 by converting the rest of the per-CPU
pagevec code in mm/swap.c to use the SMP list functions.

Series based on 4.17.


Performance Numbers
-------------------

In the artificial benchmark will-it-scale/page_fault1, each of N tasks mmaps
an anonymous 128M memory region, touches each 4K page in it, and munmaps it.
The test is run in process and thread modes on a 44-core Xeon E5-2699 v4 with
503G memory, using a 4.16 baseline kernel. The table of results below is also
graphed at:

https://raw.githubusercontent.com/danieljordan10/lru_lock-scalability/master/rfc-v2/graph.png
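
For context, each task's work is essentially the following loop (a
paraphrase of will-it-scale/page_fault1, not its verbatim source):

#include <sys/mman.h>

#define REGION_SIZE (128UL << 20)       /* 128M anonymous region */

static void page_fault1_iter(void)
{
        unsigned long off;
        char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return;
        for (off = 0; off < REGION_SIZE; off += 4096)
                p[off] = 1;             /* one minor fault per 4K page */
        munmap(p, REGION_SIZE);
}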

The lu-zone kernel refers to Aaron Lu's work from [4], the lru kernel is this
work, and the lru-lu-zone kernel contains both.

kernel (#)       ntask     proc      thr        proc     proc        thr      thr
                        speedup  speedup       pgf/s    stdev      pgf/s    stdev

baseline (1)         1                       626,944      910    625,509      756
lu-zone (2)          1    18.0%    17.6%     739,659    2,038    735,896    2,139
lru (3)              1     0.1%    -0.1%     627,490      878    625,162      770
lru-lu-zone (4)      1    17.4%    17.2%     735,983    2,349    732,936    2,640

baseline (1)         2                     1,206,238    4,012  1,083,497    4,571
lu-zone (2)          2     2.4%     1.3%   1,235,318    3,158  1,097,745    8,919
lru (3)              2     0.1%     0.4%   1,207,246    4,988  1,087,846    5,700
lru-lu-zone (4)      2     2.4%     0.0%   1,235,271    3,005  1,083,578    6,915

baseline (1)         3                     1,751,889    5,887  1,443,610   11,049
lu-zone (2)          3    -0.0%     1.9%   1,751,247    5,646  1,470,561   13,407
lru (3)              3    -0.4%     0.5%   1,744,999    7,040  1,451,507   13,186
lru-lu-zone (4)      3    -0.3%     0.2%   1,747,431    4,420  1,447,024    9,847

baseline (1)         4                     2,260,201   11,482  1,769,095   16,576
lu-zone (2)          4    -0.5%     2.7%   2,249,463   14,628  1,816,409   10,988
lru (3)              4    -0.5%    -0.8%   2,248,302    7,457  1,754,936   13,288
lru-lu-zone (4)      4    -0.8%     1.2%   2,243,240   10,386  1,790,833   11,186

baseline (1)         5                     2,735,351    9,731  2,022,303   18,199
lu-zone (2)          5    -0.0%     3.1%   2,734,270   13,779  2,084,468   11,230
lru (3)              5    -0.5%    -2.6%   2,721,069    8,417  1,970,701   14,747
lru-lu-zone (4)      5     0.0%    -0.3%   2,736,317   11,533  2,016,043   10,324

baseline (1)         6                     3,186,435   13,939  2,241,910   22,103
lu-zone (2)          6     0.7%     3.1%   3,207,879   17,603  2,311,070   12,345
lru (3)              6    -0.1%    -1.6%   3,183,244    9,285  2,205,632   22,457
lru-lu-zone (4)      6     0.2%    -0.2%   3,191,478   10,418  2,236,722   10,814

baseline (1)         7                     3,596,306   17,419  2,374,538   29,782
lu-zone (2)          7     1.1%     5.6%   3,637,085   21,485  2,506,351   11,448
lru (3)              7     0.1%    -1.2%   3,600,797    9,867  2,346,063   22,613
lru-lu-zone (4)      7     1.1%     1.6%   3,635,861   12,439  2,413,326   15,745

baseline (1)         8                     3,986,712   15,947  2,519,189   30,129
lu-zone (2)          8     1.3%     3.7%   4,038,783   30,921  2,613,556    8,244
lru (3)              8     0.3%    -0.8%   3,997,520   11,823  2,499,498   28,395
lru-lu-zone (4)      8     1.7%     4.3%   4,054,638   11,798  2,626,450    9,534

baseline (1)        11                     5,138,494   17,634  2,932,708   31,765
lu-zone (2)         11     3.0%     6.6%   5,292,452   40,896  3,126,170   21,254
lru (3)             11     1.1%    -1.1%   5,193,843   11,302  2,900,615   24,055
lru-lu-zone (4)     11     4.6%     2.4%   5,374,562   10,654  3,002,084   24,255

baseline (1)        22                     7,569,187   13,649  3,134,116   58,149
lu-zone (2)         22     3.0%     9.6%   7,799,567   97,785  3,436,117   33,450
lru (3)             22     2.9%    -0.8%   7,785,212   15,388  3,109,155   41,575
lru-lu-zone (4)     22    28.8%     7.6%   9,747,634   17,156  3,372,679   38,762

baseline (1)        33                    12,375,135   30,387  2,180,328   56,529
lu-zone (2)         33     1.9%     8.9%  12,613,466   77,903  2,373,756   30,706
lru (3)             33     2.1%     3.1%  12,637,508   22,712  2,248,516   42,622
lru-lu-zone (4)     33    19.2%     9.1%  14,749,004   25,344  2,378,286   29,552

baseline (1)        44                    13,446,153   14,539  2,365,487   53,966
lu-zone (2)         44     3.2%     7.8%  13,876,101  112,976  2,549,351   50,656
lru (3)             44     2.0%    -8.1%  13,717,051   16,931  2,173,398   46,818
lru-lu-zone (4)     44    18.6%     7.4%  15,947,859   26,001  2,540,516   56,259

baseline (1)        55                    12,050,977   30,210  2,372,251   58,151
lu-zone (2)         55     4.6%     3.2%  12,602,426  132,027  2,448,653   74,321
lru (3)             55     1.1%     1.5%  12,184,481   34,199  2,408,744   76,854
lru-lu-zone (4)     55    46.3%     3.1%  17,628,736   25,595  2,446,613   60,293

baseline (1)        66                    11,464,526  146,140  2,389,751   55,417
lu-zone (2)         66     5.7%    17.5%  12,114,164   41,711  2,806,805   38,868
lru (3)             66     0.4%    13.2%  11,510,009   74,300  2,704,305   46,038
lru-lu-zone (4)     66    58.6%    17.0%  18,185,360   27,496  2,796,004   96,458

baseline (1)        77                    10,818,432  149,865  2,634,568   49,631
lu-zone (2)         77     5.7%     4.9%  11,438,010   82,363  2,764,441   42,014
lru (3)             77     0.5%     3.5%  10,874,888   80,922  2,727,729   66,843
lru-lu-zone (4)     77    66.5%     1.4%  18,016,393   23,742  2,670,715   36,931

baseline (1)        88                    10,321,790  214,000  2,815,546   40,597
lu-zone (2)         88     5.7%     8.3%  10,913,729  168,111  3,048,392   51,833
lru (3)             88     0.1%    -3.6%  10,328,335  142,255  2,715,226   46,835
lru-lu-zone (4)     88    73.8%    -3.6%  17,942,799   22,720  2,715,442   33,006

The lru-lu-zone kernel, containing both the lru_lock and zone->lock
enhancements, outperforms kernels with either enhancement on its own. From
this it's clear that, no matter how well each lock scales individually, both
locks must be fixed to eliminate this scalability issue in page allocation
and freeing.

The thread case doesn't show much improvement because mmap_sem, not
lru_lock or zone->lock, is the limiting factor.

Low task counts show slight regressions, but these are mostly in the noise.

There's a significant speedup in the zone->lock kernels for the 1-task case,
possibly because the pages aren't merged when they're returned to the free
lists and so the cache is more likely to be warm on the next allocation.


Apart from this artificial microbenchmark, I'm experimenting with an extended
version of the SMP list locking functions (not shown here; still a work in
progress) that has allowed a commercial database using 4K pages to exit 6.3x
faster. This is with only lru_lock modified, no other kernel changes. The
speedup comes from the SMP list functions allowing the many database processes
to make concurrent mark_page_accessed calls from zap_pte_range while the shared
memory region is being freed. I'll post more about this later.

[0] https://lwn.net/Articles/753058/
[1] http://lkml.kernel.org/r/20180131230413.27653-1-daniel.m.jordan@xxxxxxxxxx
[2] http://lkml.kernel.org/r/20180509085450.3524-1-aaron.lu@xxxxxxxxx
[3] http://lkml.kernel.org/r/6bd1c8a5-c682-a3ce-1f9f-f1f53b4117a9@xxxxxxxxxx
[4] http://lkml.kernel.org/r/20180320085452.24641-1-aaron.lu@xxxxxxxxx

Daniel Jordan (8):
mm, memcontrol.c: make memcg lru stats thread-safe without lru_lock
mm: make zone_reclaim_stat updates thread-safe
mm: convert lru_lock from a spinlock_t to a rwlock_t
mm: introduce smp_list_del for concurrent list entry removals
mm: enable concurrent LRU removals
mm: splice local lists onto the front of the LRU
mm: introduce smp_list_splice to prepare for concurrent LRU adds
mm: enable concurrent LRU adds

include/linux/list.h | 3 +
include/linux/memcontrol.h | 43 ++++++--
include/linux/mm_inline.h | 28 +++++
include/linux/mmzone.h | 54 +++++++++-
init/main.c | 1 +
lib/Makefile | 2 +-
lib/list.c | 204 +++++++++++++++++++++++++++++++++++++
mm/compaction.c | 99 +++++++++---------
mm/huge_memory.c | 6 +-
mm/memcontrol.c | 53 ++++------
mm/memory_hotplug.c | 1 +
mm/mlock.c | 10 +-
mm/mmzone.c | 14 +++
mm/page_alloc.c | 2 +-
mm/page_idle.c | 4 +-
mm/swap.c | 183 ++++++++++++++++++++++++++++-----
mm/vmscan.c | 84 ++++++++-------
17 files changed, 620 insertions(+), 171 deletions(-)
create mode 100644 lib/list.c

--
2.18.0