[PATCH 0/8] mm: memcontrol: switch to rstat v2

From: Johannes Weiner
Date: Fri Feb 05 2021 - 13:49:03 EST


This is version 2 of the memcg rstat patches. Updates since v1:

- added cgroup selftest output (see test section below) (thanks Roman)
- updated cgroup selftest to match new kernel implementation
- added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel)
- collected review & ack tags
- added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal)
- simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal)

---

This series converts memcg stats tracking to the streamlined rstat
infrastructure provided by the cgroup core code. rstat is already used
by the CPU controller and the IO controller. This change is motivated
by recent accuracy problems in memcg's custom stats code, as well as
the benefits of sharing common infra with other controllers.

The current memcg implementation does batched tree aggregation on the
write side: local stat changes are cached in per-cpu counters, which
are then propagated upward in batches when a threshold (32 pages) is
exceeded. This is cheap, but the error introduced by the lazy upward
propagation adds up: 32 pages times CPUs times cgroups in the subtree.
We've had complaints from service owners that the stats do not
reliably track and react to allocation behavior as expected, sometimes
swallowing the results of entire test applications.

The original memcg stat implementation used to do tree aggregation
exclusively on the read side: local stats would only ever be tracked
in per-cpu counters, and a memory.stat read would iterate the entire
subtree and sum those counters up. This didn't keep up with the times:

- Cgroup trees are much bigger now. We switched to lazily-freed
cgroups, where deleted groups would hang around until their
remaining page cache has been reclaimed. This can result in large
subtrees that are expensive to walk, while most of the groups are
idle and their statistics don't change much anymore.

- Automated monitoring increased. With the proliferation of userspace
oom killing, proactive reclaim, and higher-resolution logging of
workload trends in general, top-level stat files are polled at least
once a second in many deployments.

- The lifetime of cgroups got shorter. Where most cgroup setups in the
past would have a few large policy-oriented cgroups for everything
running on the system, newer cgroup deployments tend to create one
group per application - which gets deleted again as the processes
exit. An aggregation scheme that doesn't retain child data inside
the parents loses event history of the subtree.

Rstat addresses all three of those concerns through intelligent,
persistent read-side aggregation. As statistics change at the local
level, rstat tracks - on a per-cpu basis - only those parts of a
subtree that have changes pending and require aggregation. The actual
aggregation occurs on the colder read side - which can now skip over
(potentially large) numbers of recently idle cgroups.

---

The test_kmem cgroup selftest is currently failing due to excessive
cumulative vmstat drift from 100 subgroups:

ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

As you can see, memory.stat items far exceed memory.current. The
kernel stack alone is bigger than all of charged memory. That's
because the memory of the test has been uncharged from memory.current,
but the negative vmstat deltas are still sitting in the percpu caches.

The test at this time isn't even counting percpu, pagetables etc. yet,
which would further contribute to the error. The last patch in the
series updates the test to include them - as well as match error
expectations on the new implementation to catch future regressions.

With all patches applied, the (updated, more rigorous) test succeeds:

ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

---

A kernel build test confirms that overhead is comparable. Two kernels
are built simultaneously in a nested tree with several idle siblings:

root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
`- build-b (defconfig, make -j16)
`- idle-1
`- ...
`- idle-9

During the builds, kernelbuild/memory.stat is read once a second.

A perf diff shows that the changes in cycle distribution is
minimal. Top 10 kernel symbols:

0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated
0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0
0.16% -0.04% [kernel.kallsyms] [k] release_pages
0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events
0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0
0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size
0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault


---

The on-demand aggregated stats are now fully accurate:

$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
grep -e inactive_file /sys/fs/cgroup/memory.stat

vanilla: patched:
nr_inactive_file 1574105088 nr_inactive_file 1027801088
inactive_file 1577410560 inactive_file 1027801088

---

block/blk-cgroup.c | 14 +-
include/linux/memcontrol.h | 119 ++++-------
kernel/cgroup/cgroup.c | 34 +--
kernel/cgroup/rstat.c | 62 +++---
mm/memcontrol.c | 306 +++++++++++++--------------
tools/testing/selftests/cgroup/test_kmem.c | 22 +-
6 files changed, 266 insertions(+), 291 deletions(-)

Based on v5.11-rc5-mm1.