Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats

From: Johannes Weiner
Date: Tue Jul 25 2023 - 16:18:23 EST


On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > We used to maintain *all* stats in per-cpu counters at the local
> > level. memory.stat reads would have to iterate and aggregate the
> > entire subtree every time. This was obviously very costly, so we added
> > batched upward propagation during stat updates to simplify reads:
> >
> > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Tue May 14 15:47:12 2019 -0700
> >
> > mm: memcontrol: fix recursive statistics correctness & scalabilty
> >
> > However, that caused a regression in the stat write path, as the
> > upward propagation would bottleneck on the cachelines in the shared
> > parents. The fix for *that* re-introduced the per-cpu loops in the
> > local stat reads:
> >
> > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Thu Jun 13 15:55:46 2019 -0700
> >
> > mm: memcontrol: don't batch updates of local VM stats and events
> >
> > So I wouldn't say it's a regression from rstat. Except for that short
> > period between the two commits above, the read side for local stats
> > was always expensive.
>
> I was comparing against a 4.15 kernel, so I assumed the major change
> was from rstat, but that was not accurate. Thanks for the history.
>
> However, in that 4.15 kernel the local (non-hierarchical) stats were
> readily available without iterating percpu counters. There is a
> regression that was introduced somewhere.
>
> Looking at the history you described, it seems like up until
> 815744d75152 we used to maintain "local" (aka non-hierarchical)
> counters, so reading local stats meant reading one counter, and
> starting with 815744d75152 we have to loop over percpu counters for
> that.
>
> So it is not a regression of rstat, but seemingly it is a regression
> of 815744d75152. Is my understanding incorrect?

Yes, it actually goes back further. Bear with me.

For the longest time, all the stats lived in local per-cpu counters.
Every memory.stat read had to do nr_memcg * nr_cpu aggregation. You
can imagine that this didn't scale in production.
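
To make that concrete, here is a rough userspace C sketch (hypothetical
names and made-up sizes, not the actual kernel code) of what such a
read amounted to:

#include <stdio.h>

#define NR_CPUS   4     /* made-up machine size */
#define NR_MEMCGS 3     /* memcgs in the subtree being read */

/* one per-cpu counter per memcg, for a single stat item */
static long percpu_stat[NR_MEMCGS][NR_CPUS];

/* memory.stat read: iterate every memcg and every CPU */
static long read_stat(void)
{
        long sum = 0;

        for (int memcg = 0; memcg < NR_MEMCGS; memcg++)
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        sum += percpu_stat[memcg][cpu];

        return sum;     /* O(nr_memcg * nr_cpu) work on every read */
}

int main(void)
{
        percpu_stat[0][1] = 10;
        percpu_stat[2][3] = 5;
        printf("stat = %ld\n", read_stat());
        return 0;
}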

We added local atomics and turned the per-cpu counters into buffers:

commit a983b5ebee57209c99f68c8327072f25e0e6e3da
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Wed Jan 31 16:16:45 2018 -0800

mm: memcontrol: fix excessive complexity in memory.stat reporting

Local counts became a simple atomic_read(), but the hierarchy counts
would still have to aggregate nr_memcg counters.
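
In the same spirit (userspace C sketch, hypothetical names, arbitrary
batching threshold), the update and local-read paths then looked
roughly like this:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define STAT_THRESHOLD 32       /* arbitrary batch size for the sketch */

struct memcg {
        long        percpu_buf; /* imagine one of these per CPU */
        atomic_long local_stat; /* flushed total, read with one load */
};

/* update path: buffer per-cpu, flush once the delta gets large */
static void mod_stat(struct memcg *cg, long delta)
{
        cg->percpu_buf += delta;
        if (labs(cg->percpu_buf) > STAT_THRESHOLD) {
                atomic_fetch_add(&cg->local_stat, cg->percpu_buf);
                cg->percpu_buf = 0;
        }
}

/* local read: a single atomic load instead of a per-cpu loop */
static long read_local_stat(struct memcg *cg)
{
        return atomic_load(&cg->local_stat);
}

int main(void)
{
        struct memcg cg = { 0 };

        for (int i = 0; i < 100; i++)
                mod_stat(&cg, 1);

        printf("local=%ld (%ld still buffered)\n",
               read_local_stat(&cg), cg.percpu_buf);
        return 0;
}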

That was of course still too much read-side complexity, so we switched
to batched upward propagation during the stat updates:

commit 42a300353577ccc17ecc627b8570a89fa1678bec
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Tue May 14 15:47:12 2019 -0700

mm: memcontrol: fix recursive statistics correctness & scalabilty

This gave us two atomics at each level: one for local and one for
hierarchical stats.
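
Sketched the same way (userspace C, hypothetical names), a stat update
now pushed the buffered delta up the ancestry once it got large enough:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define STAT_THRESHOLD 32       /* arbitrary batch size for the sketch */

struct memcg {
        struct memcg *parent;
        long          percpu_buf;       /* imagine one per CPU */
        atomic_long   local_stat;       /* this memcg only */
        atomic_long   hier_stat;        /* this memcg plus descendants */
};

/*
 * update path: once the buffered delta is large enough, add it to the
 * local atomic here and to the hierarchical atomic of every ancestor,
 * so reads at any level become a single load
 */
static void mod_stat(struct memcg *cg, long delta)
{
        cg->percpu_buf += delta;
        if (labs(cg->percpu_buf) <= STAT_THRESHOLD)
                return;

        atomic_fetch_add(&cg->local_stat, cg->percpu_buf);
        for (struct memcg *pos = cg; pos; pos = pos->parent)
                atomic_fetch_add(&pos->hier_stat, cg->percpu_buf);
        cg->percpu_buf = 0;
}

int main(void)
{
        struct memcg root = { 0 };
        struct memcg child = { .parent = &root };

        for (int i = 0; i < 100; i++)
                mod_stat(&child, 1);

        printf("child: local=%ld hier=%ld, root: hier=%ld\n",
               atomic_load(&child.local_stat),
               atomic_load(&child.hier_stat),
               atomic_load(&root.hier_stat));
        return 0;
}

Note how every flush now dirties cachelines in all the shared
ancestors, which is what bit us next.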

However, that went too far in the other direction: too many counters
were touched during stat updates, and we got a regression report about
memcg cacheline contention during MM workloads. Instead of backing out
42a300353 - since all the previous versions were terrible too - we
dropped write-side aggregation of *only* the local counters:

commit 815744d75152078cde5391fc1e3c2d4424323fb6
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Thu Jun 13 15:55:46 2019 -0700

mm: memcontrol: don't batch updates of local VM stats and events

In effect, this kept all the stat optimizations for cgroup2 (which
doesn't have local counters), and reverted cgroup1 back to how it was
for the longest time: on-demand aggregated per-cpu counters.

For about a year, cgroup1 didn't have to do per-cpu aggregation of
the local stats on read. But for the recursive stats, it would either
still have to do subtree aggregation on read, or too much upward
flushing on write.

So if I had to blame one commit for a cgroup1 regression, it would
probably be 815744d. But it's kind of a stretch to say that it worked
well before that commit.

For the changelog, maybe just say that there was a lot of back and
forth between read-side aggregation and write-side aggregation. Since
with rstat we now have efficient read-side aggregation, attempt a
conceptual revert of 815744d.

> > But I want to be clear: this isn't a regression fix. It's a new
> > performance optimization for the deprecated cgroup1 code. And it comes
> > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
>
> I still think it is, but I can easily be wrong. I am hoping that the
> memory footprint is not a problem. There are *roughly* 80 per-memcg
> stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
> (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
> on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.
>
> Given that struct mem_cgroup is already large, and can easily be 100s
> of KiBs on a large machine with many cpus, I hope there won't be a
> noticeable regression.

Yes, the concern wasn't so much the memory consumption but the
cachelines touched during hot paths.

However, that was mostly because we either had a) write-side
flushing, which is extremely hot during MM stress, or b) read-side
flushing with huge cgroup subtrees due to zombie cgroups. A small
cacheline difference would be enormously amplified by these factors.

Rstat is very good at doing selective subtree flushing on reads, so
the big coefficients from a) and b) are no longer such a big concern.
A slightly bigger cache footprint is probably going to be okay.