Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats

From: Yosry Ahmed
Date: Tue Jul 25 2023 - 18:01:31 EST


On Tue, Jul 25, 2023 at 1:18 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Jul 25, 2023 at 12:24:19PM -0700, Yosry Ahmed wrote:
> > On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > We used to maintain *all* stats in per-cpu counters at the local
> > > level. memory.stat reads would have to iterate and aggregate the
> > > entire subtree every time. This was obviously very costly, so we added
> > > batched upward propagation during stat updates to simplify reads:
> > >
> > > commit 42a300353577ccc17ecc627b8570a89fa1678bec
> > > Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Date: Tue May 14 15:47:12 2019 -0700
> > >
> > > mm: memcontrol: fix recursive statistics correctness & scalabilty
> > >
> > > However, that caused a regression in the stat write path, as the
> > > upward propagation would bottleneck on the cachelines in the shared
> > > parents. The fix for *that* re-introduced the per-cpu loops in the
> > > local stat reads:
> > >
> > > commit 815744d75152078cde5391fc1e3c2d4424323fb6
> > > Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Date: Thu Jun 13 15:55:46 2019 -0700
> > >
> > > mm: memcontrol: don't batch updates of local VM stats and events
> > >
> > > So I wouldn't say it's a regression from rstat. Except for that short
> > > period between the two commits above, the read side for local stats
> > > was always expensive.
> >
> > I was comparing from an 4.15 kernel, so I assumed the major change was
> > from rstat, but that was not accurate. Thanks for the history.
> >
> > However, in that 4.15 kernel the local (non-hierarchical) stats were
> > readily available without iterating percpu counters. There is a
> > regression that was introduced somewhere.
> >
> > Looking at the history you described, it seems like up until
> > 815744d75152 we used to maintain "local" (aka non-hierarchical)
> > counters, so reading local stats was reading one counter, and starting
> > 815744d75152 we started having to loop percpu counters for that.
> >
> > So it is not a regression of rstat, but seemingly it is a regression
> > of 815744d75152. Is my understanding incorrect?
>
> Yes, it actually goes back further. Bear with me.
>
> For the longest time, it used to be local per-cpu counters. Every
> memory.stat read had to do nr_memcg * nr_cpu aggregation. You can
> imagine that this didn't scale in production.
>
> We added local atomics and turned the per-cpu counters into buffers:
>
> commit a983b5ebee57209c99f68c8327072f25e0e6e3da
> Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Wed Jan 31 16:16:45 2018 -0800
>
> mm: memcontrol: fix excessive complexity in memory.stat reporting
>
> Local counts became a simple atomic_read(), but the hierarchy counts
> would still have to aggregate nr_memcg counters.
>
> That was of course still too much read-side complexity, so we switched
> to batched upward propagation during the stat updates:
>
> commit 42a300353577ccc17ecc627b8570a89fa1678bec
> Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Tue May 14 15:47:12 2019 -0700
>
> mm: memcontrol: fix recursive statistics correctness & scalabilty
>
> This gave us two atomics at each level: one for local and one for
> hierarchical stats.
>
> However, that went too far the other direction: too many counters
> touched during stat updates, and we got a regression report over memcg
> cacheline contention during MM workloads. Instead of backing out
> 42a300353 - since all the previous versions were terrible too - we
> dropped write-side aggregation of *only* the local counters:
>
> commit 815744d75152078cde5391fc1e3c2d4424323fb6
> Author: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Thu Jun 13 15:55:46 2019 -0700
>
> mm: memcontrol: don't batch updates of local VM stats and events
>
> In effect, this kept all the stat optimizations for cgroup2 (which
> doesn't have local counters), and reverted cgroup1 back to how it was
> for the longest time: on-demand aggregated per-cpu counters.
>
> For about a year, cgroup1 didn't have to per-cpu the local stats on
> read. But for the recursive stats, it would either still have to do
> subtree aggregation on read, or too much upward flushing on write.
>
> So if I had to blame one commit for a cgroup1 regression, it would
> probably be 815744d. But it's kind of a stretch to say that it worked
> well before that commit.
>
> For the changelog, maybe just say that there was a lot of back and
> forth between read-side aggregation and write-side aggregation. Since
> with rstat we now have efficient read-side aggregation, attempt a
> conceptual revert of 815744d.

Now that's a much more complete picture. Thanks a lot for all the
history, now it makes much more sense. I wouldn't blame 815744d then,
as you said it's better framed as a conceptual revert of it. I will
rewrite the commit log accordingly and send a v2.

Thanks!

>
> > > But I want to be clear: this isn't a regression fix. It's a new
> > > performance optimization for the deprecated cgroup1 code. And it comes
> > > at the cost of higher memory footprint for both cgroup1 AND cgroup2.
> >
> > I still think it is, but I can easily be wrong. I am hoping that the
> > memory footprint is not a problem. There are *roughly* 80 per-memcg
> > stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS) and 55 per-lruvec stats
> > (NR_VM_NODE_STAT_ITEMS). For each stat there is an extra 8 bytes, so
> > on a two-node machine that's 8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.
> >
> > Given that struct mem_cgroup is already large, and can easily be 100s
> > of KiBs on a large machine with many cpus, I hope there won't be a
> > noticeable regression.
>
> Yes, the concern wasn't so much the memory consumption but the
> cachelines touched during hot paths.
>
> However, that was mostly because we either had a) write-side flushing,
> which is extremely hot during MM stress, or b) read-side flushing with
> huuuge cgroup subtrees due to zombie cgroups. A small cacheline
> difference would be enormously amplified by these factors.
>
> Rstat is very good at doing selective subtree flushing on reads, so
> the big coefficients from a) and b) are no longer such a big concern.
> A slightly bigger cache footprint is probably going to be okay.

Agreed, maintaining the local counters with rstat is much easier than
the previous attempts. I will try to bake most of this into the commit
log.