Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

From: Yosry Ahmed
Date: Tue Aug 15 2023 - 22:21:06 EST


On Tue, Aug 15, 2023 at 6:14 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Tue, Aug 15, 2023 at 5:29 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> [...]
> > >
> > > I thought we already reached the decision on how to proceed here. Let
> > > me summarize what I think we should do:
> > >
> > > 1. Completely remove the sync flush from stat files read from userspace.
> > > 2. Provide a separate way/interface to explicitly flush stats for
> > > users who want more accurate stats and can pay the cost. This is
> > > similar to the stat_refresh interface.
> > > 3. Keep the 2 sec periodic stats flusher.
> >
> > I think this solution is suboptimal to be honest, I think we can do better.
> >
> > With recent improvements to spinlocks/mutexes, and flushers becoming
> > sleepable, I think a better solution would be to remove unified
> > flushing and let everyone only flush the subtree they care about. Sync
> > flushing becomes much better (unless you're flushing root ofc), and
> > concurrent flushing wouldn't cause too many problems (ideally no
> > thundering herd, and rstat lock can be dropped at cpu boundaries in
> > cgroup_rstat_flush_locked()).
> >
> > If we do this, stat reads can be much faster as Ivan demonstrated with
> > his patch that only flushes the cgroup being read, and we do not
> > sacrifice accuracy as we never skip flushing. We also do not need a
> > separate interface for explicit refresh.
> >
> > In all cases, we need to keep the 2 sec periodic flusher. What we need
> > to figure out if we remove unified flushing is:
> >
> > 1. Handling stats_flush_threshold.
> > 2. Handling flush_next_time.
> >
> > Both of these are global now, and will need to be adapted to
> > non-unified non-global flushing.
>
> The only thing we are disagreeing on is (1) the complete removal of
> sync flush and an explicit flush interface versus (2) keep doing the
> sync flush of the subtree.
>
> To me (1) seems more optimal particularly for the server use-case
> where a node controller reads stats of root as well as cgroups of
> a couple of top levels (we actually do this internally). Doing flush
> once explicitly and then reading the stats for all such cgroups seems
> better to me.

The problem with (1) is that, first of all, it's a behavioral change:
we start having explicit staleness in the stats, and userspace needs
to adapt by explicitly requesting a flush. A node controller can be
enlightened to do so, but on a system with a lot of cgroups, if you
flush once explicitly and iterate through all cgroups, the flushed
stats will already be stale by the time you reach the last cgroup.
Keep in mind there are also users that read their own stats. Figuring
out which users need to flush explicitly vs. read cached stats is a
problem.

Taking a step back, the total work that needs to be done does not
change with (2). A node controller iterating cgroups and reading their
stats will do the same amount of flushing; it will just be distributed
across multiple read syscalls, so each one spends a shorter interval
in kernel space.
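
To illustrate the "same total work" point, here is a toy userspace
model, not kernel code. Every name in it (toy_cgroup, toy_add_child,
toy_flush_subtree) is made up for illustration, and it assumes no new
updates arrive between reads and ignores propagation of deltas to
ancestors:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical stand-in for a cgroup with unflushed stat updates. */
struct toy_cgroup {
	struct toy_cgroup *children[TOY_MAX_CHILDREN];
	size_t nr_children;
	unsigned long pending;	/* units of flush work still queued here */
};

static void toy_add_child(struct toy_cgroup *parent, struct toy_cgroup *child)
{
	assert(parent->nr_children < TOY_MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
}

/*
 * Flush only the subtree rooted at @cg and return how many units of
 * work this one call consumed. Each pending unit is consumed by
 * exactly one call, whichever call reaches it first.
 */
static unsigned long toy_flush_subtree(struct toy_cgroup *cg)
{
	unsigned long work = cg->pending;
	size_t i;

	cg->pending = 0;
	for (i = 0; i < cg->nr_children; i++)
		work += toy_flush_subtree(cg->children[i]);

	return work;
}
```

With root -> {a, b}, a -> {a1} and pending counts of 1, 2, 3 and 4 on
root, a, a1 and b, a single flush at root consumes all 10 units in one
call. Flushing a1, a, b and root in turn consumes 3, 2, 4 and 1 units:
the same 10 in total, but no single call does more than 4.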

There are also in-kernel flushers (e.g. reclaim and dirty throttling)
that will benefit from (2) by reading more accurate stats without
having to flush the entire tree. The behavior is currently
nondeterministic: you may get fresh or stale stats, and you may flush
one cgroup or 100 cgroups.

I think with (2) we make fewer compromises in terms of accuracy and
determinism, and it's a less disruptive change to userspace.