Re: [PATCH] mm: memcg: provide accurate stats for userspace reads

From: Yosry Ahmed
Date: Tue Aug 15 2023 - 20:30:22 EST


On Tue, Aug 15, 2023 at 5:23 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> +Ivan
>
> On Mon, Aug 14, 2023 at 5:51 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Mon, Aug 14, 2023 at 5:48 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > On Mon, Aug 14, 2023 at 05:39:15PM -0700, Yosry Ahmed wrote:
> > > > I believe dropping unified flushing, if possible of course, may fix
> > > > both problems.
> > >
> > > Yeah, flushing the whole tree for every stat read will push up the big O
> > > complexity of the operation. It shouldn't be too bad because only what's
> > > updated since the last read will need flushing but if you have a really big
> > > machine with a lot of constantly active cgroups, you're still gonna feel it.
> > > So, yeah, drop that and switch the global lock to mutex and we should all be
> > > good?
> >
> > I hope so, but I am not sure.
> >
> > The unified flushing was added initially to mitigate a thundering herd
> > problem from concurrent in-kernel flushers (e.g. concurrent reclaims),
> > but back then flushing was atomic so we had to keep the spinlock held
> > for a long time. I think it should be better now, but I am hoping
> > Shakeel will chime in since he added the unified flushing originally.
> >
> > We also need to agree on what to do about stats_flushing_threshold and
> > flush_next_time since they're both global now (since all flushing is
> > global).
> >
>
> I thought we already reached the decision on how to proceed here. Let
> me summarize what I think we should do:
>
> 1. Completely remove the sync flush from stat files read from userspace.
> 2. Provide a separate way/interface to explicitly flush stats for
> users who want more accurate stats and can pay the cost. This is
> similar to the stat_refresh interface.
> 3. Keep the 2 sec periodic stats flusher.

I think this solution is suboptimal to be honest, I think we can do better.

With recent improvements to spinlocks/mutexes, and flushers becoming
sleepable, I think a better solution would be to remove unified
flushing and let everyone only flush the subtree they care about. Sync
flushing becomes much better (unless you're flushing root ofc), and
concurrent flushing wouldn't cause too many problems (ideally no
thundering herd, and rstat lock can be dropped at cpu boundaries in
cgroup_rstat_flush_locked()).

If we do this, stat reads can be much faster as Ivan demonstrated with
his patch that only flushes the cgroup being read, and we do not
sacrifice accuracy as we never skip flushing. We also do not need a
separate interface for explicit refresh.

In all cases, we need to keep the 2 sec periodic flusher. What we need
to figure out if we remove unified flushing is:

1. Handling stats_flush_threshold.
2. Handling flush_next_time.

Both of these are global now, and will need to be adapted to
non-unified non-global flushing.