Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

From: Michal Hocko
Date: Wed Feb 01 2023 - 07:52:17 EST


On Wed 01-02-23 01:36:07, Leonardo Brás wrote:
[...]
> Michal, Roman: Could you please review my argumentation below, so I can
> understand what exactly is wrong ?
>
> > >
> > > IIUC, we can approach this by dividing the problem in two working modes:
> > > 1 - Normal, meaning no drain_all_stock() running.
> > > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
> > > destroying, reconfiguring a memcg.
> > >
> > > For (1), we will have (ideally) only local cpu working on the percpu struct.
> > > This mode will not have any kind of contention, because each CPU will hold it's
> > > own spinlock only.

You are still hitting locked atomic operations which is not the case for
a simple pcp access. As already mentioned page counter (which would be
taken in case of no pcp cache) is also about atomics (elbeit more of
them with a deeper cgroup hierarchy).

> > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot
> > > of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
> > > local cpus having to wait for a lock to get their cache, on the patch proposal.

Yes.

> > > Ok, given the above is correct:
> > >
> > > # Some arguments point that (1) becomes slower with this patch.
> > >
> > > This is partially true: while test 2.2 pointed that local cpu functions running
> > > time had became slower by a few cycles, test 2.4 points that the userspace
> > > perception of it was that the syscalls and pagefaulting actually became faster:
> > >
> > > During some debugging tests before getting the performance on test 2.4, I
> > > noticed that the 'syscall + write' test would call all those functions that
> > > became slower on test 2.2. Those functions were called multiple millions of
> > > times during a single test, and still the patched version performance test
> > > returned faster for test 2.4 than upstream version. Maybe the functions became
> > > slower, but overall the usage of them in the usual context became faster.

It is hard to tell anything without proper analysis of those numbers.

> > > Is not that a small improvement?
> > >
> > > # Regarding (2), I notice that we fear contention
> > >
> > > While this seems to be the harder part of the discussion, I think we have enough
> > > data to deal with it.
> > >
> > > In which case contention would be a big problem here? 
> > > IIUC it would be when a lot of drain_all_stock() get running because the memory
> > > limit is getting near. I mean, having the user to create / modify a memcg
> > > multiple times a second for a while is not something that is expected, IMHO.

We have already explained when that happens. You are right this is not
happening all that much in most workloads. But this is a user triggered
operation so you cannot really many assumptions. It will really depend
on the workload.

> > Considering that the use of spinlocks with remote draining is the more general solution,
> > what would be a test-case to demonstrate a contention problem?
>
> IIUC we could try to reproduce a memory tight workload that keeps allocating /
> freeing from different cpus (without hitting OOM).
>
> Michal, Roman: Is that correct? You have any workload like that so we can test?

It would be easy to create artificial workload that would hit this. All
you need is to trigger any of the code paths that have already been
mentioned elsewhere from the userspace.

Let me be clear here. Unless there are some real concerns about not
flushing remotely on isolated cpus then the spin lock approach is just
much less preferable. So I would much rather focus on the trivial patch
and investigates whether there are no new corner cases to be created by
that.
--
Michal Hocko
SUSE Labs