Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

From: Leonardo Brás
Date: Fri Feb 03 2023 - 23:56:18 EST


Michal, Roman,

I understand you have far more experience in this codebase than I do, so please
help me understand what I am missing in my argument for the spinlock approach.

I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic
> > testcase/scenario with using the spinlock in this particular case?
>
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I read through the code that calls consume_stock(), and I believe you are
referring to the atomic operations on memcg->memory->usage (and others) used in
page_counter_try_charge(). That is a single atomic variable per memcg which is
potentially accessed by every cpu. Is that correct?

I understand the cost of such an atomic operation is high because of the
inherent cost of bouncing the variable's cacheline between those cpus.
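
Just to make sure we are talking about the same code, this is roughly the fast
path I have in mind, heavily trimmed from my reading of page_counter_try_charge()
in mm/page_counter.c, keeping only the shared atomic:

	/* Trimmed sketch, not the exact upstream code. */
	for (c = counter; c; c = c->parent) {
		long new;

		/* A single atomic counter, shared by every cpu charging this memcg. */
		new = atomic_long_add_return(nr_pages, &c->usage);
		if (new > c->max) {
			atomic_long_sub(nr_pages, &c->usage);
			/* ... record the failing counter and unwind the charges above ... */
			return false;
		}
	}
	return true;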

The difference between the 'usage' atomic variable and the spinlock we are
proposing is the scope: the spinlock is defined in a per-cpu structure, and most
of its accesses will come from the local cpu.

According to the Intel® 64 and IA-32 Architectures Software Developer's Manual,
Vol. 3A, page 9-5:

###
9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is performing the
LOCK operation as write-back memory and is completely contained in a cache line,
the processor may not assert the LOCK# signal on the bus. Instead, it will
modify the memory location internally and allow its cache coherency mechanism
to ensure that the operation is carried out atomically. This operation is called
“cache locking.” The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from simultaneously
modifying data in that area.
###

So taking a per-cpu spinlock whose cacheline is already in the current cpu's
cache (which happens most of the time) is about as cheap as modifying a local
cacheline. Since that cacheline is already being modified by the local-cpu
functions anyway, the write-back to memory is batched.

For reference, this is the pahole output for struct memcg_stock_pcp after my
patch. The spinlock fits in the same cacheline as the variables being changed.

struct memcg_stock_pcp {
	spinlock_t                 stock_lock;              /*     0     4 */
	unsigned int               nr_pages;                /*     4     4 */
	struct mem_cgroup *        cached;                  /*     8     8 */
	struct obj_cgroup *        cached_objcg;            /*    16     8 */
	struct pglist_data *       cached_pgdat;            /*    24     8 */
	unsigned int               nr_bytes;                /*    32     4 */
	int                        nr_slab_reclaimable_b;   /*    36     4 */
	int                        nr_slab_unreclaimable_b; /*    40     4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};
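
To illustrate, this is roughly the shape consume_stock() takes with that per-cpu
spinlock (a simplified sketch of the idea, not a verbatim copy of the patch):

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	/* The lock shares the (usually local) cacheline with the data it protects. */
	stock = this_cpu_ptr(&memcg_stock);
	spin_lock_irqsave(&stock->stock_lock, flags);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}
	spin_unlock_irqrestore(&stock->stock_lock, flags);

	return ret;
}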

The only accesses that will come from a remote cpu, and thus cause the cacheline
to bounce and the lock to become more expensive, are the ones from
drain_all_stock(). OTOH, in upstream, those accesses already modify the remote
cacheline through an atomic variable (memcg_stock_pcp.flags), so we are only
dealing with potential contention costs.
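
For clarity, this is the relevant part of the current drain_all_stock(), trimmed
from my reading of upstream (the objcg check is omitted): the remote stock's
cacheline is already read (cached, nr_pages) and written with an atomic RMW
(flags) before any work is queued:

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *memcg;
		bool flush = false;

		rcu_read_lock();
		memcg = stock->cached;		/* remote cacheline read */
		if (memcg && stock->nr_pages &&
		    mem_cgroup_is_descendant(memcg, root_memcg))
			flush = true;
		rcu_read_unlock();

		/* remote atomic RMW on the same cacheline */
		if (flush &&
		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
			if (cpu == curcpu)
				drain_local_stock(&stock->work);
			else
				schedule_work_on(cpu, &stock->work);
		}
	}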

>
> Not to mention a potential contention which is hard to predict and it
> will depend on the specific runtime very much. So not something that
> would be easy to test for other than artificial testcases. Full cpu
> pcp draining is not a very common operation and one could argue that
> the problem will be limited 
>

Yes, but we are exchanging an "always schedule_work_on()", which is a kind of
contention, for a "sometimes we hit spinlock contention".

With the spinlock proposal, the *worst case* contention on the local cpu side
is:
1 - waiting for the spin_unlock() of a complete <percpu cache drain process>,
2 - waiting to get the local per-cpu cacheline back into the local cache.

With what is currently implemented (the schedule_work_on() approach), the local
cpu side *always* pays this contention:
1 - waiting for a context switch,
2 - waiting to get its local per-cpu cacheline into the local cache,
3 - waiting for a complete <percpu cache drain process>,
4 - waiting for another context switch back to the current thread.

So moving from schedule_work_on() to spinlocks will save 2 context switches per
cpu every time drain_all_stock() is called.

On the remote cpu side, my tests indicate that draining remotely is faster than
scheduling a local drain, so it's also a gain.

Also, IIUC, the possible contention in the spinlock approach happens only during
page faults and syscalls, whereas the schedule_work_on() approach can interrupt
the user workload at any time.

In fact, not interrupting the user workload on isolated cpus is just a bonus of
using spinlocks.

> but so far I haven't really heard any strong
> arguments against the proposal to avoid scheduling the work on isolated
> cpus which is a much more simpler solution and achieves the same
> effect.

I understand the appeal of 'not scheduling work on isolated cpus': it's a much
simpler change, and it will only affect cases with isolated cpus, so it is safer
than changing the whole locking structure.

I am not against it. I already have a patch implementing it for testing, and I
would gladly send it if you see fit.

But if nothing else, it introduces another special case, and now the code has to
deal differently with the local cpu, remote cpus, and isolated cpus.

With spinlocks, there is a much simpler solution (a rough sketch of the
resulting drain loop follows the list):
- All cpus are dealt with the same way, and the local cpu path stays fast (as in
the upstream implementation).
- We don't have to avoid draining on isolated cpus.
- We get rid of the work_struct and its scheduling costs.
- We get rid of the "flags" bit, as we no longer need to guard against scheduling
the local draining more than once.
- When drain_all_stock() returns, all stocks have already been drained, so
retrying consume_stock() may have immediate results in pre-OOM cases.
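
Roughly, the drain loop then becomes something like this (a simplified sketch,
not the exact patch; stock_flush_required() here just stands for the existing
"does this stock cache something under root_memcg" check, it is not an upstream
function name):

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		unsigned long flags;

		/*
		 * Drain the stock directly under its spinlock, local or
		 * remote: when the loop finishes, everything is drained.
		 */
		spin_lock_irqsave(&stock->stock_lock, flags);
		if (stock_flush_required(stock, root_memcg))
			drain_stock(stock);
		spin_unlock_irqrestore(&stock->stock_lock, flags);
	}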

On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote:
[...]
> Let me be clear here. Unless there are some real concerns about not
> flushing remotely on isolated cpus then the spin lock approach is just
> much less preferable. So I would much rather focus on the trivial patch
> and investigates whether there are no new corner cases to be created by
> that.

I understand the 'not scheduling work on isolated cpus' approach should work
with low impact on current behavior. I also understand that, as maintainers, you
have to consider many more factors than a regular developer like me, and that
the spinlock approach is a big change in how the pcp memcg caches work.

On that topic, if there is any comparison test you find important to run, or any
topic you think should be discussed further, please let me know.

Thank you for reviewing,
Leo