Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

From: Marcelo Tosatti
Date: Wed Apr 26 2023 - 12:20:22 EST

Next message: Lee Jones: "Re: [PATCH v6] leds: max597x: Add support for max597x"
Previous message: liuwf: "Library: [RFC PATCH] btree_blue - A simple btree with fast linear travesal and more. V4. and test data"
In reply to: Vlastimil Babka: "Re: [PATCH v7 00/13] fold per-CPU vmstats remotely"
Next in thread: Michal Hocko: "Re: [PATCH v7 00/13] fold per-CPU vmstats remotely"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Apr 26, 2023 at 05:04:49PM +0200, Vlastimil Babka wrote:
> On 4/20/23 15:45, Marcelo Tosatti wrote:
> > Perhaps the complexity should be judged for individual cases
> > of interruptions, and if a given interruption-free conversion
> > is seen as too complex, then a "disable feature which makes use of per-CPU
> > caches" style solution can be made (and then userspace has to
> > explicitly request for that per-CPU feature to be disabled).
> >
> > But i don't see that this patchset introduces unmanageable complexity,
> > neither:
> >
> > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
>
> Well that one is a bit different, as there was one kind of lock replaced
> with other kind of lock,

local_lock is defined to NULL if CONFIG_PREEMPT_RT is not set.
So for the !CONFIG_PREEMPT_RT case, it introduced a lock.

> the lock is uncontended unless there's remote
> flushes happening so it's not causing extra overhead for the fast paths,
> and later even the irq disabling was removed, which should even improve
> things. But this patchset is turning all vmstat counter increments a
> cmpxchg.

Yes, and we have a similar situation in this case:

1) CMPXCHG is already used to protect many vmstat counter increments.
2) The patchset adds "LOCK CMPXCHG" to existing CMPXCHG user.
3) The performance decrease is negligible, because cache locking is
effective.

"To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
my_page = alloc_page(gfp_mask);
if (unlikely(my_page == NULL))
return 0;
__free_page(my_page);
}

Unit is cycles.

Vanilla Patched Diff
115.25 117 1.4%"

To be honest, that 1.4% difference was not stable but fluctuated between
positive and negative percentages (so the performance difference was in
the noise).

So performance is not a decisive factor in this case.

Next message: Lee Jones: "Re: [PATCH v6] leds: max597x: Add support for max597x"
Previous message: liuwf: "Library: [RFC PATCH] btree_blue - A simple btree with fast linear travesal and more. V4. and test data"
In reply to: Vlastimil Babka: "Re: [PATCH v7 00/13] fold per-CPU vmstats remotely"
Next in thread: Michal Hocko: "Re: [PATCH v7 00/13] fold per-CPU vmstats remotely"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]