Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

From: Waiman Long
Date: Sun Jun 06 2021 - 18:14:06 EST


On 6/6/21 3:20 PM, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>> * perf-c2c: The hotspots (HITM) for the 2 kernels are different due to
>>   the data structure change
>>
>>   - old kernel
>>
>>     - first cacheline
>>         mmap_lock->count (75%)
>>         mm->mapcount (14%)
>>
>>     - second cacheline
>>         mmap_lock->owner (97%)
>>
>>   - new kernel
>>
>>     mainly in the cacheline of 'mmap_lock'
>>
>>         mmap_lock->count (~2%)
>>         mmap_lock->owner (95%)
>
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
>     struct rw_semaphore {
>             atomic_long_t count;
>             atomic_long_t owner;
>             struct optimistic_spin_queue osq; /* spinner MCS lock */
>             ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?

Yes, I think so.
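
To make the layout point concrete, here is a minimal user-space sketch
(hypothetical stand-in types, assuming 64-byte cachelines and the 8-byte
count/owner fields quoted above; not the real kernel structures) of why
offset 120 splits the two hot fields:

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-ins for the first two rwsem fields from the snippet above. */
    struct rwsem_sketch {
            long count;     /* offset 0 within the rwsem */
            long owner;     /* offset 8 within the rwsem */
    };

    /* Pad the container so the embedded lock lands at offset 120,
     * where mmap_sem happened to sit in the old mm_struct layout. */
    struct mm_sketch {
            char pad[120];
            struct rwsem_sketch mmap_lock;
    };

    int main(void)
    {
            size_t count_off = offsetof(struct mm_sketch, mmap_lock.count);
            size_t owner_off = offsetof(struct mm_sketch, mmap_lock.owner);

            /* 120/64 == 1 but 128/64 == 2: the two hot fields straddle
             * a cacheline boundary. */
            printf("count at offset %zu -> cacheline %zu\n",
                   count_off, count_off / 64);
            printf("owner at offset %zu -> cacheline %zu\n",
                   owner_off, owner_off / 64);
            return 0;
    }

Bump the pad to 128 bytes (roughly what inserting the 4-byte
write_protect_seq plus alignment padding presumably does) and both fields
report the same cacheline, which matches the perf-c2c data above.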

The count field is accessed when a task tries to acquire the rwsem or when an owner releases the lock. If the trylock fails, the writer goes into the slowpath and optimistically spins on the owner field. As a result, reads of owner greatly outnumber the reads and writes of count.

Normally there should be only one spinner (the one holding the OSQ lock) spinning on owner, so the 9.2% performance degradation seems a bit high to me. In the rare case that the head waiter in the wait queue sets the handoff flag, that waiter may also spin on owner, causing a bit more contention on the owner cacheline. I will investigate this possibility further when I have time.
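
For reference, the access pattern I am describing looks roughly like this
(a simplified sketch in C11 atomics, not the actual rwsem_spin_on_owner()
code, which additionally checks that the owner is still running on a CPU
and that the spinner does not need to reschedule):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Simplified writer-side optimistic spin: the one task holding the
     * OSQ lock keeps *loading* owner, so under contention that cacheline
     * sees a stream of reads, invalidated each time the lock changes
     * hands, while count is only touched by the acquire/release atomics. */
    bool spin_on_owner_sketch(atomic_long *owner)
    {
            while (atomic_load_explicit(owner, memory_order_relaxed) != 0)
                    ;       /* cpu_relax() in the real kernel loop */

            return true;    /* owner cleared: retry the cmpxchg on count */
    }

That single spinner doing pure loads is why I would not normally expect
the owner cacheline alone to account for this much degradation.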

Cheers,
Longman