RE: [PATCH v2] locking/osq_lock: Avoid false sharing in optimistic_spin_node

From: David Laight
Date: Sat Dec 23 2023 - 08:06:55 EST


From: Zeng Heng
> Sent: 23 December 2023 08:55
>
> On 2023/12/22 20:40, David Laight wrote:
> > From: Zeng Heng
> >> Sent: 22 December 2023 12:11
> >>
> >> Using the UnixBench test suite with the perf tool, we clearly see that
> >> osq_lock() causes extremely high overhead in the File Copy items:
> >>
> >> Overhead  Shared Object  Symbol
> >>   94.25%  [kernel]       [k] osq_lock
> >>    0.74%  [kernel]       [k] rwsem_spin_on_owner
> >>    0.32%  [kernel]       [k] filemap_get_read_batch
> >>
> >> In response to this, we conducted an analysis and made some gains:
> >>
> >> In the prologue of osq_lock(), the `cpu` member of the percpu struct
> >> optimistic_spin_node is set to the local cpu id; after that, the value
> >> in the percpu struct never actually changes. Based on that, we can treat
> >> the `cpu` member as a constant.
> >>
> > ...
> >> @@ -9,7 +11,13 @@
> >>  struct optimistic_spin_node {
> >>  	struct optimistic_spin_node *next, *prev;
> >>  	int locked; /* 1 if lock acquired */
> >> -	int cpu; /* encoded CPU # + 1 value */
> >> +
> >> +	CACHELINE_PADDING(_pad1_);
> >> +	/*
> >> +	 * Stores an encoded CPU # + 1 value.
> >> +	 * Only read by other cpus, so split into different cache lines.
> >> +	 */
> >> +	int cpu;
> >>  };
> > Isn't this structure embedded in every mutex and rwsem (etc)?
> > So that is a significant bloat especially on systems with
> > large cache lines.

This code is making my head hurt :-)
The 'spin_node' only exists per-cpu, so it is not embedded in every
mutex/rwsem.
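
To put rough numbers on that, here is a userspace sketch (not kernel code)
comparing the two node layouts. It assumes a 64-byte cache line and uses C11
alignas as a stand-in for CACHELINE_PADDING(); the node_plain, node_padded,
queue_demo and demo_node_growth names are made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumption: 64-byte cache lines */

/* Layout before the patch. */
struct node_plain {
	struct node_plain *next, *prev;
	int locked;
	int cpu;
};

/* Approximation of the patched layout: 'cpu' starts a new cache line. */
struct node_padded {
	struct node_padded *next, *prev;
	int locked;
	alignas(DEMO_CACHELINE) int cpu;
};

/* Stand-in for what each mutex/rwsem embeds: only the tail word. */
struct queue_demo {
	int tail;
};

/* The padding pushes 'cpu' to the start of the second cache line. */
static_assert(offsetof(struct node_padded, cpu) == DEMO_CACHELINE,
	      "cpu should start the second cache line");

/* Extra bytes per per-cpu node caused by the padding. */
static size_t demo_node_growth(void)
{
	return sizeof(struct node_padded) - sizeof(struct node_plain);
}
```

The growth is paid once per cpu, in the per-cpu node. The tail word that
sits inside the lock keeps its size.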

> > Did you try just moving the initialisation of the per-cpu 'node'
> > below the first fast-path (uncontended) test in osq_lock()?

Reading more closely, the node fields do need to be valid before the
fast-path cmpxchg.
But I suspect the 'cache line dirty' could be done conditionally
or in the unlock/fail path.

I think the unlock fast-path always has node->next == NULL and it is
set to NULL in the slow path.
The lock-fail path calls osq_wait_next() - which also NULLs it.
So maybe it is always NULL on entry anyway?

node->locked is set by the slow-path lock code.
So it could be cleared when checked, or at any time before the unlock
returns. Possibly unconditionally in the unlock slow path and
conditionally in the unlock fast path?

I think that would mean the assignment to node in osq_lock() could
be moved below the first xchg() (provided 'node' can be initialised).
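
A rough userspace sketch of that idea, using C11 atomics instead of the
kernel primitives. Everything named demo_* is hypothetical, the spin/unqueue
slow path is left out, and no memory ordering beyond the seq_cst defaults is
attempted. It relies on the assumption discussed above: node->next is NULL
and node->locked is 0 on entry, and node->cpu was set once at init, so the
uncontended path does not need to store to the node at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_UNLOCKED	0
#define DEMO_NR_CPUS	4	/* hypothetical cpu count */

struct demo_node {
	struct demo_node *next, *prev;
	int locked;
	int cpu;		/* encoded cpu + 1, written once at init */
};

struct demo_queue {
	atomic_int tail;	/* DEMO_UNLOCKED or encoded cpu of the tail */
};

static struct demo_node demo_nodes[DEMO_NR_CPUS];
static int demo_node_stores;	/* instrumentation: stores made while locking */

/* One-time setup that establishes the "clean on entry" state. */
static void demo_init_nodes(void)
{
	for (int i = 0; i < DEMO_NR_CPUS; i++) {
		demo_nodes[i].next = NULL;
		demo_nodes[i].prev = NULL;
		demo_nodes[i].locked = 0;
		demo_nodes[i].cpu = i + 1;
	}
	demo_node_stores = 0;
}

/* Returns true if the lock was taken uncontended, false if we queued. */
static bool demo_lock(struct demo_queue *lock, int cpu)
{
	struct demo_node *node, *prev;
	int curr = cpu + 1;
	int old = atomic_exchange(&lock->tail, curr);

	if (old == DEMO_UNLOCKED)
		return true;	/* fast path: node cache line not dirtied */

	/* Contended: only now look at, and write to, our node. */
	node = &demo_nodes[cpu];
	assert(node->next == NULL && node->locked == 0);
	prev = &demo_nodes[old - 1];
	node->prev = prev;
	prev->next = node;
	demo_node_stores += 2;
	return false;		/* real code would spin on node->locked here */
}

/* Unlock fast path: succeeds only if nobody queued behind us. */
static bool demo_unlock_fast(struct demo_queue *lock, int cpu)
{
	int curr = cpu + 1;

	return atomic_compare_exchange_strong(&lock->tail, &curr,
					      DEMO_UNLOCKED);
}
```

An uncontended demo_lock()/demo_unlock_fast() pair leaves demo_node_stores
at 0; only a second cpu arriving while the lock is held causes node stores.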

I also wonder what the performance difference is between
smp_processor_id() and this_cpu_ptr(&osq_node)?

Bloating all the mutex/rwsem by 4 bytes (on 64bit) and changing
lock->tail to 'struct optimistic_spin_node *' (and moving its
definition into the .c file) may well improve performance?
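
For illustration only, a userspace sketch of what a pointer-valued tail could
look like with C11 atomics. The ptr_node, queue_encoded, queue_pointer and
demo_swap_tail names are hypothetical and this is not a proposed patch. The
point is that the previous tail comes back as a node pointer directly, with
no cpu-number decode and per-cpu lookup in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ptr_node {
	struct ptr_node *next, *prev;
	int locked;
};

/* Tail holds an encoded cpu number (int-sized), as in the current scheme. */
struct queue_encoded {
	atomic_int tail;
};

/* Tail holds the node pointer itself; NULL means unlocked. */
struct queue_pointer {
	_Atomic(struct ptr_node *) tail;
};

/* The pointer form is never smaller than the encoded form. */
static_assert(sizeof(struct queue_pointer) >= sizeof(struct queue_encoded),
	      "pointer tail should not be smaller than int tail");

/* Install 'me' as the new tail and return the previous tail node. */
static struct ptr_node *demo_swap_tail(struct queue_pointer *q,
				       struct ptr_node *me)
{
	return atomic_exchange(&q->tail, me);
}
```

How much a containing mutex or rwsem really grows depends on its other
members and their alignment, so this only shows the size of the queue word.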


> >
> > OTOH if you really have multiple cpus spinning on the same rwsem
> > perhaps the test and/or filemap code are really at fault!
> >
> > David
>
> Hi,
>
> The File Copy items of the UnixBench test suite use 1 read file and 1
> write file for the file read/write/copy tests. In a multi-parallel
> scenario, that leads to high file lock contention.
>
> That is just a performance test suite and has nothing to do with whether
> the user program design is correct or not.

But it might be stressing some code paths that don't usually happen.

David
