[RFC PATCH 0/4] Improve RT performance & latency w/ local_lock_t()

From: Leonardo Bras
Date: Sat Jul 29 2023 - 04:39:32 EST


The problem:
We have a few scenarios in mm that make use of local_locks() +
schedule_work_on() due to reduced overhead on the most common local
references. This scenario is not ideal for RT workloads though: even on
isolated cpus, those tasks will schedule-out the sensitive RT workloads to
perform those jobs, and usually cause missing deadlines with this.

The idea:
In PREEMPT_RT, local_locks() end up becoming spinlocks(), so there should
be no harm in just getting another cpu's spinlock to perform the work
on the per-cpu structure: the cacheline invalidation will already happen
due to the usage by schedule_work_on(), and on local functions the locking
already happens anyway.

This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload. Given the usually brief nature of those scheduled tasks, their
execution end up being faster than doing their scheduling.

======

While I really belive the solution, there are problems with this patchset,
which I need your suggestions for improvement:

1) Naming is horrible: I could not think on a good name for the new lock
functions, so I lazely named it local_lock_n().
The naming should have been making clear that we are in fact dealing
with a local_lock but it can in some scenarios be addressing another cpu's
local_lock, and thus we need the extra cpu parameter.

Dealing with this local & remote duality, all I thought was:
mostly_local_lock(), (or local_mostly_lock())
local_maybe_remote_lock() <- LOL
remote_local_lock()
per_cpu_local_lock()
local_lock_cpu()

Please help !


2) Maybe naming is this hard because the interface is not the best.
My idea was to create a simple interface to easily replace functions
already in use, but maybe there is something better I could not see.

Please contribute!

3) I am lazely setting work->data without atomic operations, which may
be bad in some scenario. If so, even thought it can be costly, since it
happens outside of the hotpath (local per-cpu areas) it should be no
problem adding the atomic operation for non-RT kernels.

For RT kernels, since the whole operation hapens locally, there should be
no need of using the atomic set.

Please let me know of any idea, or suggestion that can improve this RFC.

Thanks a lot!
Leo

Leonardo Bras (4):
Introducing local_lock_n() and local queue & flush
swap: apply new local_schedule_work_on() interface
memcontrol: apply new local_schedule_work_on() interface
slub: apply new local_schedule_work_on() interface

include/linux/local_lock.h | 18 ++++++++++
include/linux/local_lock_internal.h | 52 +++++++++++++++++++++++++++++
mm/memcontrol.c | 17 ++++++----
mm/slub.c | 17 ++++++----
mm/swap.c | 18 +++++-----
5 files changed, 100 insertions(+), 22 deletions(-)

--
2.41.0