Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq

From: Mathieu Desnoyers
Date: Tue May 23 2023 - 13:30:23 EST


On 2023-05-23 12:32, Noah Goldstein wrote:
On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

On 2023-05-19 16:51, Noah Goldstein wrote:
On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
<libc-alpha@xxxxxxxxxxxxxx> wrote:

Expose the "on-cpu" state for each thread through struct rseq to allow
adaptative mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.

It is only provided as an optimization hint, because there is no
guarantee that the page containing this field is in the page cache, and
therefore the scheduler may very well fail to clear the on-cpu state on
preemption. This is expected to be rare though, and is resolved as soon
as the task returns to user-space.

The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Steven Rostedt (Google) <rostedt@xxxxxxxxxxx>
Cc: Carlos O'Donell <carlos@xxxxxxxxxx>
Cc: Florian Weimer <fweimer@xxxxxxxxxx>
Cc: libc-alpha@xxxxxxxxxxxxxx
---
include/linux/sched.h | 12 ++++++++++++
include/uapi/linux/rseq.h | 17 +++++++++++++++++
kernel/rseq.c | 14 ++++++++++++++
3 files changed, 43 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..c7e9248134c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
rseq_handle_notify_resume(ksig, regs);
}

+void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
+
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+ if (t->rseq)
+ __rseq_set_sched_state(t, state);
+}
+
/* rseq_preempt() requires preemption to be disabled. */
static inline void rseq_preempt(struct task_struct *t)
{
__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
rseq_set_notify_resume(t);
+ rseq_set_sched_state(t, 0);

Should rseq_migrate also be made to update the cpu_id of the new core?
I imagine the usage of this will be something along the lines of:

if(!on_cpu(mutex->owner_rseq_struct) &&
cpu(mutex->owner_rseq_struct) == this_threads_cpu)
// goto futex

So I would think updating on migrate would be useful as well.

I don't think we want to act differently based on the cpu on which the
owner is queued.

If the mutex owner is not on-cpu, and queued on the same cpu as the
current thread, we indeed want to call sys_futex WAIT.

If the mutex owner is not on-cpu, but queued on a different cpu than the
current thread, we *still* want to call sys_futex WAIT, because
busy-waiting for a thread which is queued but not currently running is
wasteful.

I think this is less clear. In some cases sure but not always. Going
to the futex
has more latency that userland waits, and if the system is not busy (other than
the one process) most likely less latency that yield. Also going to the futex
requires a syscall on unlock.

For example if the critical section is expected to be very small, it
would be easy
to imagine the lock be better implemented with:
while(is_locked)
if (owner->on_cpu || owner->cpu != my_cpu)
exponential backoff
else
yield

Its not that "just go to futex" doesn't ever make sense, but I don't
think its fair
to say that *always* the case.

Looking at the kernel code, it doesn't seem to be a particularly high cost to
keep the CPU field updated during migration so seems like a why not
kind of question.

We already have the owner rseq_abi cpu_id field populated on every return-to-userspace. I wonder if it's really relevant that migration populates an updated value in this field immediately ? It's another case where this would be provided as a hint updated only if the struct rseq is in the page cache, because AFAIU the scheduler migration path cannot take a page fault.

Also, if a thread bounces around many runqueues before being scheduled again, we would be adding those useless stores to the rseq_abi structure at each migration between runqueues.

Given this would add some complexity to the scheduler migration code, I would want to see metrics/benchmarks showing that it indeed improves real-world use-cases before adding this to the rseq ABI.

It's not only a question of added lines of code as of today, but also a question of added userspace ABI guarantees which can prevent future scheduler optimizations. I'm *very* careful about keeping those to a strict minimum, which I hope Peter Zijlstra appreciates.

Thanks,

Mathieu


Or am I missing something ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com