Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq

From: Mathieu Desnoyers
Date: Thu Sep 28 2023 - 18:12:07 EST


On 9/28/23 16:54, Thomas Gleixner wrote:
> On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
>> +/*
>> + * rseq_sched_state should be aligned on the cache line size.
>> + */
>> +struct rseq_sched_state {
>> +	/*
>> +	 * Version of this structure. Populated by the kernel, read by
>> +	 * user-space.
>> +	 */
>> +	__u32 version;
>> +	/*
>> +	 * The state is updated by the kernel. Read by user-space with
>> +	 * single-copy atomicity semantics. This field can be read by any
>> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
>> +	 * rseq_sched_state_flags. This field is provided as a hint by the
>> +	 * scheduler, and requires that the page holding this state is
>> +	 * faulted-in for the state update to be performed by the scheduler.
>> +	 */
>> +	__u32 state;
>> +	/*
>> +	 * Thread ID associated with the thread registering this structure.
>> +	 * Initialized by user-space before registration.
>> +	 */
>> +	__u32 tid;
>> +};
>> +
>>  /*
>>   * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>>   * contained within a single cache-line.
>> @@ -148,6 +180,15 @@ struct rseq {
>>  	 */
>>  	__u32 mm_cid;
>> +	__u32 padding1;
>> +
>> +	/*
>> +	 * Restartable sequences sched_state_ptr field. Initialized by
>> +	 * userspace to the address at which the struct rseq_sched_state is
>> +	 * located. Read by the kernel on rseq registration.
>> +	 */
>> +	__u64 sched_state_ptr;
>> +
>
> Why is this a separate entity instead of being simply embedded into
> struct rseq?
>
> Neither the code comment nor the changelog tells anything about that.

Here is the email thread from v1 that led to this:

https://lore.kernel.org/lkml/ZGevZxOjJLMO9zlM@boqun-archlinux/

The reason for moving this sched_state to its own structure was to optimize for a scenario where we have:

- many threads contending for the lock, thus loading the value of the struct rseq "sched_state".
- the lock owner thread doing rseq critical sections with the lock held, thus updating the value of struct rseq "rseq_cs" field (in the same cache line).

The loads of the sched_state from other threads would slow down (even slightly) the update to the rseq_cs field, thus causing the critical sections to take a few more cycles.

I am not convinced that the complexity vs. speed tradeoff of moving this into its own separate structure is worth it, though, especially given that it would require userspace to wire things up with an additional side-structure rather than just extending naturally with the extensible-rseq ABI. Another approach would be to cacheline-align this field within struct rseq and waste space to padding.

I am inclined to just embed this into struct rseq without caring too much about slight overhead caused by sharing the cache line with other fields.
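To illustrate the cacheline-align-and-pad alternative, here is a rough user-space sketch. The field names mimic struct rseq but this is NOT the actual ABI, and the embedded sched_state field is hypothetical; the point is only that an alignment attribute pushes the scheduler-updated word onto its own cache line, so contending threads' loads of sched_state don't share a line with the owner's frequent stores to rseq_cs:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only, not the real uapi struct rseq. */
struct rseq_embedded_example {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* stored to by the lock owner in critical sections */
	uint32_t flags;
	uint32_t node_id;
	uint32_t mm_cid;
	/*
	 * Start a new (assumed 64-byte) cache line for the
	 * scheduler-updated state, read by contending threads deciding
	 * whether to spin or block.
	 */
	_Alignas(64) uint32_t sched_state;
};

/*
 * The wasted space mentioned above: everything between mm_cid and the
 * 64-byte boundary is padding.
 */
_Static_assert(offsetof(struct rseq_embedded_example, sched_state) == 64,
	       "sched_state starts a new cache line");
```

The cost is the padding (the struct rounds up to two cache lines), but registration stays a single structure with no side-pointer for the kernel to follow.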


> If your intention was to provide a solution for process shared futexes
> then you completely failed to understand how process shared futexes
> work. If not, then please explain why you need a pointer and the
> associated hackery to deal with it.

I have not given a single thought to shared futexes in this PoC so far. :)

So let's see: to support shared futexes, we could move the information about the owner's struct rseq (it could be the thread pointer of the owner thread) to a field separate from the "lock" state (which is ABI, as you pointed out). Grabbing the lock would then be done with an atomic instruction, and setting the extra lock-owner information would be done with a simple store.

Given that the lock owner information is only used to statistically decide whether other threads should spin or block when contended, we don't really care if this information is set a few instructions after acquiring the lock.
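To make that concrete, a minimal user-space sketch of the idea (names and layout are hypothetical, not an ABI proposal): the lock word keeps a plain futex-compatible state, and the owner hint lives in a separate field written with a relaxed store after the atomic acquire, so it may lag the lock word by a few instructions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical process-shared lock with a separate owner hint. */
struct hinted_lock {
	_Atomic uint32_t word;		/* futex-compatible lock state: 0 free, 1 held */
	_Atomic uint32_t owner_tid;	/* hint only; may briefly lag the lock word */
};

bool try_acquire(struct hinted_lock *l, uint32_t self_tid)
{
	uint32_t expected = 0;

	/* Grab the lock with a single atomic instruction on the lock word. */
	if (!atomic_compare_exchange_strong_explicit(&l->word, &expected, 1,
						     memory_order_acquire,
						     memory_order_relaxed))
		return false;
	/*
	 * The owner hint is advisory: waiters only use it to decide
	 * whether to spin or block, so a plain relaxed store after the
	 * acquire is fine even if it becomes visible slightly later.
	 */
	atomic_store_explicit(&l->owner_tid, self_tid, memory_order_relaxed);
	return true;
}

void release_lock(struct hinted_lock *l)
{
	atomic_store_explicit(&l->owner_tid, 0, memory_order_relaxed);
	atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

A waiter that observes a stale or zero owner_tid simply falls back to blocking, which is safe because the hint never affects correctness of the lock word itself.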

Thoughts?

Thanks for the feedback,

Mathieu


> Thanks,
>
>         tglx



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com