Re: [RFC patch 4/7] futex: Add support for attached futexes

From: Ingo Molnar
Date: Sun Apr 03 2016 - 07:17:03 EST



* Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.
>
> To utilize this mechanism each thread has to attach to the futex before any
> other operations on that futex.
>
> The inner workings are as follows:
>
> Attach:
>
> sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);
>
> If this is the first attach to uaddr then a 'global state' object is
> created. This global state contains a futex hash bucket and a futex_q
> object which is enqueued into the global hash for reference so subsequent
> attachers can find it. Each attacher takes a reference count on the
> 'global state' object and hashes 'uaddr' into a thread local hash. This
> thread local hash is lock free and dynamically expanded to avoid
> collisions. Each populated entry in the thread local hash stores 'uaddr'
> and a pointer to the 'global state' object.
>
> Futex ops:
>
> sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);
>
> If the attached flag is set, then 'uaddr' is hashed and the thread local
> hash is checked whether the hash entry contains 'uaddr'. If no, an error
> code is returned. If yes, the hash slot number is stored in the futex key
> which is used for further operations on the futex. When the hash bucket is
> looked up then attached futexes will use the slot number to retrieve the
> pointer to the 'global state' object and use the embedded hash bucket for
> the operation. Non-attached futexes just use the global hash as before.
>
> Detach:
>
> sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);
>
> Detach removes the entry in the thread local hash and decrements the
> refcount on the 'global state' object. Once the refcount drops to zero the
> 'global state' object is removed from the global hash and destroyed.
>
> Thread exit cleans up the thread local hash and the 'global state' objects
> as we do for other futex related storage already.
>
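
To make the attach / lookup / detach flow quoted above concrete, here is a
minimal userspace model in C. It is a sketch only, not the patch's code: every
model_* name is hypothetical, the thread local table is fixed size rather than
dynamically expanded, a linked list stands in for the global hash, there is no
locking, and treating a double attach as an error is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TL_SLOTS 16  /* assumption: fixed size; the patch expands dynamically */

/* Hypothetical stand-in for the 'global state' object. */
struct model_global_state {
	const void *uaddr;
	int refcount;
	struct model_global_state *next;  /* models the global hash linkage */
};

struct model_tl_entry {
	const void *uaddr;
	struct model_global_state *gs;
};

struct model_tl_hash {
	struct model_tl_entry slot[TL_SLOTS];
};

static struct model_global_state *model_global_list;

static size_t model_hash(const void *uaddr)
{
	return (size_t)(((uintptr_t)uaddr >> 2) % TL_SLOTS);
}

/* Futex op path: return the slot number for uaddr, or -1 if not attached. */
static int model_lookup(const struct model_tl_hash *h, const void *uaddr)
{
	size_t start, i;

	if (!uaddr)
		return -1;
	start = model_hash(uaddr);
	for (i = 0; i < TL_SLOTS; i++) {
		size_t s = (start + i) % TL_SLOTS;
		if (h->slot[s].uaddr == uaddr)
			return (int)s;
	}
	return -1;
}

/* Attach: find or create the global state, take a reference, fill a slot. */
static int model_attach(struct model_tl_hash *h, const void *uaddr)
{
	struct model_global_state *gs;
	size_t start, i;
	int free_slot = -1;

	if (!uaddr || model_lookup(h, uaddr) >= 0)
		return -1;  /* assumption: double attach is an error */
	start = model_hash(uaddr);
	for (i = 0; i < TL_SLOTS && free_slot < 0; i++) {
		size_t s = (start + i) % TL_SLOTS;
		if (!h->slot[s].uaddr)
			free_slot = (int)s;
	}
	if (free_slot < 0)
		return -1;  /* table full in this fixed-size model */
	for (gs = model_global_list; gs; gs = gs->next)
		if (gs->uaddr == uaddr)
			break;
	if (!gs) {  /* first attacher creates the global state */
		gs = calloc(1, sizeof(*gs));
		if (!gs)
			return -1;
		gs->uaddr = uaddr;
		gs->next = model_global_list;
		model_global_list = gs;
	}
	gs->refcount++;
	h->slot[free_slot].uaddr = uaddr;
	h->slot[free_slot].gs = gs;
	return free_slot;
}

/* Detach: clear the slot, drop the reference, destroy on the last put. */
static int model_detach(struct model_tl_hash *h, const void *uaddr)
{
	struct model_global_state *gs, **pp;
	int s = model_lookup(h, uaddr);

	if (s < 0)
		return -1;
	gs = h->slot[s].gs;
	h->slot[s].uaddr = NULL;
	h->slot[s].gs = NULL;
	assert(gs->refcount > 0);
	if (--gs->refcount == 0) {
		for (pp = &model_global_list; *pp != gs; pp = &(*pp)->next)
			;
		*pp = gs->next;
		free(gs);
	}
	return 0;
}
```

Two zero-initialized model_tl_hash tables attaching to the same address end up
sharing one global state object with a refcount of 2, and the object is freed
once both have detached.
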
> The thread local hash and the 'global state' object are allocated on the node
> on which the attaching thread runs.
>
> Attached mode works with all futex operations and with both private and shared
> futexes. For operations which involve two futexes, i.e. FUTEX_REQUEUE_* both
> futexes have to be either attached or detached (like FUTEX_PRIVATE).
>
> Why not auto attaching?
>
> Auto attaching has the following problems:
>
> - Memory consumption
> - Life time issues
> - Performance issues due to the necessary allocations

But those are mostly setup-only costs, right?

So I don't think this conclusion is necessarily true, even on smaller systems:

> So, no. It must be opt-in and reserved for explicit isolation purposes.
>
> A modified version of 'perf bench futex hash' shows the following results:

and look at the very measurable performance advantages on a small NUMA system:

Before:

> Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

After:

> Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

> That's a performance increase of 18%.

... and I suspect that on a larger NUMA system the speedup is probably a lot more
pronounced.
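
For reference, the 18% figure follows directly from the two averages quoted
above. A trivial helper (the name speedup_percent is just for illustration)
shows the arithmetic:

```c
/* Relative change from 'before' to 'after', in percent. */
static double speedup_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

speedup_percent(1451441.0, 1709712.0) comes out at roughly 17.8, which rounds
to the quoted 18%.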

Also, the thing is, allocation/deallocation costs are a second order concern IMHO,
because most futex usage consists of lock/unlock operations.

So my prediction: in real life large systems will want to have collision-free
futexes most of the time, and they don't want to modify every futex-using
application or library. So this is mostly a kernel-side system sizing
question/decision, not really a user-side policy question about the system's
purpose.

So an ABI distinction that offloads the decision to every single application that
wants to use it, and hardcodes it into actual application source code via an ABI, is
pretty much the _WORST_ way to go about it IMHO...

So how about this: don't add any ABI details, but make futexes auto-attached on
NUMA systems (and obviously PREEMPT_RT systems)?

I.e. make it a build time or boot time decision at most, don't start messy
'should we use attached futexes or not' decisions on the ABI side, which we know
from Linux ABI history won't be answered and utilized very well by applications!
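
Roughly, the decision could look like the sketch below. This is purely
illustrative: the policy enum, the function name and the idea of a tri-state
boot setting are assumptions made for the example, not existing kernel
interfaces.

```c
#include <stdbool.h>

/* Hypothetical boot-time setting; not an existing kernel parameter. */
enum model_attach_policy {
	MODEL_ATTACH_AUTO,  /* decide from system properties */
	MODEL_ATTACH_ON,    /* always auto-attach */
	MODEL_ATTACH_OFF,   /* never auto-attach */
};

/* Decide once, kernel side, whether futexes get attached automatically. */
static bool model_auto_attach(enum model_attach_policy policy,
			      int nr_numa_nodes, bool preempt_rt)
{
	if (policy == MODEL_ATTACH_ON)
		return true;
	if (policy == MODEL_ATTACH_OFF)
		return false;
	/* AUTO: NUMA systems and PREEMPT_RT systems, as suggested above. */
	return preempt_rt || nr_numa_nodes > 1;
}
```

The point is that applications never see this choice; it stays a system
sizing decision.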

Thanks,

Ingo