Re: [RFC patch 4/7] futex: Add support for attached futexes

From: Thomas Gleixner
Date: Sun Apr 03 2016 - 06:07:03 EST


On Sun, 3 Apr 2016, Rasmus Villemoes wrote:
> On Sat, Apr 02 2016, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > and on real-time enabled kernels even to priority inversions.
> >
> > To guarantee futexes without collisions on the global kernel hash, we provide
> > a mechanism to attach to a futex. This creates futex private state which
> > avoids hash collisions and on NUMA systems also cross node memory access.
>
> Hi,
>
> A few minor comments inline below, and a question about the design:
>
> How is an application supposed to handle it when the kernel fails to
> achieve the no collision-goal? With any reasonable upper bound on the
> size of the local hash table (which of course has to be there, whether
> sysctl'ed or not), and regardless of the hashing scheme used, it
> seems inevitable that someone is going to get -ENOSPC when trying to
> attach. Moreover, since different threads can attach to different sets
> of futexes, one thread may succesfully attach to a futex, while another
> fails - the second thread is then permanently prevented from operating
> on that futex (?).

Yes. There is not much we can do about that except adding it to the
documentation.

> Why not use some sort of tree instead? Or fall back to a traditional
> chained hash table once we're no longer allowed to increase the table
> size? Of course these have worse lookup performance, and maybe failing
> the attach in the rare case is better than penalizing the common case,
> but it would be nice to have some mention of this in the change log.

The lookup performance is critical for the futex ops (wait/wake ....)

> Alternatively [this is not really thought through], maybe one could move
> the decision and the complexity to userspace: On succesful FUTEX_ATTACH,
> return an index into a small per-task array of struct futex_state*. On
> subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
> this index somehow (either instead of uaddr, in which case the kernel
> array would have to include this in addition to the futex_state pointer,
> or by making uaddr actually point to a struct { int *futex_addr; int
> attach_idx; }, or...) Then each thread would have to maintain a (futex
> address => index) mapping, but that's more or less what the kernel
> otherwise has to do.

Right. We tried this as a first attempt. That moves the complexity of hashing
futex to index per thread to user space. It can be done simple enough, though
the resulting

> > + case 4096: return 4093;
> > + }
> > +}
> > +
>
> There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
> so that anyone updating those would also look here, so we don't end up
> using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
> != {16,4096}) should do it.

As Peter pointed out there is hash_ptr() already so this will go away.

> > + slot = find_first_bit(map, size);
> > + for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> > + newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> > + /*
> > + * Paranoia. Rehashing to a larger cache should not result in
> > + * collisions which did not exist in the small one.
> > + */
>
> This doesn't sound right when you're doing mod prime hashing; two
> numbers can easily have the same remainder mod 31 while having distinct
> remainders mod 13.

I know, but as far as my extensive testing went I never hit that case.

> > +
> > + /* Populate the new cache and return the slot number */
> > + current->futex_cache = tcnew;
> > + return slot;
> > +}
>
> I may be misreading it, but this seems to leak the old ->futex_cache?

Duh, you are right.

Thanks,

tglx