Re: [PATCH] signal: Allow tasks to cache one sigqueue struct (again).

From: Linus Torvalds
Date: Tue Nov 29 2022 - 13:23:13 EST


On Tue, Nov 29, 2022 at 7:15 AM Sebastian Andrzej Siewior
<bigeasy@xxxxxxxxxxxxx> wrote:
>
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> The idea for this originates from the real time tree to make signal
> delivery for realtime applications more efficient.

NAK.

This was attempted once. It had unexplained problems, and very unclear locking.

And this new version is THE SAME CRAZY STUFF. That whole
sigqueue_cache_or_free() hasn't been improved, and is just a thousand
monkeys on crack typing wildly on a keyboard.

This updated patch seems to *attempt* to do everything under the
sighand->siglock, and several of the comments point to it.

But - exactly like the first version - that isn't actually always
true, so it then does *other* random tests instead.

And in at least one case - sigqueue_free() - this code *literally* does that

spin_unlock_irqrestore(lock, flags);

if (q)
__sigqueue_cache_or_free(q);

where that "spin_unlock_irqrestore()" just released 'sighand->siglock'
which is what allegedly is protecting this all.

In other words, STOP DOING THIS.

Insanity is doing the same thing over and over again, expecting
different results. This is insanity.

We have a couple of alternatives here:

(a) do it *properly* and obviously under that sighand->lock in EVERY
SINGLE PATH.

None of these games with random code paths either having the lock
conditionally (the 'sigqueue_flags' in __sigqueue_alloc()).

And none of these "I just dropped the lock, and then free the cache"

(b) alternatively, do no locking at all, and depend on 'current'
being single-threaded, and _only_ do it in task-synchronous contexts
(so no interrupts)

(c) use proper atomics that don't *need* locking (ie "xchg(p,NULL)"
kind of things)

but no, we do not repeat the mistake of mixing (a) and (b) with random code.

And exactly because we've already done this once - badly - I want the
new code to be *obviously* correct.

That very much means "don't try to revive a patch that was already
reverted because it was broken and didn't make sense". Throw this
patch AWAY. It's not worth reviving. Start from scratch, with that
"this needs to be _obviously_ correct" as the starting point.

Honestly, I suspect the proper approach is:

- absolutely *no* "test if task is dead". Free the last left-over
cached entry as you free the 'struct task_struct".

Stop doing these "test random other unrelated flags".

If you cannot based it on just the cache field and the lock, DON'T
DO IT AT ALL.

- absolutely no "test the 'sigqueue_flags' argument".

Again, stop doing these "test random flags".

Either only do the "do I have a cached entry" in the path that
unambiguously already holds the lock!

Or do a cache-specific lock (ie "use atomics") so that it doesn't
*matter* whether the lock is held.

- make every step *so* obvious that you don't even need a comment.

And then add the comment about it anyway.

Ok? Because I never want to see this broken approach again.

Linus