[RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable

From: Ingo Molnar
Date: Mon Dec 09 2019 - 04:18:20 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> The reason it is buggy is that wait_event_interruptible_exclusive()
> does this (inside the __wait_event() macro that it expands to):
>
> long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);
>
> if (condition)
> break;
> if (___wait_is_interruptible(state) && __int) {
> __ret = __int;
> goto __out;
>
> and the thing is, if does that "__ret = __int" case and returns
> -ERESTARTSYS, it's possible that the wakeup event has already been
> consumed, because we've added ourselves as an exclusive writer to the
> queue. So it _says_ it was interrupted, not woken up, and the wait got
> cancelled, but because we were an exclusive waiter, we might be the
> _only_ thing that got woken up, and the wakeup basically got forgotten
> - all the other exclusive waiters will remain waiting.

So the place that detects interruption is prepare_to_wait_event():

long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
unsigned long flags;
long ret = 0;

spin_lock_irqsave(&wq_head->lock, flags);
if (signal_pending_state(state, current)) {
/*
* Exclusive waiter must not fail if it was selected by wakeup,
* it should "consume" the condition we were waiting for.
*
* The caller will recheck the condition and return success if
* we were already woken up, we can not miss the event because
* wakeup locks/unlocks the same wq_head->lock.
*
* But we need to ensure that set-condition + wakeup after that
* can't see us, it should wake up another exclusive waiter if
* we fail.
*/
list_del_init(&wq_entry->entry);
ret = -ERESTARTSYS;
} else {
if (list_empty(&wq_entry->entry)) {
if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
__add_wait_queue_entry_tail(wq_head, wq_entry);
else
__add_wait_queue(wq_head, wq_entry);
}
set_current_state(state);
}
spin_unlock_irqrestore(&wq_head->lock, flags);

return ret;
}

I think we can indeed lose an exclusive event here, despite the comment
that argues that we shouldn't: if we were already removed from the list
then list_del_init() does nothing and loses the exclusive event AFAICS.

This logic was introduced in this commit 3 years ago:

b1ea06a90f52: ("sched/wait: Avoid abort_exclusive_wait() in ___wait_event()")
eaf9ef52241b: ("sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()")

Before that commit we simply did this:

long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
unsigned long flags;
-
- if (signal_pending_state(state, current))
- return -ERESTARTSYS;

Which was safe in the sense that it didn't touch the waitqueue in case of
interruption. Then it used abort_exclusive_wait(), which got removed in
eaf9ef52241b, to pass on any leftover wakeups:

-void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, void *key)
-{
- unsigned long flags;
-
- __set_current_state(TASK_RUNNING);
- spin_lock_irqsave(&q->lock, flags);
- if (!list_empty(&wait->task_list))
- list_del_init(&wait->task_list);
- else if (waitqueue_active(q))
- __wake_up_locked_key(q, TASK_NORMAL, key);
- spin_unlock_irqrestore(&q->lock, flags);
-}
-EXPORT_SYMBOL(abort_exclusive_wait);

Note how the wakeup is passed along to another exclusive waiter (if any)
via the __wake_up_locked_key() call.

Oleg?

> I dunno. Maybe this is fundamental to the interface? We do not have a
> lot of users that mix "interruptible" and "exclusive". In fact, I see
> only two cases that care about the return value, but at least the fuse
> use does seem to have exactly this problem with potentially lost
> wakeups because the ERESTARTSYS case ends up eating a wakeup without
> doing anything about it.

I don't think it's fundamental to the interface.

In fact I'd argue that it's fundamental to the interface to *not* lose
exclusive events.

> Looks like the other user - the USB gadget HID thing - also has this
> buggy pattern of believing the return value, and losing a wakeup
> event.
>
> Adding Miklos and Felipe to the cc just because of the fuse and USB
> gadget potential issues, but this is mainly a scheduler maintainer
> question.
>
> It's possible that I've misread the wait-event code. PeterZ?

I think your analysis is correct, and I think the statistical evidence is
also overwhelming that the interface is buggy: if we include your new
usecase then 3 out of 3 attempts got it wrong. :-)

So I'd argue we should fix the interface and allow the 'simple' use of
reliable interruptible-exclusive waitqueues - i.e. reintroduce the
abort_exclusive_wait() logic?

Thanks,

Ingo