Re: [PATCH wq/for-4.0-fixes v2] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

From: Tomeu Vizoso
Date: Tue Mar 03 2015 - 08:57:45 EST


On 3 March 2015 at 14:45, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello,
>
> Found the culprit. Plain wake_up() shouldn't be used on
> bit_waitqueues. The following patch should fix the issue. I replaced
> the patch in the wq branches.

Yup, this looks good from here.

Thanks,

Tomeu

> Thanks a lot.
>
> ----- 8< -----
> From bba5a377507efef69214108c0d3790cadfab21d4 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@xxxxxxxxxx>
> Date: Tue, 3 Mar 2015 08:43:09 -0500
>
> cancel[_delayed]_work_sync() are implemented using
> __cancel_work_timer() which grabs the PENDING bit using
> try_to_grab_pending() and then flushes the work item with PENDING set
> to prevent the on-going execution of the work item from requeueing
> itself.
>
> try_to_grab_pending() can always grab PENDING bit without blocking
> except when someone else is doing the above flushing during
> cancelation. In that case, try_to_grab_pending() returns -ENOENT. In
> this case, __cancel_work_timer() currently invokes flush_work(). The
> assumption is that the completion of the work item is what the other
> canceling task would be waiting for too and thus waiting for the same
> condition and retrying should allow forward progress without excessive
> busy looping
>
> Unfortunately, this doesn't work if preemption is disabled or the
> latter task has real time priority. Let's say task A just got woken
> up from flush_work() by the completion of the target work item. If,
> before task A starts executing, task B gets scheduled and invokes
> __cancel_work_timer() on the same work item, its try_to_grab_pending()
> will return -ENOENT as the work item is still being canceled by task A
> and flush_work() will also immediately return false as the work item
> is no longer executing. This puts task B in a busy loop possibly
> preventing task A from executing and clearing the canceling state on
> the work item leading to a hang.
>
> task A task B worker
>
> executing work
> __cancel_work_timer()
> try_to_grab_pending()
> set work CANCELING
> flush_work()
> block for work completion
> completion, wakes up A
> __cancel_work_timer()
> while (forever) {
> try_to_grab_pending()
> -ENOENT as work is being canceled
> flush_work()
> false as work is no longer executing
> }
>
> This patch removes the possible hang by updating __cancel_work_timer()
> to explicitly wait for clearing of CANCELING rather than invoking
> flush_work() after try_to_grab_pending() fails with -ENOENT. The
> explicit wait uses the matching bit waitqueue for the CANCELING bit.
>
> Link: http://lkml.kernel.org/g/20150206171156.GA8942@xxxxxxxx
>
> v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
> the target bit waitqueue has wait_bit_queue's on it. Use
> DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu
> Vizoso.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-by: Rabin Vincent <rabin.vincent@xxxxxxxx>
> Cc: Tomeu Vizoso <tomeu.vizoso@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Tested-by: Jesper Nilsson <jesper.nilsson@xxxxxxxx>
> Tested-by: Rabin Vincent <rabin.vincent@xxxxxxxx>
> ---
> include/linux/workqueue.h | 3 ++-
> kernel/workqueue.c | 37 +++++++++++++++++++++++++++++++++----
> 2 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 74db135..f597846 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -70,7 +70,8 @@ enum {
> /* data contains off-queue information when !WORK_STRUCT_PWQ */
> WORK_OFFQ_FLAG_BASE = WORK_STRUCT_COLOR_SHIFT,
>
> - WORK_OFFQ_CANCELING = (1 << WORK_OFFQ_FLAG_BASE),
> + __WORK_OFFQ_CANCELING = WORK_OFFQ_FLAG_BASE,
> + WORK_OFFQ_CANCELING = (1 << __WORK_OFFQ_CANCELING),
>
> /*
> * When a work item is off queue, its high bits point to the last
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f288493..cfbae1d 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2730,17 +2730,37 @@ EXPORT_SYMBOL_GPL(flush_work);
>
> static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
> {
> + wait_queue_head_t *waitq = bit_waitqueue(&work->data,
> + __WORK_OFFQ_CANCELING);
> unsigned long flags;
> int ret;
>
> do {
> ret = try_to_grab_pending(work, is_dwork, &flags);
> /*
> - * If someone else is canceling, wait for the same event it
> - * would be waiting for before retrying.
> + * If someone else is already canceling, wait for it to
> + * finish. flush_work() doesn't work for PREEMPT_NONE
> + * because we may get scheduled between @work's completion
> + * and the other canceling task resuming and clearing
> + * CANCELING - flush_work() will return false immediately
> + * as @work is no longer busy, try_to_grab_pending() will
> + * return -ENOENT as @work is still being canceled and the
> + * other canceling task won't be able to clear CANCELING as
> + * we're hogging the CPU.
> + *
> + * Explicitly wait for completion using a bit waitqueue.
> + * We can't use wait_on_bit() as the CANCELING bit may get
> + * recycled to point to pwq if @work gets re-queued.
> */
> - if (unlikely(ret == -ENOENT))
> - flush_work(work);
> + if (unlikely(ret == -ENOENT)) {
> + DEFINE_WAIT_BIT(wait, &work->data,
> + __WORK_OFFQ_CANCELING);
> + prepare_to_wait(waitq, &wait.wait,
> + TASK_UNINTERRUPTIBLE);
> + if (work_is_canceling(work))
> + schedule();
> + finish_wait(waitq, &wait.wait);
> + }
> } while (unlikely(ret < 0));
>
> /* tell other tasks trying to grab @work to back off */
> @@ -2749,6 +2769,15 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>
> flush_work(work);
> clear_work_data(work);
> +
> + /*
> + * Paired with prepare_to_wait() above so that either
> + * __wake_up_bit() sees busy waitq here or !work_is_canceling() is
> + * visible there.
> + */
> + smp_mb();
> + __wake_up_bit(waitq, &work->data, __WORK_OFFQ_CANCELING);
> +
> return ret;
> }
>
> --
> 2.1.0
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/