Re: [PATCH] workqueue: Fix spurious sanity check failures in destroy_workqueue()

From: Lai Jiangshan
Date: Wed Sep 18 2019 - 22:49:18 EST


Looks good to me.

There is one test in show_pwq()
"""
worker == pwq->wq->rescuer ? "(RESCUER)" : "",
"""
I'm wondering if it needs to be updated to
"""
worker->rescue_wq ? "(RESCUER)" : "",
"""

And document "/* MD: rescue worker */" might be better
than current "/* I: rescue worker */", although ->rescuer can
be accessed without wq_mayday_lock lock in some code.

Reviewed-by: Lai Jiangshan <jiangshanlai@xxxxxxxxx>



On Thu, Sep 19, 2019 at 9:43 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Before actually destrying a workqueue, destroy_workqueue() checks
> whether it's actually idle. If it isn't, it prints out a bunch of
> warning messages and leaves the workqueue dangling. It unfortunately
> has a couple issues.
>
> * Mayday list queueing increments pwq's refcnts which gets detected as
> busy and fails the sanity checks. However, because mayday list
> queueing is asynchronous, this condition can happen without any
> actual work items left in the workqueue.
>
> * Sanity check failure leaves the sysfs interface behind too which can
> lead to init failure of newer instances of the workqueue.
>
> This patch fixes the above two by
>
> * If a workqueue has a rescuer, disable and kill the rescuer before
> sanity checks. Disabling and killing is guaranteed to flush the
> existing mayday list.
>
> * Remove sysfs interface before sanity checks.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Reported-by: Marcin Pawlowski <mpawlowski@xxxxxx>
> Reported-by: "Williams, Gerald S" <gerald.s.williams@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> ---
> Applying to wq/for-5.4.
>
> Thanks.
>
> kernel/workqueue.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 95aea04ff722..73e3ea9e31d3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4318,9 +4318,28 @@ void destroy_workqueue(struct workqueue_struct *wq)
> struct pool_workqueue *pwq;
> int node;
>
> + /*
> + * Remove it from sysfs first so that sanity check failure doesn't
> + * lead to sysfs name conflicts.
> + */
> + workqueue_sysfs_unregister(wq);
> +
> /* drain it before proceeding with destruction */
> drain_workqueue(wq);
>
> + /* kill rescuer, if sanity checks fail, leave it w/o rescuer */
> + if (wq->rescuer) {
> + struct worker *rescuer = wq->rescuer;
> +
> + /* this prevents new queueing */
> + spin_lock_irq(&wq_mayday_lock);
> + wq->rescuer = NULL;
> + spin_unlock_irq(&wq_mayday_lock);
> +
> + /* rescuer will empty maydays list before exiting */
> + kthread_stop(rescuer->task);
> + }
> +
> /* sanity checks */
> mutex_lock(&wq->mutex);
> for_each_pwq(pwq, wq) {
> @@ -4352,11 +4371,6 @@ void destroy_workqueue(struct workqueue_struct *wq)
> list_del_rcu(&wq->list);
> mutex_unlock(&wq_pool_mutex);
>
> - workqueue_sysfs_unregister(wq);
> -
> - if (wq->rescuer)
> - kthread_stop(wq->rescuer->task);
> -
> if (!(wq->flags & WQ_UNBOUND)) {
> wq_unregister_lockdep(wq);
> /*
> --
> 2.17.1
>