Re: [PATCH] workqueue: don't skip lockdep wq dependency in cancel_work_sync()

From: Lai Jiangshan
Date: Thu Jul 28 2022 - 22:38:53 EST


On Thu, Jul 28, 2022 at 8:23 PM Tetsuo Handa
<penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Like Hillf Danton mentioned
>
> syzbot should have been able to catch cancel_work_sync() in work context
> by checking lockdep_map in __flush_work() for both flush and cancel.
>
> in [1], being unable to report an obvious deadlock scenario shown below is
> broken. From locking dependency perspective, sync version of cancel request
> should behave as if flush request, for it waits for completion of work if
> that work has already started execution.
>
> ----------
> #include <linux/module.h>
> #include <linux/sched.h>
> static DEFINE_MUTEX(mutex);
> static void work_fn(struct work_struct *work)
> {
> schedule_timeout_uninterruptible(HZ / 5);
> mutex_lock(&mutex);
> mutex_unlock(&mutex);
> }
> static DECLARE_WORK(work, work_fn);
> static int __init test_init(void)
> {
> schedule_work(&work);
> schedule_timeout_uninterruptible(HZ / 10);
> mutex_lock(&mutex);
> cancel_work_sync(&work);
> mutex_unlock(&mutex);
> return -EINVAL;
> }
> module_init(test_init);
> MODULE_LICENSE("GPL");
> ----------
>
> Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@xxxxxxxx [1]
> Reported-by: Hillf Danton <hdanton@xxxxxxxx>
> Fixes: d6e89786bed977f3 ("workqueue: skip lockdep wq dependency in cancel_work_sync()")
> Cc: Johannes Berg <johannes.berg@xxxxxxxxx>
> Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/workqueue.c | 45 ++++++++++++++++++---------------------------
> 1 file changed, 18 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..e6df688f84db 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3000,8 +3000,7 @@ void drain_workqueue(struct workqueue_struct *wq)
> }
> EXPORT_SYMBOL_GPL(drain_workqueue);
>
> -static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> - bool from_cancel)
> +static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr)
> {
> struct worker *worker = NULL;
> struct worker_pool *pool;
> @@ -3043,8 +3042,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> * workqueues the deadlock happens when the rescuer stalls, blocking
> * forward progress.
> */
> - if (!from_cancel &&
> - (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) {
> + if (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer) {
> lock_map_acquire(&pwq->wq->lockdep_map);
> lock_map_release(&pwq->wq->lockdep_map);
> }
> @@ -3056,7 +3054,18 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
> return false;
> }
>
> -static bool __flush_work(struct work_struct *work, bool from_cancel)
> +/**
> + * flush_work - wait for a work to finish executing the last queueing instance
> + * @work: the work to flush
> + *
> + * Wait until @work has finished execution. @work is guaranteed to be idle
> + * on return if it hasn't been requeued since flush started.
> + *
> + * Return:
> + * %true if flush_work() waited for the work to finish execution,
> + * %false if it was already idle.
> + */
> +bool flush_work(struct work_struct *work)
> {
> struct wq_barrier barr;
>
> @@ -3066,12 +3075,10 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> if (WARN_ON(!work->func))
> return false;
>
> - if (!from_cancel) {
> - lock_map_acquire(&work->lockdep_map);
> - lock_map_release(&work->lockdep_map);
> - }
> + lock_map_acquire(&work->lockdep_map);
> + lock_map_release(&work->lockdep_map);


IIUC, changing just these 5 lines of code (-3, +2) is enough
to fix the problem described in the changelog.

If so, could you make a minimal patch?

I believe the problem that commit d6e89786bed977f3 ("workqueue: skip
lockdep wq dependency in cancel_work_sync()") fixes is real, so it is
not a good idea to revert it.

P.S.

Commit fd1a5b04dfb8 ("workqueue: Remove now redundant lock
acquisitions wrt. workqueue flushes") removed this lockdep check.

Commit 87915adc3f0a ("workqueue: re-add lockdep
dependencies for flushing") then added it back, but only for the
non-canceling cases.

It seems commit fd1a5b04dfb8 is the culprit, and 87915adc3f0a
didn't fix all of the problems it introduced.

So it is better to complete 87915adc3f0a by making __flush_work()
do lock_map_acquire(&work->lockdep_map) for both the canceling and
non-canceling cases.
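
To illustrate why the work->lockdep_map annotation matters on the cancel
path, here is a toy userspace model of the reproducer in the changelog.
It is only a sketch and not kernel code: every model_* name and both
LOCK_* constants are made up for illustration, and the "cycle check" is
reduced to the two lock classes involved. The worker side stands for the
work's map being held while work_fn() takes the mutex; the canceller
side stands for cancel_work_sync() being called with the mutex held.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for the model: the test mutex and the work's map. */
enum { LOCK_MUTEX, LOCK_WORK_MAP, NR_LOCKS };

/* edge[a][b] is set when lock b was acquired while lock a was held. */
static bool edge[NR_LOCKS][NR_LOCKS];
static bool held[NR_LOCKS];

static void model_reset(void)
{
	memset(edge, 0, sizeof(edge));
	memset(held, 0, sizeof(held));
}

/* Record a dependency from every currently held lock to the new one. */
static void model_acquire(int lock)
{
	for (int i = 0; i < NR_LOCKS; i++)
		if (held[i])
			edge[i][lock] = true;
	held[lock] = true;
}

static void model_release(int lock)
{
	held[lock] = false;
}

/* With only two lock classes, a cycle is an edge in both directions. */
static bool model_has_cycle(void)
{
	return edge[LOCK_MUTEX][LOCK_WORK_MAP] &&
	       edge[LOCK_WORK_MAP][LOCK_MUTEX];
}

/* Worker side: the work's map is held while work_fn() takes the mutex. */
static void model_run_work(void)
{
	model_acquire(LOCK_WORK_MAP);
	model_acquire(LOCK_MUTEX);
	model_release(LOCK_MUTEX);
	model_release(LOCK_WORK_MAP);
}

/*
 * Canceller side: cancel_work_sync() called under the mutex.
 * annotate_on_cancel == false models the "if (!from_cancel)" skip.
 */
static void model_cancel_sync_under_mutex(bool annotate_on_cancel)
{
	model_acquire(LOCK_MUTEX);
	if (annotate_on_cancel) {
		model_acquire(LOCK_WORK_MAP);
		model_release(LOCK_WORK_MAP);
	}
	model_release(LOCK_MUTEX);
}

/* Returns true if the model would report the mutex <-> work dependency cycle. */
static bool model_detects_deadlock(bool annotate_on_cancel)
{
	model_reset();
	model_run_work();
	model_cancel_sync_under_mutex(annotate_on_cancel);
	return model_has_cycle();
}
```

In this model, model_detects_deadlock(false) finds no cycle because the
mutex -> work edge is never recorded, while model_detects_deadlock(true)
does find it. That is the gap the work->lockdep_map change above closes,
without touching the wq->lockdep_map handling from d6e89786bed977f3.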