Re: [PATCH v2 1/1] workqueue: fix data race with the pwq->stats[] increment

From: Mirsad Todorovac
Date: Wed Aug 30 2023 - 16:17:02 EST


On 8/29/23 21:52, Tejun Heo wrote:
On Sat, Aug 26, 2023 at 04:51:03PM +0200, Mirsad Goran Todorovac wrote:
KCSAN has discovered a data race in kernel/workqueue.c:2598:

[ 1863.554079] ==================================================================
[ 1863.554118] BUG: KCSAN: data-race in process_one_work / process_one_work

[ 1863.554142] write to 0xffff963d99d79998 of 8 bytes by task 5394 on cpu 27:
[ 1863.554154] process_one_work (kernel/workqueue.c:2598)
[ 1863.554166] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
[ 1863.554177] kthread (kernel/kthread.c:389)
[ 1863.554186] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1863.554197] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1863.554213] read to 0xffff963d99d79998 of 8 bytes by task 5450 on cpu 12:
[ 1863.554224] process_one_work (kernel/workqueue.c:2598)
[ 1863.554235] worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2752)
[ 1863.554247] kthread (kernel/kthread.c:389)
[ 1863.554255] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1863.554266] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1863.554280] value changed: 0x0000000000001766 -> 0x000000000000176a

[ 1863.554295] Reported by Kernel Concurrency Sanitizer on:
[ 1863.554303] CPU: 12 PID: 5450 Comm: kworker/u64:1 Tainted: G L 6.5.0-rc6+ #44
[ 1863.554314] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 1863.554322] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[ 1863.554941] ==================================================================

lockdep_invariant_state(true);
→ pwq->stats[PWQ_STAT_STARTED]++;
trace_workqueue_execute_start(work);
worker->current_func(work);

Moving pwq->stats[PWQ_STAT_STARTED]++; before the line

raw_spin_unlock_irq(&pool->lock);

resolves the data race without a performance penalty.
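
For reference, the resulting ordering in process_one_work() would look roughly like this
(a sketch with the surrounding lines abbreviated, not the literal diff):

        /* still under pool->lock, so concurrent workers are serialized */
        pwq->stats[PWQ_STAT_STARTED]++;
        raw_spin_unlock_irq(&pool->lock);
        ...
        lockdep_invariant_state(true);
        trace_workqueue_execute_start(work);
        worker->current_func(work);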

KCSAN detected at least one additional data race:

[ 157.834751] ==================================================================
[ 157.834770] BUG: KCSAN: data-race in process_one_work / process_one_work

[ 157.834793] write to 0xffff9934453f77a0 of 8 bytes by task 468 on cpu 29:
[ 157.834804] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
[ 157.834815] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
[ 157.834826] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
[ 157.834834] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
[ 157.834845] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

[ 157.834859] read to 0xffff9934453f77a0 of 8 bytes by task 214 on cpu 7:
[ 157.834868] process_one_work (/home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2606)
[ 157.834879] worker_thread (/home/marvin/linux/kernel/linux_torvalds/./include/linux/list.h:292 /home/marvin/linux/kernel/linux_torvalds/kernel/workqueue.c:2752)
[ 157.834890] kthread (/home/marvin/linux/kernel/linux_torvalds/kernel/kthread.c:389)
[ 157.834897] ret_from_fork (/home/marvin/linux/kernel/linux_torvalds/arch/x86/kernel/process.c:145)
[ 157.834907] ret_from_fork_asm (/home/marvin/linux/kernel/linux_torvalds/arch/x86/entry/entry_64.S:312)

[ 157.834920] value changed: 0x000000000000052a -> 0x0000000000000532

[ 157.834933] Reported by Kernel Concurrency Sanitizer on:
[ 157.834941] CPU: 7 PID: 214 Comm: kworker/u64:2 Tainted: G L 6.5.0-rc7-kcsan-00169-g81eaf55a60fc #4
[ 157.834951] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 157.834958] Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
[ 157.835567] ==================================================================

in the following code:

trace_workqueue_execute_end(work, worker->current_func);
→ pwq->stats[PWQ_STAT_COMPLETED]++;
lock_map_release(&lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);

which needs to be resolved separately.

Fixes: 725e8ec59c56c ("workqueue: Add pwq->stats[] and a monitoring script")
Cc: Tejun Heo <tj@xxxxxxxxxx>
Suggested-by: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
Link: https://lore.kernel.org/lkml/20230818194448.29672-1-mirsad.todorovac@xxxxxxxxxxxx/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@xxxxxxxxxxxx>

Applied to wq/for-6.6.

Thanks.

Glad it worked out of the box ...

Now, what should we do with the

trace_workqueue_execute_end(work, worker->current_func);
→ pwq->stats[PWQ_STAT_COMPLETED]++;
lock_map_release(&lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);

???
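
One option that comes to mind (only a sketch mirroring the fix above, not a tested patch)
would be to defer the increment until pool->lock is re-acquired further down in
process_one_work():

        trace_workqueue_execute_end(work, worker->current_func);
        lock_map_release(&lockdep_map);
        lock_map_release(&pwq->wq->lockdep_map);
        ...
        raw_spin_lock_irq(&pool->lock);

        /* hypothetical placement: serialized by pool->lock again */
        pwq->stats[PWQ_STAT_COMPLETED]++;

Whether it is acceptable for the counter to lag behind the tracepoint like that is your
call, of course.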

There seem to be eight places where pwq->stats[] is used ...

Some are protected by spinlocks:

-1/8-[SAFE PROTECTED]--------------------------
raw_spin_lock_irq(&pool->lock);

/*
 * Recheck in case unbind_workers() preempted us. We don't
 * want to decrement nr_running after the worker is unbound
 * and nr_running has been reset.
 */
if (worker->flags & WORKER_NOT_RUNNING) {
        raw_spin_unlock_irq(&pool->lock);
        return;
}

pool->nr_running--;
if (need_more_worker(pool)) {
→       worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;
        wake_up_worker(pool);
}
raw_spin_unlock_irq(&pool->lock);
-----------------------------------------------

This one isn't protected, but the next two are:
-2/8-[POSSIBLE DATA RACE]----------------------
if (!pwq)
        return;

→ pwq->stats[PWQ_STAT_CPU_TIME] += TICK_USEC;

if (!wq_cpu_intensive_thresh_us)
        return;
-----------------------------------------------
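
If this counter is considered best-effort statistics, perhaps it would be enough to
annotate the update as intentionally racy (a sketch only; this documents the lossy
read-modify-write for KCSAN rather than making it atomic, and the conflicting accessors
would need the same treatment):

        WRITE_ONCE(pwq->stats[PWQ_STAT_CPU_TIME],
                   READ_ONCE(pwq->stats[PWQ_STAT_CPU_TIME]) + TICK_USEC);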

-3+4/8--[SAFE BETWEEN LOCK AND UNLOCK]----------
raw_spin_lock(&pool->lock);

worker_set_flags(worker, WORKER_CPU_INTENSIVE);
wq_cpu_intensive_report(worker->current_func);
→ pwq->stats[PWQ_STAT_CPU_INTENSIVE]++;

if (need_more_worker(pool)) {
→       pwq->stats[PWQ_STAT_CM_WAKEUP]++;
        wake_up_worker(pool);
}

raw_spin_unlock(&pool->lock);
-------------------------------------------------

This one I am not sure about:
-5/8-[SUSPICIOUS FOR A RACE]---------------------
if (list_empty(&pwq->mayday_node)) {
        /*
         * If @pwq is for an unbound wq, its base ref may be put at
         * any time due to an attribute change. Pin @pwq until the
         * rescuer is done with it.
         */
        get_pwq(pwq);
        list_add_tail(&pwq->mayday_node, &wq->maydays);
        wake_up_process(wq->rescuer->task);
→       pwq->stats[PWQ_STAT_MAYDAY]++;
}
-------------------------------------------------

This one we just solved:

-6/8-[WE JUST SOLVED THIS ONE]-------------------
lockdep_invariant_state(true);
→ pwq->stats[PWQ_STAT_STARTED]++;
trace_workqueue_execute_start(work);
worker->current_func(work);
-------------------------------------------------


-7/8-[A CANDIDATE FOR atomic_inc() ??? ]---------
trace_workqueue_execute_end(work, worker->current_func);
→ pwq->stats[PWQ_STAT_COMPLETED]++;
lock_map_release(&lockdep_map);
lock_map_release(&pwq->wq->lockdep_map);
-------------------------------------------------
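
For illustration, the atomic variant would look something like this (hypothetical: it
assumes pwq->stats[] were changed to an array of atomic64_t, which would touch every
accessor and the monitoring side as well):

        atomic64_inc(&pwq->stats[PWQ_STAT_COMPLETED]);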

-8/8-[SAFE BETWEEN LOCK AND UNLOCK]--------------
raw_spin_lock_irq(&pool->lock);

/*
 * Slurp in all works issued via this workqueue and
 * process'em.
 */
WARN_ON_ONCE(!list_empty(scheduled));
list_for_each_entry_safe(work, n, &pool->worklist, entry) {
        if (get_work_pwq(work) == pwq) {
                if (first)
                        pool->watchdog_ts = jiffies;
                move_linked_works(work, scheduled, &n);
→               pwq->stats[PWQ_STAT_RESCUED]++;
        }
        first = false;
}

if (!list_empty(scheduled)) {
        process_scheduled_works(rescuer);

        /*
         * The above execution of rescued work items could
         * have created more to rescue through
         * pwq_activate_first_inactive() or chained
         * queueing. Let's put @pwq back on mayday list so
         * that such back-to-back work items, which may be
         * being used to relieve memory pressure, don't
         * incur MAYDAY_INTERVAL delay inbetween.
         */
        if (pwq->nr_active && need_to_create_worker(pool)) {
                raw_spin_lock(&wq_mayday_lock);
                /*
                 * Queue iff we aren't racing destruction
                 * and somebody else hasn't queued it already.
                 */
                if (wq->rescuer && list_empty(&pwq->mayday_node)) {
                        get_pwq(pwq);
                        list_add_tail(&pwq->mayday_node, &wq->maydays);
                }
                raw_spin_unlock(&wq_mayday_lock);
        }
}

/*
 * Put the reference grabbed by send_mayday(). @pool won't
 * go away while we're still attached to it.
 */
put_pwq(pwq);

/*
 * Leave this pool. If need_more_worker() is %true, notify a
 * regular worker; otherwise, we end up with 0 concurrency
 * and stalling the execution.
 */
if (need_more_worker(pool))
        wake_up_worker(pool);

raw_spin_unlock_irq(&pool->lock);
-----------------------------------------------------

The remaining ones might simply not have been triggered yet, but logically they look suspicious for a data race.
I still have to verify whether they are called from a locked context ...

Of course, as the original developer, you will have a clearer picture; I was merely led to this
particular detail by KCSAN.

Hope this helps.

Best regards,
Mirsad Todorovac