Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

From: Peter Zijlstra
Date: Fri May 05 2023 - 04:36:14 EST


On Thu, May 04, 2023 at 04:01:23PM +0900, Tetsuo Handa wrote:
> On 2023/05/04 15:16, Hillf Danton wrote:
> >> 4 locks held by syz-executor.2/5077:
> >> #0: ffff8880b993c2d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2f/0x120 kernel/sched/core.c:539
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: mm_cid_get kernel/sched/sched.h:3280 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: switch_mm_cid kernel/sched/sched.h:3302 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: prepare_task_switch kernel/sched/core.c:5117 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: context_switch kernel/sched/core.c:5258 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: __schedule+0x2802/0x5770 kernel/sched/core.c:6625
> >> #2: ffff8880b9929698 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x5a/0x1f0 kernel/time/timer.c:999
> >> #3: ffffffff91fb4ac8 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_activate+0x134/0x3f0 lib/debugobjects.c:690
> >
> > What is hard to understand in this report is, how could it acquire the
> > timer base lock with the mm cid lock held [1]?
>
> Please be aware that lockdep_print_held_locks() is not an atomic action.
> Since synchronous printk() is slow, it can sometimes happen that
> task_is_running(p) becomes true after passing the
>
> 	if (p != current && task_is_running(p))
> 		return;
>
> check. I think that this trace is an example where print_lock() by chance hit
> hlock_class(p->held_locks + 2) != NULL. If sched_show_task() were also available,
> we can know it via mismatch between sched_show_task() and lockdep_print_held_locks().
>
> Linus, I think that "[PATCH v3 (repost)] locking/lockdep: add debug_show_all_lock_holders()"
> helps here, but I can't wake up locking people. What can we do?

How is that not also racy ?
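
To make the window under discussion concrete, here is a minimal userspace
sketch of a check-then-read report. It is not the lockdep code:
`struct fake_task`, `report_held_locks()` and `wake_and_lock()` are
hypothetical names, and the `delay` hook only stands in for whatever time
passes (for example a slow printk()) between the check and the read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct fake_task {
	bool running;	/* plays the role of task_is_running(p) */
	int nr_held;	/* plays the role of the held-lock depth */
};

/* Hook that models time passing between the check and the read. */
typedef void (*delay_fn)(struct fake_task *t);

/*
 * Check-then-read with no synchronization. Returns the number of held
 * locks that would be reported, or -1 if the task was skipped.
 */
static int report_held_locks(struct fake_task *t, delay_fn delay)
{
	assert(t != NULL);

	if (t->running)		/* check */
		return -1;

	if (delay)
		delay(t);	/* the task may start running here */

	return t->nr_held;	/* read: may no longer match the check */
}

/* Models the task waking up and taking locks inside the window. */
static void wake_and_lock(struct fake_task *t)
{
	t->running = true;
	t->nr_held = 4;
}
```

With `wake_and_lock` passed as the hook, the function reports four held
locks for a task that looked idle at check time. Any scheme that checks
state and reads it later without synchronization has the same window.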

I think I've seen that patch, and it had some 'blurb' Changelog that
leaves me wondering wtf the actual problem is and how it attempts to
solve it, so I went on with looking at regressions because those are
more important than a random weird patch.