Re: [PATCH] kernfs: convert kernfs_idr_lock to an irq safe raw spinlock

From: Geert Uytterhoeven
Date: Tue Jan 09 2024 - 11:36:04 EST


Hi Andrea,

On Fri, Dec 29, 2023 at 8:56 AM Andrea Righi <andrea.righi@xxxxxxxxxxxxx> wrote:
> bpf_cgroup_from_id() is basically a wrapper to cgroup_get_from_id(),
> that is relying on kernfs to determine the right cgroup associated to
> the target id.
>
> As a kfunc, it has the potential to be attached to any function through
> BPF, particularly in contexts where certain locks are held.
>
> However, kernfs is not using an irq safe spinlock for kernfs_idr_lock,
> that means any kernfs function that is acquiring this lock can be
> interrupted and potentially hit bpf_cgroup_from_id() in the process,
> triggering a deadlock.
>
> For example, it is really easy to trigger a lockdep splat between
> kernfs_idr_lock and rq->_lock, attaching a small BPF program to
> __set_cpus_allowed_ptr_locked() that just calls bpf_cgroup_from_id():
>
> =====================================================
> WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> 6.7.0-rc7-virtme #5 Not tainted
> -----------------------------------------------------
> repro/131 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> ffffffffb2dc4578 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80
>
> and this task is already holding:
> ffff911cbecaf218 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x50/0xc0
> which would create a new lock dependency:
> (&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
>
> but this new dependency connects a HARDIRQ-irq-safe lock:
> (&rq->__lock){-.-.}-{2:2}
>
> ... which became HARDIRQ-irq-safe at:
> lock_acquire+0xbf/0x2b0
> _raw_spin_lock_nested+0x2e/0x40
> scheduler_tick+0x5d/0x170
> update_process_times+0x9c/0xb0
> tick_periodic+0x27/0xe0
> tick_handle_periodic+0x24/0x70
> __sysvec_apic_timer_interrupt+0x64/0x1a0
> sysvec_apic_timer_interrupt+0x6f/0x80
> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> memcpy+0xc/0x20
> arch_dup_task_struct+0x15/0x30
> copy_process+0x1ce/0x1eb0
> kernel_clone+0xac/0x390
> kernel_thread+0x6f/0xa0
> kthreadd+0x199/0x230
> ret_from_fork+0x31/0x50
> ret_from_fork_asm+0x1b/0x30
>
> to a HARDIRQ-irq-unsafe lock:
> (kernfs_idr_lock){+.+.}-{2:2}
>
> ... which became HARDIRQ-irq-unsafe at:
> ...
> lock_acquire+0xbf/0x2b0
> _raw_spin_lock+0x30/0x40
> __kernfs_new_node.isra.0+0x83/0x280
> kernfs_create_root+0xf6/0x1d0
> sysfs_init+0x1b/0x70
> mnt_init+0xd9/0x2a0
> vfs_caches_init+0xcf/0xe0
> start_kernel+0x58a/0x6a0
> x86_64_start_reservations+0x18/0x30
> x86_64_start_kernel+0xc5/0xe0
> secondary_startup_64_no_verify+0x178/0x17b
>
> other info that might help us debug this:
>
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(kernfs_idr_lock);
> local_irq_disable();
> lock(&rq->__lock);
> lock(kernfs_idr_lock);
> <Interrupt>
> lock(&rq->__lock);
>
> *** DEADLOCK ***
>
> Prevent this deadlock condition converting kernfs_idr_lock to a raw irq
> safe spinlock.
>
> The performance impact of this change should be negligible and it also
> helps to prevent similar deadlock conditions with any other subsystems
> that may depend on kernfs.
>
> Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
> Signed-off-by: Andrea Righi <andrea.righi@xxxxxxxxxxxxx>

Thanks for your patch, which is now commit c312828c37a72fe2
("kernfs: convert kernfs_idr_lock to an irq safe raw spinlock")
in driver-core/driver-core-next.

Unfortunately this interacts badly with commit 4eff7d62abdeb293 ("Revert
"mm/kmemleak: move the initialisation of object to __link_object"")
in v6.7-rc5.

driver-core/driver-core-next is still at v6.7-rc3, so it does not
yet have commit 4eff7d62abdeb293, and thus still triggers:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc3-kzm9g-00052-gc312828c37a7 #576 Not tainted
-----------------------------
swapper/0 is trying to lock:
c0c6e3c4 (&zone->lock){....}-{3:3}, at: __rmqueue_pcplist+0x358/0x3c8
other info that might help us debug this:
context-{5:5}
3 locks held by swapper/0:
#0: c0bf35a0 (slab_mutex){....}-{4:4}, at:
kmem_cache_create_usercopy+0xc8/0x2d0
#1: c0bfab0c (kmemleak_lock){....}-{2:2}, at: __create_object+0x2c/0x7c
#2: dfbc8c90 (&pcp->lock){....}-{3:3}, at:
get_page_from_freelist+0x1a0/0x684
stack backtrace:
CPU: 0 PID: 0 Comm: swapper Not tainted
6.7.0-rc3-kzm9g-00052-gc312828c37a7 #576
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from _raw_spin_lock_irqsave+0x50/0x64
_raw_spin_lock_irqsave from __rmqueue_pcplist+0x358/0x3c8
__rmqueue_pcplist from get_page_from_freelist+0x3bc/0x684
get_page_from_freelist from __alloc_pages+0xe8/0xad8
__alloc_pages from __stack_depot_save+0x160/0x398
__stack_depot_save from set_track_prepare+0x48/0x74
set_track_prepare from __link_object+0xac/0x204
__link_object from __create_object+0x48/0x7c
__create_object from kmemleak_alloc+0x2c/0x38
kmemleak_alloc from slab_post_alloc_hook.constprop.0+0x9c/0xac
slab_post_alloc_hook.constprop.0 from kmem_cache_alloc+0xcc/0x148
kmem_cache_alloc from kmem_cache_create_usercopy+0x1c4/0x2d0
kmem_cache_create_usercopy from kmem_cache_create+0x1c/0x24
kmem_cache_create from kmemleak_init+0x58/0xfc
kmemleak_init from mm_core_init+0x244/0x2c8
mm_core_init from start_kernel+0x274/0x528
start_kernel from 0x0

After merging driver-core/driver-core-next into a tree based on
v6.7-rc5, or after cherry-picking commit 4eff7d62abdeb293 into
driver-core/driver-core-next, the above BUG is gone, but a different
one appears:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at:
__kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0

Reverting commit c312828c37a72fe2 fixes that.
I have seen this issue on several Renesas arm32 and arm64 platforms.

Also, I am wondering if the issue fixed by commit c312828c37a72fe2
can still be reproduced on v6.7-rc5 or later?

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68korg

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds