Re: [net-next] bpf: avoid hashtab deadlock with try_lock

From: Waiman Long
Date: Tue Nov 29 2022 - 11:07:51 EST


On 11/29/22 07:45, Hou Tao wrote:
Hi,

On 11/29/2022 2:06 PM, Tonghao Zhang wrote:
On Tue, Nov 29, 2022 at 12:32 PM Hou Tao <houtao1@xxxxxxxxxx> wrote:
Hi,

On 11/29/2022 5:55 AM, Hao Luo wrote:
On Sun, Nov 27, 2022 at 7:15 PM Tonghao Zhang <xiangxia.m.yue@xxxxxxxxx> wrote:
Hi Tonghao,

With a quick look at the htab_lock_bucket() and your problem
statement, I agree with Hou Tao that using hash &
min(HASHTAB_MAP_LOCK_MASK, n_bucket - 1) to index in map_locked seems
to fix the potential deadlock. Can you actually send your changes as
v2 so we can take a look and better help you? Also, can you explain
your solution in your commit message? Right now, your commit message
has only a problem statement and is not very clear. Please include
more details on what you do to fix the issue.

Hao
It would be better if the test case below can be rewritten as a bpf selftests.
Please see comments below on how to improve it and reproduce the deadlock.
Hi
only a warning from lockdep.
Thanks for your details instruction. I can reproduce the warning by using your
setup. I am not a lockdep expert, it seems that fixing such warning needs to set
different lockdep class to the different bucket. Because we use map_locked to
protect the acquisition of bucket lock, so I think we can define lock_class_key
array in bpf_htab (e.g., lockdep_key[HASHTAB_MAP_LOCK_COUNT]) and initialize the
bucket lock accordingly.
The proposed lockdep solution doesn't work. Still got lockdep warning after
that, so cc +locking expert +lkml.org for lockdep help.

Hi lockdep experts,

We are trying to fix the following lockdep warning from bpf subsystem:

[   36.092222] ================================
[   36.092230] WARNING: inconsistent lock state
[   36.092234] 6.1.0-rc5+ #81 Tainted: G            E
[   36.092236] --------------------------------
[   36.092237] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   36.092238] perf/1515 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   36.092242] ffff888341acd1a0 (&htab->lockdep_key){....}-{2:2}, at:
htab_lock_bucket+0x4d/0x58
[   36.092253] {INITIAL USE} state was registered at:
[   36.092255]   mark_usage+0x1d/0x11d
[   36.092262]   __lock_acquire+0x3c9/0x6ed
[   36.092266]   lock_acquire+0x23d/0x29a
[   36.092270]   _raw_spin_lock_irqsave+0x43/0x7f
[   36.092274]   htab_lock_bucket+0x4d/0x58
[   36.092276]   htab_map_delete_elem+0x82/0xfb
[   36.092278]   map_delete_elem+0x156/0x1ac
[   36.092282]   __sys_bpf+0x138/0xb71
[   36.092285]   __do_sys_bpf+0xd/0x15
[   36.092288]   do_syscall_64+0x6d/0x84
[   36.092291]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   36.092295] irq event stamp: 120346
[   36.092296] hardirqs last  enabled at (120345): [<ffffffff8180b97f>]
_raw_spin_unlock_irq+0x24/0x39
[   36.092299] hardirqs last disabled at (120346): [<ffffffff81169e85>]
generic_exec_single+0x40/0xb9
[   36.092303] softirqs last  enabled at (120268): [<ffffffff81c00347>]
__do_softirq+0x347/0x387
[   36.092307] softirqs last disabled at (120133): [<ffffffff810ba4f0>]
__irq_exit_rcu+0x67/0xc6
[   36.092311]
[   36.092311] other info that might help us debug this:
[   36.092312]  Possible unsafe locking scenario:
[   36.092312]
[   36.092313]        CPU0
[   36.092313]        ----
[   36.092314]   lock(&htab->lockdep_key);
[   36.092315]   <Interrupt>
[   36.092316]     lock(&htab->lockdep_key);
[   36.092318]
[   36.092318]  *** DEADLOCK ***
[   36.092318]
[   36.092318] 3 locks held by perf/1515:
[   36.092320]  #0: ffff8881b9805cc0 (&cpuctx_mutex){+.+.}-{4:4}, at:
perf_event_ctx_lock_nested+0x8e/0xba
[   36.092327]  #1: ffff8881075ecc20 (&event->child_mutex){+.+.}-{4:4}, at:
perf_event_for_each_child+0x35/0x76
[   36.092332]  #2: ffff8881b9805c20 (&cpuctx_lock){-.-.}-{2:2}, at:
perf_ctx_lock+0x12/0x27
[   36.092339]
[   36.092339] stack backtrace:
[   36.092341] CPU: 0 PID: 1515 Comm: perf Tainted: G            E
6.1.0-rc5+ #81
[   36.092344] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   36.092349] Call Trace:
[   36.092351]  <NMI>
[   36.092354]  dump_stack_lvl+0x57/0x81
[   36.092359]  lock_acquire+0x1f4/0x29a
[   36.092363]  ? handle_pmi_common+0x13f/0x1f0
[   36.092366]  ? htab_lock_bucket+0x4d/0x58
[   36.092371]  _raw_spin_lock_irqsave+0x43/0x7f
[   36.092374]  ? htab_lock_bucket+0x4d/0x58
[   36.092377]  htab_lock_bucket+0x4d/0x58
[   36.092379]  htab_map_update_elem+0x11e/0x220
[   36.092386]  bpf_prog_f3a535ca81a8128a_bpf_prog2+0x3e/0x42
[   36.092392]  trace_call_bpf+0x177/0x215
[   36.092398]  perf_trace_run_bpf_submit+0x52/0xaa
[   36.092403]  ? x86_pmu_stop+0x97/0x97
[   36.092407]  perf_trace_nmi_handler+0xb7/0xe0
[   36.092415]  nmi_handle+0x116/0x254
[   36.092418]  ? x86_pmu_stop+0x97/0x97
[   36.092423]  default_do_nmi+0x3d/0xf6
[   36.092428]  exc_nmi+0xa1/0x109
[   36.092432]  end_repeat_nmi+0x16/0x67
[   36.092436] RIP: 0010:wrmsrl+0xd/0x1b

So the lock is really taken in a NMI context. In general, we advise again using lock in a NMI context unless it is a lock that is used only in that context. Otherwise, deadlock is certainly a possibility as there is no way to mask off again NMI.

Cheers,
Longman