Re: [net-next] bpf: avoid hashtab deadlock with try_lock

From: Waiman Long
Date: Tue Nov 29 2022 - 22:08:10 EST


On 11/29/22 21:47, Tonghao Zhang wrote:
On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
Hi Hao,

On 11/30/2022 3:36 AM, Hao Luo wrote:
On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
Just to be clear, I meant to refactor htab_lock_bucket() into a trylock
pattern. Also, on second thought, the below suggestion doesn't work. I
think the proper way is to make htab_lock_bucket() use
raw_spin_trylock_irqsave().

Regards,
Boqun

The potential deadlock happens when the lock is contended from the
same cpu. When the lock is contended from a remote cpu, we would like
the remote cpu to spin and wait instead of giving up immediately, as
this gives better throughput. So replacing the current
raw_spin_lock_irqsave() with a trylock sacrifices this performance gain.

I suspect the source of the problem is the 'hash' that we use in
htab_lock_bucket(). The 'hash' is derived from the 'key'; I wonder
whether we should use a hash derived from the 'bucket' rather than from
the 'key'. For example, from the memory address of the 'bucket'.
Different keys may fall into the same bucket but yield different
hashes. If the same bucket can never have two different 'hashes' here,
the map_locked check should behave as intended. Also, because
->map_locked is per-cpu, execution flows from two different cpus can
both pass it.
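
For example, something like this untested sketch could derive the
map_locked index from the bucket pointer (htab_bucket_lock_idx() is
just a made-up helper name for illustration):

#include <linux/hash.h>
#include <linux/log2.h>

/* Illustrative only: derive the map_locked index from the bucket
 * address instead of the key hash, so a given bucket always maps to
 * the same index. */
static inline u32 htab_bucket_lock_idx(const struct bucket *b)
{
        /* HASHTAB_MAP_LOCK_COUNT is 8, so 3 bits of the pointer hash
         * are enough to index map_locked. */
        return hash_ptr(b, ilog2(HASHTAB_MAP_LOCK_COUNT));
}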
The warning from lockdep is because the bucket lock A is first used in a
non-NMI context, and then the same bucket lock is used in an NMI context, so
lockdep reports it as a possible deadlock.
Yes, I tested with lockdep too. We can't take the lock in NMI context
(only trylock works fine there) if we also take it in non-NMI context;
otherwise lockdep prints the warning.
* for the deadlock case, we can use either:
1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
2. the hash bucket address.

* for the lockdep warning, we should use an in_nmi() check together with map_locked.

BTW, the patch doesn't work, so we can remove the lock_key
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;

        hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

        preempt_disable();
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                __this_cpu_dec(*(htab->map_locked[hash]));
                preempt_enable();
                return -EBUSY;
        }

        if (in_nmi()) {
                if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
                        return -EBUSY;

That is not right. You have to do the same steps as above: decrement the percpu count and enable preemption. So you may want to put all these busy_out steps after the "return 0;" and use "goto busy_out;" to jump there.

        } else {
                raw_spin_lock_irqsave(&b->raw_lock, flags);
        }

        *pflags = flags;
        return 0;
}
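
For example, an untested sketch of the same function with a common
busy_out error path could look like this:

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;

        hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

        preempt_disable();
        /* Bail out if this bucket's lock index is already held on this cpu. */
        if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1))
                goto busy_out;

        if (in_nmi()) {
                /* In NMI context we must not spin on the lock. */
                if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
                        goto busy_out;
        } else {
                raw_spin_lock_irqsave(&b->raw_lock, flags);
        }

        *pflags = flags;
        return 0;

busy_out:
        __this_cpu_dec(*(htab->map_locked[hash]));
        preempt_enable();
        return -EBUSY;
}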

BTW, with that change, I believe you can actually remove all the percpu map_locked count code.
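
If so, an untested sketch of that simplification (keeping the same
signature, with the hash/map_locked bookkeeping dropped) might look
like:

static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                   struct bucket *b, u32 hash,
                                   unsigned long *pflags)
{
        unsigned long flags;

        /* hash and htab->map_locked are no longer used here. */
        if (in_nmi()) {
                if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
                        return -EBUSY;
        } else {
                raw_spin_lock_irqsave(&b->raw_lock, flags);
        }

        *pflags = flags;
        return 0;
}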

Cheers,
Longman