Re: [net-next] bpf: avoid hashtab deadlock with try_lock

From: Waiman Long
Date: Tue Nov 29 2022 - 23:08:27 EST


On 11/29/22 22:32, Tonghao Zhang wrote:
On Wed, Nov 30, 2022 at 11:07 AM Waiman Long <longman@xxxxxxxxxx> wrote:
On 11/29/22 21:47, Tonghao Zhang wrote:
On Wed, Nov 30, 2022 at 9:50 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
Hi Hao,

On 11/30/2022 3:36 AM, Hao Luo wrote:
On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
Just to be clear, I meant to refactor htab_lock_bucket() into a try-lock pattern. Also, after a second thought, the below suggestion doesn't work. I think the proper way is to make htab_lock_bucket() use raw_spin_trylock_irqsave().

Regards,
Boqun

The potential deadlock happens when the lock is contended from the
same cpu. When the lock is contended from a remote cpu, we would like
the remote cpu to spin and wait instead of giving up immediately, as
this gives better throughput. So replacing the current
raw_spin_lock_irqsave() with a trylock sacrifices this performance gain.

I suspect the source of the problem is the 'hash' that we used in
htab_lock_bucket(). The 'hash' is derived from the 'key', I wonder
whether we should use a hash derived from 'bucket' rather than from
'key'. For example, from the memory address of the 'bucket'. Because,
different keys may fall into the same bucket, but yield different
hashes. If the same bucket can never have two different 'hashes' here,
the map_locked check should behave as intended. Also because
->map_locked is per-cpu, execution flows from two different cpus can
both pass.
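Something along these lines might capture that idea (a rough sketch only; htab_lock_idx() is a made-up helper name, and it assumes the existing HASHTAB_MAP_LOCK_MASK and map_locked[] machinery):

/* hash_ptr() is from <linux/hash.h>. */
static inline u32 htab_lock_idx(const struct bucket *b)
{
	/* Mix the bucket address down to a small index so that a given
	 * bucket always selects the same map_locked[] slot, no matter
	 * which key hashed into it.
	 */
	return hash_ptr(b, 32) & HASHTAB_MAP_LOCK_MASK;
}

htab_lock_bucket() would then index map_locked[] with htab_lock_idx(b) instead of the key-derived hash.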
The warning from lockdep is due to the fact that bucket lock A is used in a
non-NMI context first, and then the same bucket lock is used in an NMI context, so
lockdep reports a possible deadlock.
Yes, I tested with lockdep too. We can't take the lock in NMI context (only
trylock works fine there) if we also use it in non-NMI context; otherwise
lockdep prints the warning.
* For the deadlock case, we can use either:
1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
2. the hash bucket address.

* For the lockdep warning, we should use an in_nmi() check together with map_locked.

BTW, the patch doesn't work, so we can remove the lock_key
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		preempt_enable();
		return -EBUSY;
	}

	if (in_nmi()) {
		if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
			return -EBUSY;
That is not right. You have to do the same steps as above by decrementing
the percpu count and enabling preemption. So you may want to put all these
busy-out steps after the "return 0" and use "goto busy_out;" to jump there.
Yes, thanks Waiman, I should add the busy_out label.
	} else {
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	}

	*pflags = flags;
	return 0;
}
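
With the busy_out label folded in as suggested, the function would look roughly like this (an untested sketch of the code above, only restructured):

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1))
		goto busy_out;

	if (in_nmi()) {
		/* May already hold the lock on this cpu, so only try once. */
		if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
			goto busy_out;
	} else {
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	}

	*pflags = flags;
	return 0;

busy_out:
	__this_cpu_dec(*(htab->map_locked[hash]));
	preempt_enable();
	return -EBUSY;
}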
BTW, with that change, I believe you can actually remove all the percpu
map_locked count code.
There are some cases where it still matters. For example, we run bpf progs A and B in task
context on the same cpu:

bpf_prog A
  update map X
    htab_lock_bucket
      raw_spin_lock_irqsave()
  lookup_elem_raw()
    // bpf prog B is attached on lookup_elem_raw()
    bpf_prog B
      update map X again and update the element
        htab_lock_bucket()
          // dead-lock
          raw_spin_lock_irqsave()

I see, so nested locking is possible in this case. Besides using the percpu map_locked count, another way is to have a cpumask associated with each bucket lock and use each bit in the cpumask to control access with test_and_set_bit() for each cpu. That would allow more concurrency, and you could actually find out how contended the lock is. Anyway, it is just a thought.
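
A very rough sketch of what that could look like (the lock_owners cpumask field in struct bucket and the helper names here are hypothetical, nothing that exists in the current code):

static inline int htab_lock_bucket_cpumask(struct bucket *b,
					   unsigned long *pflags)
{
	unsigned long flags;
	int cpu;

	preempt_disable();
	cpu = smp_processor_id();

	/* Nested attempt from the same cpu: the bit is already set, back off. */
	if (test_and_set_bit(cpu, cpumask_bits(&b->lock_owners))) {
		preempt_enable();
		return -EBUSY;
	}

	/* Other cpus simply spin here, so remote contention still waits. */
	raw_spin_lock_irqsave(&b->raw_lock, flags);
	*pflags = flags;
	return 0;
}

static inline void htab_unlock_bucket_cpumask(struct bucket *b,
					      unsigned long flags)
{
	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
	clear_bit(smp_processor_id(), cpumask_bits(&b->lock_owners));
	preempt_enable();
}

The set bits in lock_owners would also show at any time which cpus are inside (or waiting on) a given bucket, which is the contention information mentioned above.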

Cheers,
Longman