Re: [RFC] kretprobe: Prevent triggering kretprobe from within kprobe_flush_task

From: Naveen N. Rao
Date: Thu Apr 09 2020 - 09:17:04 EST


Hi Masami,

Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Wed, 8 Apr 2020 18:46:41 +0200
> Jiri Olsa <jolsa@xxxxxxxxxx> wrote:

> > hi,
> > Ziqian reported a lockup when adding a retprobe on _raw_spin_lock_irqsave.
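(For context: the reported setup boils down to a return probe on
_raw_spin_lock_irqsave(), along the lines of
samples/kprobes/kretprobe_example.c. A minimal sketch follows; the
module and handler names are illustrative, not the reporter's actual
test:)

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	/* runs on every return from the probed function */
	return 0;
}

static struct kretprobe my_kretprobe = {
	.handler	= ret_handler,
	.kp.symbol_name	= "_raw_spin_lock_irqsave",
	.maxactive	= 20,	/* concurrent return instances to track */
};

static int __init rp_init(void)
{
	return register_kretprobe(&my_kretprobe);
}

static void __exit rp_exit(void)
{
	unregister_kretprobe(&my_kretprobe);
}

module_init(rp_init);
module_exit(rp_exit);
MODULE_LICENSE("GPL");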

> Hmm, kprobe is lockless, but kretprobe involves a spinlock.
> Thus, eventually, I will blacklist _raw_spin_lock_irqsave()
> for kretprobe.

As far as I can see, this is the only place where probing
_raw_spin_lock_irqsave() is an issue. Should we blacklist it for all
users because of this one case alone?
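For reference, kretprobe-specific blacklisting already has a mechanism:
the per-arch kretprobe_blacklist[] table, which register_kretprobe()
consults. Banning this one symbol would look roughly like the sketch
below (the entry is hypothetical, shown only to illustrate the
mechanism):

/* e.g. in the arch kprobes code -- hypothetical entry */
struct kretprobe_blacklist_entry kretprobe_blacklist[] = {
	{"_raw_spin_lock_irqsave", NULL},
	{NULL, NULL}	/* sentinel */
};
const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);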

> If you need to trace spinlock return, please consider putting a
> kprobe at the "ret" instruction.

> > My test was also able to trigger lockdep output:
> > 
> > ============================================
> > WARNING: possible recursive locking detected
> > 5.6.0-rc6+ #6 Not tainted
> > --------------------------------------------
> > sched-messaging/2767 is trying to acquire lock:
> > ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0
> > 
> > but task is already holding lock:
> > ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
> > 
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> > 
> >        CPU0
> >        ----
> >   lock(&(kretprobe_table_locks[i].lock));
> >   lock(&(kretprobe_table_locks[i].lock));
> > 
> >  *** DEADLOCK ***
> > 
> >  May be due to missing lock nesting notation
> > 
> > 1 lock held by sched-messaging/2767:
> >  #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50
> > 
> > stack backtrace:
> > CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
> > Call Trace:
> >  dump_stack+0x96/0xe0
> >  __lock_acquire.cold.57+0x173/0x2b7
> >  ? native_queued_spin_lock_slowpath+0x42b/0x9e0
> >  ? lockdep_hardirqs_on+0x590/0x590
> >  ? __lock_acquire+0xf63/0x4030
> >  lock_acquire+0x15a/0x3d0
> >  ? kretprobe_hash_lock+0x52/0xa0
> >  _raw_spin_lock_irqsave+0x36/0x70
> >  ? kretprobe_hash_lock+0x52/0xa0
> >  kretprobe_hash_lock+0x52/0xa0
> >  trampoline_handler+0xf8/0x940
> >  ? kprobe_fault_handler+0x380/0x380
> >  ? find_held_lock+0x3a/0x1c0
> >  kretprobe_trampoline+0x25/0x50
> >  ? lock_acquired+0x392/0xbc0
> >  ? _raw_spin_lock_irqsave+0x50/0x70
> >  ? __get_valid_kprobe+0x1f0/0x1f0
> >  ? _raw_spin_unlock_irqrestore+0x3b/0x40
> >  ? finish_task_switch+0x4b9/0x6d0
> >  ? __switch_to_asm+0x34/0x70
> >  ? __switch_to_asm+0x40/0x70

> > The code within the kretprobe handler checks for probe reentrancy,
> > so we won't trigger any _raw_spin_lock_irqsave probe in there.
> > 
> > The problem is outside of that, in kprobe_flush_task, where we call:
> > 
> >   kprobe_flush_task
> >     kretprobe_table_lock
> >       raw_spin_lock_irqsave
> >         _raw_spin_lock_irqsave
> > 
> > where _raw_spin_lock_irqsave triggers the kretprobe and installs the
> > kretprobe_trampoline handler on _raw_spin_lock_irqsave's return.
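(The reentrancy check referred to here lives in the arch probe entry
path. A simplified sketch, modeled on the x86 int3 handler after it has
looked up the hit kprobe 'p' -- not the exact code:)

	if (kprobe_running()) {
		/*
		 * Another kprobe is already being handled on this CPU
		 * (current_kprobe is set): do not run user handlers.
		 */
		reenter_kprobe(p, regs, kcb);
	} else {
		set_current_kprobe(p, regs, kcb);
		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
		/* pre_handler_kretprobe() is invoked from here */
		if (!p->pre_handler || !p->pre_handler(p, regs))
			setup_singlestep(p, regs, kcb, 0);
	}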

> Hmm, OK. In this case, I think we should mark that this process is
> going to die and never try the kretprobe on it.


> > The kretprobe_trampoline handler is then executed with
> > kretprobe_table_locks already locked, and the first thing it does
> > is to lock kretprobe_table_locks again ;-) The whole lockup path
> > looks like:
> > 
> >   kprobe_flush_task
> >     kretprobe_table_lock
> >       raw_spin_lock_irqsave
> >         _raw_spin_lock_irqsave ---> probe triggered,
> >                                     kretprobe_trampoline installed
> > 
> >         ---> kretprobe_table_locks locked
> > 
> >         kretprobe_trampoline
> >           trampoline_handler
> >             kretprobe_hash_lock(current, &head, &flags); <--- deadlock
> > 
> > The change below sets current_kprobe in kprobe_flush_task, so the
> > probe recursion protection check is hit and the probe is never set.
> > It seems to fix the deadlock.
> > 
> > I'm not sure this is the best fix, any ideas are welcome ;-)
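(The RFC diff itself is not quoted above. The idea, as a rough sketch
only -- 'kprobe_flush_dummy' is a name made up here, not from the
patch:)

static struct kprobe kprobe_flush_dummy;

void kprobe_flush_task(struct task_struct *tk)
{
	unsigned long hash = hash_ptr(tk, KPROBE_HASH_BITS);
	unsigned long flags;

	/*
	 * Pretend a kprobe is running: any probe hit from here on,
	 * including the one on _raw_spin_lock_irqsave(), is treated
	 * as reentrant and its return trampoline is never armed.
	 */
	__this_cpu_write(current_kprobe, &kprobe_flush_dummy);

	kretprobe_table_lock(hash, &flags);
	/* ... reclaim this task's kretprobe instances ... */
	kretprobe_table_unlock(hash, &flags);

	__this_cpu_write(current_kprobe, NULL);
}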

> Hmm, it is a bit tricky to fix this issue. Of course, temporarily
> disabling kprobes (and kretprobes) in an area by filling in
> current_kprobe might be a good idea, but it also affects other
> kprobes.

I'm not sure what you mean by that. Jiri's RFC patch would be disabling
k[ret]probes within kprobe_flush_task(), which is only ever invoked from
finish_task_switch(). I only see calls to the spin locks and kfree()
from there. Besides, kprobe_flush_task() itself is NOKPROBE, so we would
ideally want to not trace/probe the other functions it calls either.
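(For reference, kernel/kprobes.c already has:

void kprobe_flush_task(struct task_struct *tk)
{
	/* ... walks kretprobe_table_locks[] and frees instances ... */
}
NOKPROBE_SYMBOL(kprobe_flush_task);

so the flush path itself cannot be probed; only the spinlock and
kfree() calls it makes can be.)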


> How about letting kretprobe skip tasks whose state == TASK_DEAD?

> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 627fc1b7011a..3f207d2e0afb 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1874,9 +1874,12 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
>  	 * To avoid deadlocks, prohibit return probing in NMI contexts,
>  	 * just skip the probe and increase the (inexact) 'nmissed'
>  	 * statistical counter, so that the user is informed that
> -	 * something happened:
> +	 * something happened.
> +	 * Also, if the current task is dead, we are already in the process
> +	 * of reclaiming kretprobe instances from the hash list. To avoid a
> +	 * memory leak, skip running the kretprobe on such a task.
>  	 */
> -	if (unlikely(in_nmi())) {
> +	if (unlikely(in_nmi()) || current->state == TASK_DEAD) {
>  		rp->nmissed++;
>  		return 0;
>  	}

I'm wondering if this actually works. kprobe_flush_task() is called from
finish_task_switch(), after the task switch is complete. So, the current
task won't actually be the dead one here.
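For reference, simplified from kernel/sched/core.c around v5.6 (sketch,
not the full function):

static struct rq *finish_task_switch(struct task_struct *prev)
{
	long prev_state = prev->state;	/* state of the task we left */

	/* ... the switch to the next task is already complete here ... */

	if (unlikely(prev_state == TASK_DEAD)) {
		/*
		 * Remove function-return probe instances associated
		 * with this task and put them back on the free list.
		 */
		kprobe_flush_task(prev);

		/* Task is done with its stack. */
		put_task_stack(prev);
		put_task_struct_rcu_user(prev);
	}
	/* ... */
}

At this point 'current' is already the next task, so a check on
current->state would not catch the dead task being flushed.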


- Naveen