Re: [PATCH -tip v2] x86/kprobes: Drop removed INT3 handling code

From: Google
Date: Sun Jan 21 2024 - 04:05:58 EST


On Sun, 21 Jan 2024 11:28:52 +0900
Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx> wrote:

> On Sat, 20 Jan 2024 17:05:17 -0500
> Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> > On Sat, 20 Jan 2024 18:44:38 +0100
> > Matthieu Baerts <matttbe@xxxxxxxxxx> wrote:
> >
> > >
> > > I'm sorry to reply on a patch that is more than one year old, but in
> >
> > No problem, I've done the same.
>
> Yeah, thanks for reporting! I realized the problem.
>
> >
> > > short, it looks like it is causing a kernel panic on our side. To be
> > > honest, I don't understand the link, but I probably missed something.
> > >
> > > A bit of context: for the MPTCP net subsystem, we are testing a new CI
> > > service to launch a VM and run our test suite (kselftests, kunit, etc.).
> > > This CI (Github Action) doesn't support KVM acceleration, and QEmu
> > > (v6.2.0) falls back to TCG ("-machine accel=kvm:tcg"). Before, we were
> > > always running the tests with QEmu and KVM support, and I don't think we
> > > had this issue before. Now, in two weeks, this new CI reported 5 kernel
> > > panic in ~30 runs.
> >
> > I'm guessing that qemu doesn't do something that real hardware will do,
> > which is causing the bug.
>
> If this is the timing bug, it is not qemu's issue, but ours.

Hmm, as far as I can see, the jump_label uses text_poke_bp(_batch) which
send IPI for sync_core() on each core, after replacing INT3 with other opcode.

void text_poke_sync(void)
{
on_each_cpu(do_sync_core, NULL, 1);
}

static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
{
[...]
/*
* Third step: replace the first byte (int3) by the first byte of
* replacing opcode.
*/
for (do_sync = 0, i = 0; i < nr_entries; i++) {
u8 byte = tp[i].text[0];

if (tp[i].len == 6)
byte = 0x0f;

if (byte == INT3_INSN_OPCODE)
continue;

text_poke(text_poke_addr(&tp[i]), &byte, INT3_INSN_SIZE);
do_sync++;
}

if (do_sync)
text_poke_sync();
[...]

This must ensure those processor should finish running INT3 exception
handlers because the IPI is handled outside of the INT3 exception.

However, if the I-cache entry servives text_poke() and sync_core(), this
problem may happen.
The text_poke() flushes TLB but for the local (!global) PTE, and sync_core()
just serialize (!= cache flushing?). Thus the other CPUs can still see the
INT3 after text_poke_sync()? If so, on such CPU, removed INT3 is still
alive on the I-cache and hit it after text_poke_sync().
This will be a ghost INT3...


>
> > > I initially reported the issue on netdev [1], because the CI always got
> > > the panic when doing some pings between different netns, not using MPTCP
> > > yet. Eric Dumazet mentioned that it looks like it is an x86 issue, maybe
> > > with the jump labels. Since then, I tried to 'git bisect' the issue on
> > > my side, but it was not easy: hard to reproduce and unclear to me what
> > > is causing it.
> >
> > It could possibly be due to jump_labels/static_branch as they use int3
> > as well.
>
> Yeah, it seems like. Does jump_labels/static_branch wait until all interrupts
> exit before removing their object from the "active list"?
>
> kprobes does this but I found it might be *not enough*.
> When removing a kprobe, we do
>
> 1. disarm int3
> 2. remove kprobe from hlist ("active list") by hlist_del_rcu()
> 3. wait for rcu
> 4. free kprobe's trampoline
>
> The possible scenario is
>
> CPU0 CPU1
> 0. hit int3
> 1. disarm int3
> 2. remove kprobe from hlist
> 2.1 run do_int3()
> 2.2 kprobe_int3_handler() failed (*)
> 2.3 notify_die() failed
> 2.4 kernel panic
> 3. wait for rcu
> 4. free kprobe's trampoline
>
> (*) because the corresponding kprobe is already removed from the
> active list.
>
> Thus exc_int3() needs a check whether the int3 is already removed or not
> before it decides that int3 is a stray int3 (== returning false).
>
> Or, another possible solution is to add another synchronize_rcu()
> or sync_core() right after disarming int3 so that we ensure all
> int3 handler at that point are finished.

So this another solution is already done. I think we need to add the
ghost INT3 check in the exc_int3() as follows;

Thank you,