Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.

From: Richard W.M. Jones
Date: Wed Jul 05 2023 - 12:29:24 EST


On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> We have been having the same sort of problem
> (https://bugzilla.redhat.com/show_bug.cgi?id=2216496). It's another
> of those bugs that requires hundreds or thousands of boot iterations
> before you can see it. There is a test in comment 27 but it requires
> guestfish and some hacking to work. I'll try to come up with a
> cleaner test later.
>
> We see stack traces like:
>
> [ 3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [ 3.082266] clocksource: Switched to clocksource acpi_pm
> [ 3.090201] NET: Registered PF_INET protocol family
> [ 3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
> [ 3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
> [ 3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [ 3.093098] PKRU: 55555554
> [ 3.093098] Call Trace:
> [ 3.093098] <TASK>
> [ 3.093098] ? die+0x31/0x80
> [ 3.093098] ? exc_int3+0x10e/0x120
> [ 3.093098] ? asm_exc_int3+0x39/0x40
> [ 3.093098] ? __mod_timer+0x1c3/0x370
> [ 3.093098] ? __mod_timer+0x1c3/0x370
> [ 3.093098] queue_delayed_work_on+0x23/0x30
> [ 3.093098] neigh_table_init+0x1bb/0x2e0
> [ 3.093098] arp_init+0x12/0x50
> [ 3.093098] inet_init+0x15b/0x2f0
> [ 3.093098] ? __pfx_inet_init+0x10/0x10
> [ 3.093098] do_one_initcall+0x58/0x230
> [ 3.093098] kernel_init_freeable+0x199/0x2d0
> [ 3.093098] ? __pfx_kernel_init+0x10/0x10
> [ 3.093098] kernel_init+0x15/0x1b0
> [ 3.093098] ret_from_fork+0x2c/0x50
> [ 3.093098] </TASK>
> [ 3.093098] Modules linked in:
> [ 3.093098] ---[ end trace 0000000000000000 ]---
> [ 3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [ 3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [ 3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [ 3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [ 3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [ 3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [ 3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [ 3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [ 3.093098] FS: 0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [ 3.093098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [ 3.093098] PKRU: 55555554
> [ 3.093098] Kernel panic - not syncing: Fatal exception in interrupt
>
> There are many variations, but the common pattern seems to be
> <something in the clock or timer code> -> int3 exception
>
> It only happens under qemu TCG (software emulation).
>
> It goes away if we recompile qemu without MTTCG support.
>
> It only happens with -smp enabled, we are using qemu -smp 4
>
> We are using qemu-system-x86_64 full system emulation on x86_64 host
> (ie. forcing KVM off).
>
> It happens with the latest upstream kernel and qemu, compiled from
> source.

I got a bit further on this one.

The bug happens in the code that updates the static branch used for at
least these two keys:

static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);

There are probably others (it seems a generic problem with how static
branches are handled by TCG), but I only see the bug for those two.

When the static branch is updated we end up calling
arch/x86/kernel/alternative.c:text_poke_bp_batch(). It's best to read
the description of that function to see where int3 is used:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c#n2086

I modified the qemu TCG int3 helper so it dumps the code at %rip when
the interrupt fires, and I can actually see the changes in the above
function happen, first int3 being written, then the end of the nop,
then the int3 being overwritten with the first byte of the nop.

Unfortunately the int3 still fires after the code has been completely
rewritten to its final (ie nop) value.

This seems to indicate to me that neither the self-write to the kernel
text segment, nor sync_core (implemented by a "iret to self" trick)
invalidates the qemu TCG translation block containing the old int3
helper call. Thus we (qemu) never "see" the new nop, we keep
emulating int3, and that causes the kernel to crash.

I added print statements inside tb_invalidate_phys_page() and this
function seems never to be called at all. It's my understanding that
at least the kernel writing to its text segment ought to cause
tb_invalidate_phys_page() to be called, but I'm not super-familiar
with this qemu code.

Richard Henderson - do you have any suggestions?

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org