[BUG] unable to handle page fault for address

From: Paul E. McKenney
Date: Wed Jul 19 2023 - 13:28:46 EST


Hello!

Just FYI for the moment, I hit a BUG in 10-hour overnight qemu/KVM-based
rcutorture testing of the TASKS01, TASKS03, TREE03, and TREE07 scenarios
on the mainline commit:

ccff6d117d8d ("Merge tag 'perf-tools-fixes-for-v6.5-1-2023-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools")

This commit was merged with the -rcu "dev" branch, which just yesterday
passed overnight tests when based on v6.5-rc1. A web search got me the
following, which is not a very close match, but the best match there is:

https://www.spinics.net/lists/bpf/msg91848.html

Here is a representative splat out of 29 of them:

[22544.494701] BUG: unable to handle page fault for address: fffffe0000af0038
[22544.496012] #PF: supervisor read access in user mode
[22544.496985] #PF: error_code(0x0000) - not-present page
[22544.497970] IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe355b3fd000 (limit=0x7f)
[22544.499479] LDTR: NULL
[22544.499959] TR: 0x40 -- base=0xfffffe355b3ff000 limit=0x4087
[22544.501073] PGD 1ffd2067 P4D 1ffd2067 PUD 1124067 PMD 0
[22544.502149] Oops: 0000 [#1] PREEMPT SMP PTI
[22544.502967] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-rc2-00128-g68052f2f9e35 #5844
[22544.504435] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[22544.506581] RIP: 0033:0x400181
[22544.507256] Code: 00 45 31 d2 4c 8d 44 24 f0 48 c7 44 24 f8 00 00 00 00 0f 05 b8 60 00 00 00 48 8d 7c 24 e0 0f 05 41 89 c0 85 c0 75 c6 44 89 c0 <89> c2 0f af d0 ff c0 48 63 d2 48 89 15 06 01 20 00 3d a0 86 01 00
[22544.510954] RSP: 002b:00007ffde5192288 EFLAGS: 00010283
[22544.511963] RAX: 000000000000ebe9 RBX: 0000000000000000 RCX: 00000000004001a7
[22544.513386] RDX: ffffffffd963c240 RSI: 0000000000000000 RDI: 00007ffde5192258
[22544.514751] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[22544.516079] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22544.517452] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[22544.518818] FS:  0000000000000000 GS:  0000000000000000
[22544.519864] Modules linked in:
[22544.520653] CR2: fffffe0000af0038
[22544.521401] ---[ end trace 0000000000000000 ]---
[22544.522297] RIP: 0033:0x400181
[22544.522887] RSP: 002b:00007ffde5192288 EFLAGS: 00010283
[22544.523887] RAX: 000000000000ebe9 RBX: 0000000000000000 RCX: 00000000004001a7
[22544.525257] RDX: ffffffffd963c240 RSI: 0000000000000000 RDI: 00007ffde5192258
[22544.526623] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[22544.528016] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22544.529439] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[22544.530909] FS:  0000000000000000(0000) GS:ffff8d101f400000(0000) knlGS:0000000000000000
[22544.532564] CS:  0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[22544.533760] CR2: fffffe0000af0038 CR3: 0000000002d8e000 CR4: 00000000000006f0
[22544.535286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[22544.536738] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[22544.538228] note: init[1] exited with irqs disabled
[22544.540087] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[22544.544806] Kernel Offset: 0x9600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

The failure rates for each scenario are as follows:

TASKS01: 13 of 15
TASKS03: 7 of 15
TREE03: 4 of 15
TREE07: 5 of 15

In other words, for TASKS01, there were 15 10-hour runs, and 13 of them
hit this bug.

Except that these failures were correlated:

o TASKS01 and TASKS03 failures are in batch 17 running on
kerneltest053.05.atn6.

o TREE03 and one of the TREE07 failures are in batch 4 running
on kerneltest025.05.atn6.

o Five of the TREE07 failures are in batch 6 running on
kerneltest010.05.atn6.

All of the failures in a given batch happened at about the same time,
suggesting a reaction to some stimulus from the host system.  There
was no console output from any of these host systems during the test,
other than the usual complaint about qemu having an executable stack.

Now, perhaps this is a hardware or configuration failure on those three
systems. But I have been using them for quite some time, so why would
all three fail at the same time?

Bisection will not be easy given that there were only three actual
failures in ten hours over 20 batches. If this is in fact a random
kernel failure, there would be almost a 4% chance of a false positive
on each bisection step, which gives about a 50% chance of bisecting to
the wrong place if the usual 16 steps are required.
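As a rough sketch of the arithmetic above (taking the quoted ~4% per-step
false-positive chance and 16 bisection steps as given):

```python
# Per-step chance that a random failure fingers a "good" commit (from the
# estimate above), and the usual number of bisection steps.
p_false_positive = 0.04
steps = 16

# Probability that at least one of the 16 steps misfires, which would send
# the bisection to the wrong commit.
p_wrong = 1 - (1 - p_false_positive) ** steps
print(f"{p_wrong:.0%}")  # prints "48%", in other words, about a coin flip
```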

I will give this another try this evening, Pacific Time. In the
meantime, FYI.

Thanx, Paul