Re: BUG: unable to handle kernel paging request in __switch_to

From: Andy Lutomirski
Date: Thu Dec 14 2017 - 13:55:12 EST


On Thu, Dec 14, 2017 at 10:42 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Dec 14, 2017 at 9:12 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> On Sun, 3 Dec 2017, syzbot wrote:
>>> BUG: unable to handle kernel paging request at fffffffffffffff8
>>> Oops: 0002 [#1] SMP KASAN
>
> System write of a non-existent page.
>
>>> RIP: 0010:switch_fpu_prepare arch/x86/include/asm/fpu/internal.h:535 [inline]
>>> RIP: 0010:__switch_to+0x95b/0x1330 arch/x86/kernel/process_64.c:407
>
> This says it's
>
> old_fpu->last_cpu = cpu;
>
> and the code disassembly ends up looking something like this:
>
>    0:   48 c1 ea 03             shr    $0x3,%rdx
>    4:   0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>    8:   84 c0                   test   %al,%al
>    a:   74 08                   je     0x14
>    c:   3c 03                   cmp    $0x3,%al
>    e:   0f 8e d5 06 00 00       jle    0x6e9
>   14:   8b 85 70 fe ff ff       mov    -0x190(%rbp),%eax
>   1a:   41 89 84 24 c0 15 00    mov    %eax,0x15c0(%r12)
>   21:   00
>   22:*  cc                      int3       <-- trapping instruction
>
> where those preceding two "mov" instructions look like they might indeed be that
>
> old_fpu->last_cpu = cpu;
>
> thing, and the register state doesn't look insane for this.
>
> So I think the RIP->line encoding is slightly off, and that "int3" is
> almost certainly due to the very next thing after the write:
>
> trace_x86_fpu_regs_deactivated(old_fpu);
>
> and that actually makes sense if the test robot is doing some tracing,
> particularly if it's just about to _start_ tracing, and it has
> replaced the first byte of the instruction with 'int3' and is in the
> process of doing the rewrite.
>
> The fact that it then takes a system write fault is because some GDT
> or IDT setup is screwed up. Or possibly the stack is screwed up and
> started out as 0, and then the push to the stack would decrement the
> stack pointer and try to push the error state or something.
>
>> That's the second report I'm staring at today which has CR2
>> fffffffffffffffx and points to a faulting instruction which does not make
>> any sense at all.
>
> That actually does make sense - see above. It just requires that race
> with the instruction rewriting.
>
> *Normally* we never actually take the "int3" exception, because
> normally we'll have completed the rewrite before another CPU actually
> executes the instruction that is being rewritten.
>
> So I'm assuming this is with the page table isolation, and some
> unusual case in exception handling got screwed up.

SDM time. Assuming the CPU actually decoded int3 and tried to execute
it, I can see a couple of possible outcomes:

1. Something's wrong with the IDT and it can't read the vector. I
think this would end up triple-faulting, though.

2. It actually tries to handle the breakpoint. A breakpoint is a
benign exception, so any exception encountered while delivering it
would result in serial delivery. I've never thought that serial
delivery made any sense -- presumably it just cancels the breakpoint
and delivers the other exception. So this *could* be a page fault hit
during delivery of the int3 exception. I don't believe it's a GDT
problem, though, because that would also likely lead to a triple
fault. What I *would* believe is that the IST got messed up and
we're seeing the result of trying to push to the stack with the
initial RSP=0 so the fault hits at address -8.
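
To make the "-8" arithmetic concrete, here is a minimal sketch of why a
push from RSP=0 would show up as CR2 = fffffffffffffff8. push_target()
and self_check() are hypothetical names made up for illustration, not
kernel functions, and this only models the first 8-byte push of the
exception frame:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): the address the CPU
 * writes to when it pushes one 8-byte value in 64-bit mode, starting
 * from stack pointer `rsp`. The stack pointer is decremented first,
 * and unsigned arithmetic wraps modulo 2^64.
 */
static uint64_t push_target(uint64_t rsp)
{
	return rsp - 8;
}

/* Illustration only: a zero RSP wraps to the address in the oops. */
static void self_check(void)
{
	assert(push_target(0) == 0xfffffffffffffff8ULL);
}
```

With rsp = 0 the first write lands at 0xfffffffffffffff8, which matches
the address in the report and is consistent with a write fault on a
not-present page.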

I have no idea how that would happen, though. Especially since int3
from userspace would have exactly the same problem, and we exercise
that code in the selftests.
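
For reference, the "Oops: 0002" value quoted above is the x86 page-fault
error code, and the reading "system write of a non-existent page" falls
out of its low bits. A rough sketch of the decoding follows; struct
pf_info, decode_pf_error() and self_check_pf() are hypothetical names
for illustration, not the kernel's own definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the low bits of the #PF error code. */
struct pf_info {
	bool present;	/* bit 0: 0 = page not present, 1 = protection violation */
	bool write;	/* bit 1: 0 = read access, 1 = write access */
	bool user;	/* bit 2: 0 = supervisor-mode access, 1 = user-mode access */
	bool rsvd;	/* bit 3: reserved bit set in a paging-structure entry */
	bool instr;	/* bit 4: fault was an instruction fetch */
};

/* Hypothetical helper: split an error code into its flag bits. */
static struct pf_info decode_pf_error(uint32_t ec)
{
	struct pf_info info = {
		.present = ec & (1u << 0),
		.write   = ec & (1u << 1),
		.user    = ec & (1u << 2),
		.rsvd    = ec & (1u << 3),
		.instr   = ec & (1u << 4),
	};
	return info;
}

/* Illustration only: 0x0002 is a supervisor write to a not-present page. */
static void self_check_pf(void)
{
	struct pf_info info = decode_pf_error(0x0002);

	assert(info.write && !info.present && !info.user);
}
```

Only bit 1 is set in 0x0002, so the access was a write, in supervisor
mode, to a page that was not present.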