RE: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

From: Luck, Tony
Date: Thu Nov 13 2014 - 13:46:43 EST


> printk seems to work just fine in do_machine_check. Any chance you
> can instrument, for each cpu, all entries to do_machine_check, all
> calls to do_machine_check, all returns, and everything that tries to
> do memory_failure?

I first added a printk() just for the cpu that calls do_machine_check():

printk("MCE: regs = %p\n", regs);

to see if something went wonky when jumping to the kernel stack.
But that printed the same value every time (the same process is consuming
and recovering from errors). This run may have taken longer to hit the
problem case: I got to 1500ish errors instead of just 400 in the previous
two tests. But I'm not sure that is a significant change.

Then I added a printk() for every entry/return on all cpus. This just locked
up on the third injection. The serial console looked to have stopped printing
after the first, so I put bigger delays into my test program between injection
and consumption, and before looping around for the next cycle, to give
time for all the messages to print (4-socket HSW-EX ... there are a lot of
cpus printing messages). But now it is taking a lot longer to get through
injection/consumption iterations. At 226 now and counting.
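
For reference, a minimal sketch of that kind of per-cpu entry/return
instrumentation, written as a userspace C mock so it can be compiled
outside the kernel. The names trace_mce_entry(), trace_mce_return(),
mce_cpus_in_handler() and NR_CPUS_SKETCH are hypothetical and are not
taken from the actual debug patch. In the kernel the printf() calls
would be printk() and the cpu number would come from smp_processor_id().

```c
#include <stdio.h>

#define NR_CPUS_SKETCH 4	/* hypothetical; the real box has far more cpus */

/* Per-cpu counters of handler entries and returns (assumed layout). */
static unsigned int mce_entries[NR_CPUS_SKETCH];
static unsigned int mce_returns[NR_CPUS_SKETCH];

/* Would be called first thing in the handler on each cpu. */
static void trace_mce_entry(int cpu)
{
	mce_entries[cpu]++;
	printf("MCE: cpu %d entry #%u\n", cpu, mce_entries[cpu]);
}

/* Would be called on every return path out of the handler. */
static void trace_mce_return(int cpu)
{
	mce_returns[cpu]++;
	printf("MCE: cpu %d return #%u\n", cpu, mce_returns[cpu]);
}

/* Number of cpus that have entered the handler but not yet returned. */
static int mce_cpus_in_handler(void)
{
	int cpu, stuck = 0;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (mce_entries[cpu] != mce_returns[cpu])
			stuck++;
	return stuck;
}
```

Pairing each entry with a return per cpu is what would show which cpu
went in and never came back out when the machine locks up.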

> Also, shouldn't there be a local_irq_enable before memory_failure and
> a local_irq_disable after it? It wouldn't surprise me if you've
> deadlocked somewhere. Lockdep could also have something interesting
> to say.

Added enable/disable.
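
A minimal sketch of the ordering being discussed, as a userspace C mock
rather than the actual change. The stubs below only stand in for the
kernel's local_irq_enable(), local_irq_disable() and memory_failure();
recover_page(), the two flag variables and the argument values are
hypothetical, and the three-argument memory_failure() stub is an
assumption about the signature, not a copy of the kernel code.

```c
#include <stdbool.h>

/* Userspace stand-ins so the ordering can be compiled and checked. */
static bool irqs_on;		/* models the local interrupt flag */
static bool saw_irqs_on;	/* what the memory_failure() stub observed */

static void local_irq_enable(void)  { irqs_on = true; }
static void local_irq_disable(void) { irqs_on = false; }

/* Stub: records the interrupt state it was called with, nothing more. */
static int memory_failure(unsigned long pfn, int trapno, int flags)
{
	(void)pfn;
	(void)trapno;
	(void)flags;
	saw_irqs_on = irqs_on;
	return 0;
}

/* Hypothetical helper: run recovery with interrupts enabled around it. */
static int recover_page(unsigned long pfn)
{
	int ret;

	local_irq_enable();
	ret = memory_failure(pfn, 0, 0);	/* placeholder trapno/flags */
	local_irq_disable();
	return ret;
}
```

The point is only that memory_failure() runs with interrupts on and the
caller gets them back off afterwards.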

> should still be deliverable. Is it possible that we really need an
> IRET to unmask NMIs? This seems unlikely.)

If that were the problem, wouldn't we fail on iteration 2, instead of
after 400+?

-Tony