Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context

From: Borislav Petkov
Date: Thu Jun 11 2015 - 04:28:10 EST


On Mon, Jun 08, 2015 at 10:26:58PM +0200, Borislav Petkov wrote:
> On Mon, Jun 08, 2015 at 08:03:08PM +0000, Luck, Tony wrote:
> > @@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
> > /* Emit the trace record: */
> > trace_mce_record(mce);
> >
> > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
> > + mce_genpool_add(mce);
> > + irq_work_queue(&mce_irq_work);
> >
> > Is it safe to call irq_work_queue() from MCE context?
>
> Yeah, we're using it in contexts like:

Yeah, so that is safe but there's another issue which I triggered
yesterday. I hope the commit message explains it thoroughly. If you see
holes, please shout. I haven't run this on the EINJ box yet, that's
still outstanding.

In any case, I'm thinking of not rushing those changes to tip yet and
have them simmer in linux-next for a whole round and have them ready for
4.3. Just to be on the safe side and have some more time hammering on
them.

---
From: Borislav Petkov <bp@xxxxxxx>
Date: Wed, 10 Jun 2015 18:49:18 +0200
Subject: [PATCH] x86/mce: Fix logging of early boot machine checks

With moving the irq_work queuing to mce_log() which simplified other
code, we're faced with the problem of logging MCEs during early boot
which might be there from before. In particular, facilities which we
rely on, like workqueues, for example, are not initialized yet. See
splat below.

As a first step, move the workqueue init before the first invocation of
mce_log() in

__mcheck_cpu_init_generic
|-> machine_check_poll
|-> mce_log

But that's not enough because the normal system workqueue on which stuff
gets queued with schedule_work() is not initialized yet either.

Therefore, we still queue the work in the genpool but delay the
scheduling of that work to a late initcall where everything should be
initialized.

While at it, make sure we queue irq_work only if the addition to the
genpool has been successful.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD 0
Oops: 0010 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-rc7+ #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
task: ffffffff8197d580 ti: ffffffff8196c000 task.ti: ffffffff8196c000
RIP: 0010:[<0000000000000000>] [< (null)>] (null)

<registers snipped>

Call Trace:
<IRQ>
? irq_work_run_list
irq_work_run
smp_irq_work_interrupt
irq_work_interrupt
<EOI>
? apic_send_IPI_self
arch_irq_work_raise
irq_work_queue
mce_log
machine_check_poll
__mcheck_cpu_init_generic
mcheck_cpu_init
identify_cpu
identify_boot_cpu
check_bugs
start_kernel
x86_64_start_reservations
x86_64_start_kernel
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff88007b603f40>
CR2: 0000000000000000
---[ end trace 6d6dd507e2636bd4 ]---
Kernel panic - not syncing: Fatal exception in interrupt
---[ end Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Borislav Petkov <bp@xxxxxxx>
---
arch/x86/kernel/cpu/mcheck/mce.c | 33 +++++++++++++++++++++++++--------
1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a93a150c3583..17cef061c24a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -159,8 +159,8 @@ void mce_log(struct mce *mce)
/* Emit the trace record: */
trace_mce_record(mce);

- mce_genpool_add(mce);
- irq_work_queue(&mce_irq_work);
+ if (mce_genpool_add(mce))
+ irq_work_queue(&mce_irq_work);

mce->finished = 0;
wmb();
@@ -480,7 +480,7 @@ int mce_available(struct cpuinfo_x86 *c)

static void mce_schedule_work(void)
{
- if (!mce_genpool_empty())
+ if (!mce_genpool_empty() && keventd_up())
schedule_work(&mce_work);
}

@@ -648,8 +648,9 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
if (m.status & MCI_STATUS_ADDRV) {
m.severity = severity;
m.usable_addr = mce_usable_address(&m);
- mce_genpool_add(&m);
- mce_schedule_work();
+
+ if (mce_genpool_add(&m))
+ mce_schedule_work();
}
}

@@ -1704,13 +1705,14 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
return;
}

+ INIT_WORK(&mce_work, mce_process_work);
+ init_irq_work(&mce_irq_work, mce_irq_work_cb);
+
machine_check_vector = do_machine_check;

__mcheck_cpu_init_generic();
__mcheck_cpu_init_vendor(c);
__mcheck_cpu_init_timer();
- INIT_WORK(&mce_work, mce_process_work);
- init_irq_work(&mce_irq_work, mce_irq_work_cb);
}

/*
@@ -2565,5 +2567,20 @@ static int __init mcheck_debugfs_init(void)

return 0;
}
-late_initcall(mcheck_debugfs_init);
+#else
+static int __init mcheck_debugfs_init(void) {}
#endif
+
+static int __init mcheck_late_init(void)
+{
+ mcheck_debugfs_init();
+
+ /*
+ * Flush out everything that has been logged during early boot, now that
+ * everything has been initialized (workqueues, decoders, ...).
+ */
+ mce_schedule_work();
+
+ return 0;
+}
+late_initcall(mcheck_late_init);
--
2.3.5

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/