Re: [PATCH v2] acpi/ghes: Prevent sleeping with spinlock held

From: Ira Weiny
Date: Wed Feb 14 2024 - 11:41:15 EST


Jonathan Cameron wrote:
> On Tue, 06 Feb 2024 14:15:32 -0800
> Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
>
> > Smatch caught that cxl_cper_post_event() is called with a spinlock held
> > or preemption disabled.[1] The callback takes the device lock to
> > perform address translation and therefore might sleep. The record data
> > is released back to BIOS in ghes_clear_estatus() which requires it to be
> > copied for use in the workqueue.
> >
> > Copy the record to a lockless list and schedule a work item to process
> > the record outside of atomic context.
> >
> > [1] https://lore.kernel.org/all/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain/
> >
> > Reported-by: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
>
> +CC tracing folk for the splat below (the second one as first one is fixed!)
>
> Lock debugging is slow (on an emulated machine) :(
> Testing with my gitlab.com/jic23/qemu cxl-2024-02-05-draft branch
> which is only one I've put out with the FW first injection patches so far
>
> For reference without this patch I got a nice spat identifying the original bug.
> So far so good.
>
> With this patch (and tp_printk in command line and trace points enabled)
> [ 193.048229] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 193.049636] {1}[Hardware Error]: event severity: recoverable
> [ 193.050472] {1}[Hardware Error]: Error 0, type: recoverable
> [ 193.051337] {1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> [ 193.052270] {1}[Hardware Error]: section length: 0x90
> [ 193.053402] {1}[Hardware Error]: 00000000: 00000090 00000007 00000000 0d938086 ................
> [ 193.055036] {1}[Hardware Error]: 00000010: 000e0000 00000000 00000005 00000000 ................
> [ 193.058592] {1}[Hardware Error]: 00000020: 00000180 00000000 1626fa24 17b3b158 ........$.&.X...
> [ 193.062289] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
> [ 193.065959] {1}[Hardware Error]: 00000040: 000007d0 00000000 0fc00307 05210300 ..............!.
> [ 193.069782] {1}[Hardware Error]: 00000050: 72690000 6d207361 00326d65 00000000 ..iras mem2.....
> [ 193.072760] {1}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
> [ 193.074062] {1}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
> [ 193.075346] {1}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
>
> I rebased after this so now we get the smaller print - but that's not really relevant here.
>
> [ 193.086589] cxl_general_media: memdev=mem1 host=0000:0e:00.0 serial=5 log=Warning : time=1707903675590441508 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=128 flags='0x1' handle=0 related_handle=0 maint_op_class=0 : dpa=7c0 dpa_flags='0x10' descriptor='UNCORRECTABLE_EVENT|THRESHOLD_EVENT|POISON_LIST_OVERFLOW' type='0x3' transaction_type='0xc0' channel=3 rank=33 device=5 comp_id=69 72 61 73 20 6d 65 6d 32 00 00 00 00 00 00 00 validity_flags='CHANNEL|RANK|DEVICE|COMPONENT'
> [ 193.087181]
> [ 193.087361] =============================
> [ 193.087399] [ BUG: Invalid wait context ]
> [ 193.087504] 6.8.0-rc3 #1200 Not tainted
> [ 193.087660] -----------------------------
> [ 193.087858] kworker/3:0/31 is trying to lock:
> [ 193.087966] ffff0000c0ce1898 (&port_lock_key){-.-.}-{3:3}, at: pl011_console_write+0x148/0x1c8
> [ 193.089754] other info that might help us debug this:
> [ 193.089820] context-{5:5}[ 193.089900] 8 locks held by kworker/3:0/31:
> [ 193.089990] #0: ffff0000c0018738 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x154/0x500
> [ 193.090439] #1: ffff800083793de0 (cxl_cper_work){+.+.}-{0:0}, at: process_one_work+0x154/0x500
> [ 193.090718] #2: ffff800082883310 (cxl_cper_rw_sem){++++}-{4:4}, at: cxl_cper_work_fn+0x2c/0xb0
> [ 193.091019] #3: ffff0000c0a7b1a8 (&dev->mutex){....}-{4:4}, at: pci_dev_lock+0x28/0x48
> [ 193.091431] #4: ffff800082738f18 (tracepoint_iter_lock){....}-{2:2}, at: trace_event_buffer_commit+0xd8/0x2c8
> [ 193.091772] #5: ffff8000826b3ce8 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x124/0x398
> [ 193.092031] #6: ffff8000826b3d40 (console_srcu){....}-{0:0}, at: console_flush_all+0x88/0x4b0
> [ 193.092363] #7: ffff8000826b3ef8 (console_owner){....}-{0:0}, at: console_flush_all+0x1bc/0x4b0
> [ 193.092799] stack backtrace:
> [ 193.092973] CPU: 3 PID: 31 Comm: kworker/3:0 Not tainted 6.8.0-rc3 #1200
> [ 193.093118] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown unknown
> [ 193.093468] Workqueue: events cxl_cper_work_fn
> [ 193.093782] Call trace:
> [ 193.094004] dump_backtrace+0xa4/0x130
> [ 193.094145] show_stack+0x20/0x38
> [ 193.094231] dump_stack_lvl+0x60/0xb0
> [ 193.094315] dump_stack+0x18/0x28
> [ 193.094395] __lock_acquire+0x9e8/0x1ee8
> [ 193.094477] lock_acquire+0x118/0x2e8
> [ 193.094557] _raw_spin_lock+0x50/0x70
> [ 193.094820] pl011_console_write+0x148/0x1c8
> [ 193.094904] console_flush_all+0x218/0x4b0
> [ 193.094985] console_unlock+0x74/0x140
> [ 193.095066] vprintk_emit+0x230/0x398
> [ 193.095146] vprintk_default+0x40/0x58
> [ 193.095226] vprintk+0x98/0xb0
> [ 193.095306] _printk+0x64/0x98
> [ 193.095385] trace_event_buffer_commit+0x138/0x2c8
> [ 193.095467] trace_event_raw_event_cxl_general_media+0x1a8/0x280 [cxl_core]
> [ 193.096191] __traceiter_cxl_general_media+0x50/0x78 [cxl_core]
> [ 193.096359] cxl_event_trace_record+0x204/0x2d0 [cxl_core]
> [ 193.096520] cxl_cper_event_call+0xb0/0xe0 [cxl_pci]
>
> The rw_sem is held to protect the callback pointer.
>
> [ 193.096713] cxl_cper_work_fn+0x7c/0xb0
> [ 193.096808] process_one_work+0x1f4/0x500
> [ 193.096891] worker_thread+0x1f0/0x3f0
> [ 193.096971] kthread+0x110/0x120
> [ 193.097052] ret_from_fork+0x10/0x20
>
> So I'm not sure how to fix this or if we even want to.
>
> We could try and release locks before calling the tracepoint but that looks
> very fiddly and would require ghes.c to be able to see more of the
> CXL tracepoint infrastructure which isn't great.
>
> Just because I was feeling cheeky I did a quick test with following.
> I have a sneaky suspicion this won't got down well even though it 'fixes'
> our issue... My google fu / lore search terms are failing to find
> much previous discussion of this.
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 9ff8a439d674..7ee45f22f56f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2957,7 +2957,7 @@ static void output_printk(struct trace_event_buffer *fbuffer)
> iter->ent = fbuffer->entry;
> event_call->event.funcs->trace(iter, 0, event);
> trace_seq_putc(&iter->seq, 0);
> - printk("%s", iter->seq.buffer);
> + printk_deferred("%s", iter->seq.buffer);
>
> raw_spin_unlock_irqrestore(&tracepoint_iter_lock, flags);
> }
>
> My assumption is similar views will apply here to Peter Zijlstra's comment on
> not using printk_deferred.
>
> https://lore.kernel.org/all/20231010141244.GM377@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Note I also tried the hacks Peter links to from there. They trip issues in the initial
> CPER print - so I assume not appropriate here.
>
> So I'm thinking this is a won't fix - wait for the printk rework to land and
> assume this will be resolved as well?
>

Or could we avoid the situation entirely by using a static call?

The work queue still needs to be created because of the atomicness of
ghes_do_proc() but it avoids cxl_cper_rw_sem.

I think the hardest thing may be writing the commit message to explain all
this. :-(

Ira