Re: [PATCH] x86/split_lock: fix delayed detection enabling

From: Guilherme G. Piccoli
Date: Sun Mar 31 2024 - 13:25:12 EST

On 21/03/2024 16:55, Maksim Davydov wrote:
> If the warn mode with disabled mitigation mode is used, then on each cpu
> where the split lock occurred detection will be disabled in order to make
> progress and delayed work will be scheduled, which then will enable
> detection back. Now it turns out that all CPUs use one global delayed
> work structure. This leads to the fact that if a split lock occurs on
> several CPUs at the same time (within 2 jiffies), only one cpu will
> schedule delayed work, but the rest will not. The return value of
> schedule_delayed_work_on() would have shown this, but it is not checked
> in the code
> In order to fix the warn mode with disabled mitigation mode, delayed work
> has to be a per-cpu.
>
> Fixes: 727209376f49 ("x86/split_lock: Add sysctl to control the misery mode")

Thanks Maksim! I confess I (think I) understand the theory behind the
possible problem, but I'm not seeing how it happens - probably just me
being silly, but can you help me to understand it clearly?

Let's say we have 2 CPUs, CPU0 and CPU1 and we're running with
sld_mitigate = 0, meaning we don't have "the misery".

If the code running in CPU0 reaches split_lock_warn(), my understanding
is that it warns the user, schedules the sld reenable [via
schedule_delayed_work_on()] and disables the feature with
sld_update_msr(false), correct? So, does this disabling happen only at
core level, or does it disable for the whole CPU including all cores?

But back to our example, if CPU1 detects the split lock, it'll run the
same procedure as CPU0 did - so are you saying we have a race there if
CPU1 faces a split lock before CPU0 has disabled the MSR?

Maybe a clearer example of the issue would even be helpful in the
commit message, showing the path both CPUs would take and how the
problem happens exactly.
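
To make the question concrete, here is how I read the scenario from the
commit message, written as a toy userspace model in C. Nothing below is
the actual kernel code: every toy_* name is made up for illustration, the
per-CPU detection state is just a bool array, and the only behaviour
borrowed from the kernel is that schedule_delayed_work_on() returns false
when the work item is already pending. Treat it as a sketch of my
understanding, which may well be wrong.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 2

/* Hypothetical stand-in for the per-CPU split lock detection state. */
static bool toy_detect_enabled[TOY_NR_CPUS] = { true, true };

/* Hypothetical stand-in for the single global delayed work item. */
static struct {
	bool pending;
	int cpu;	/* CPU the pending work was queued on */
} toy_reenable_work;

/* Like schedule_delayed_work_on(): returns false if already pending. */
static bool toy_schedule_delayed_work_on(int cpu)
{
	if (toy_reenable_work.pending)
		return false;
	toy_reenable_work.pending = true;
	toy_reenable_work.cpu = cpu;
	return true;
}

/* Model of the warn path: queue the reenable, then disable detection. */
static void toy_split_lock_warn(int cpu)
{
	(void)toy_schedule_delayed_work_on(cpu);	/* return value ignored */
	toy_detect_enabled[cpu] = false;
}

/* The delayed work fires: reenable only on the CPU it was queued on. */
static void toy_run_reenable_work(void)
{
	assert(toy_reenable_work.pending);
	toy_detect_enabled[toy_reenable_work.cpu] = true;
	toy_reenable_work.pending = false;
}

/* Returns how many CPUs are left with detection disabled. */
int toy_scenario(void)
{
	int cpu, stuck = 0;

	toy_split_lock_warn(0);		/* CPU0 hits a split lock */
	toy_split_lock_warn(1);		/* CPU1 hits one before the work runs */
	toy_run_reenable_work();	/* the one queued work item fires */

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (!toy_detect_enabled[cpu])
			stuck++;
	return stuck;
}
```

In this model, calling toy_scenario() leaves CPU1 with detection disabled
because its schedule attempt was silently dropped, and a per-CPU work item
would avoid that. If this is the path you mean, something along these
lines in the commit message would help a lot.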

Thanks in advance,

Guilherme
