Re: [PATCH] dmaengine: idxd: Change wmb() to smp_wmb() when copying completion record to user space

From: Boqun Feng
Date: Tue Jan 30 2024 - 14:55:18 EST


On Tue, Jan 30, 2024 at 05:58:24PM +0000, Mark Rutland wrote:
> This patch might be ok (it looks reasonable as an optimization), but I think
> the description of wmb() and smp_wmb() is incorrect. I also think that you're

Agreed. A wmb() -> smp_wmb() change can only be an optimization rather
than a fix.

> missing an rmb()/smp_rmb() or equivalent on the reader side.
>
> On Mon, Jan 29, 2024 at 06:58:06PM -0800, Fenghua Yu wrote:
> > wmb() is used to ensure status in the completion record is written
> > after the rest of the completion record, making it visible to the user.
> > However, on SMP systems, this may not guarantee visibility across
> > different CPUs.
> >
> > Considering this scenario that event log handler is running on CPU1 while
> > user app is polling completion record (cr) status on CPU2:
> >
> > CPU1                                    CPU2
> > event log handler                       user app
> >
> >                                         1. cr = 0 (status = 0)
> > 2. copy X to user cr except "status"
> > 3. wmb()
> > 4. copy Y to user cr "status"
> >                                         5. poll status value Y
> >                                         6. read rest cr which is still 0.
> >                                            cr handling fails
> >                                         7. cr value X visible now
> >
> > Although wmb() ensure value Y is written and visible after X is written
> > on CPU1, the order is not guaranteed on CPU2. So user app may see status
> > value Y while cr value X is still not visible yet on CPU2. This will
> > cause reading 0 from the rest of cr and cr handling fails.
>
> The wmb() on CPU1 ensures the order of the writes, but you need an rmb() on CPU2
> between reading the 'status' and 'rest' parts; otherwise CPU2 (or the
> compiler!) is permitted to hoist the read of 'rest' early, before reading from
> 'status', and hence you can end up with a sequence that is effectively:
>
> CPU1                                    CPU2
> event log handler                       user app
>
>                                         1. cr = 0 (status = 0)
>                                         6a. read rest cr which is still 0.
> 2. copy X to user cr except "status"
> 3. wmb()
> 4. copy Y to user cr "status"
>                                         5. poll status value Y
>                                         6b. cr handling fails
>                                         7. cr value X visible now
>
> Since this is all to regular cacheable memory, it's *sufficient* to use
> smp_wmb() and smp_rmb(), but that's an optimization rather than an ordering
> fix.
>
> Note that on x86_64, TSO means that the stores are in-order (and so smp_wmb()
> is just a compiler barrier), and IIUC loads are not reordered w.r.t. other
> loads (and so smp_rmb() is also just a compiler barrier).
>
> > Changing wmb() to smp_wmb() ensures Y is written after X on both CPU1
> > and CPU2. This guarantees that user app can consume cr in right order.

A barrier can only provide ordering for memory accesses on the same CPU,
so this doesn't make any sense.
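
To make the pairing concrete, here is a minimal userspace sketch of the
message-passing pattern under discussion. It is not the idxd code: the
struct, the function names and the values are hypothetical, and C11 fences
stand in for the kernel barriers (a release fence is the closest portable
analogue of smp_wmb() and an acquire fence of smp_rmb(); both are somewhat
stronger than their kernel counterparts). The point is only that each
barrier orders accesses on its own CPU, so the writer and the reader each
need one.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a completion record: payload plus status. */
struct fake_cr {
	uint64_t rest;          /* "rest of cr" (value X) */
	_Atomic uint8_t status; /* "status" (value Y), 0 means not ready */
};

static struct fake_cr user_cr;

/* Writer (CPU1): write the payload, barrier, then publish the status. */
static void *writer(void *arg)
{
	(void)arg;
	user_cr.rest = 0xABCD;                      /* step 2: copy X */
	atomic_thread_fence(memory_order_release);  /* stands in for smp_wmb() */
	atomic_store_explicit(&user_cr.status, 1,   /* step 4: copy Y */
			      memory_order_relaxed);
	return NULL;
}

/* Reader (CPU2): poll the status, barrier, then read the payload. */
static void *reader(void *arg)
{
	uint64_t *out = arg;

	while (atomic_load_explicit(&user_cr.status,
				    memory_order_relaxed) == 0)
		;                                   /* step 5: poll status */
	atomic_thread_fence(memory_order_acquire);  /* stands in for smp_rmb() */
	*out = user_cr.rest;                        /* step 6: read rest of cr */
	return NULL;
}

/* Run one writer/reader round and return the payload the reader saw. */
static uint64_t run_once(void)
{
	pthread_t w, r;
	uint64_t seen = 0;

	user_cr.rest = 0;                           /* step 1: cr = 0 */
	atomic_store(&user_cr.status, 0);
	pthread_create(&r, NULL, reader, &seen);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return seen;
}
```

With both fences in place, the reader is guaranteed to see X once it has
seen Y. Dropping the reader-side fence removes that guarantee no matter
which barrier the writer uses.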

>
> This implies that smp_wmb() is *stronger* than wmb(), whereas smp_wmb() is
> actually *weaker* (e.g. on x86_64 wmb() is an sfence, whereas smp_wmb() is a
> barrier()).
>
> Thanks,
> Mark.
>
> >
> > Fixes: b022f59725f0 ("dmaengine: idxd: add idxd_copy_cr() to copy user completion record during page fault handling")
> > Suggested-by: Nikhil Rao <nikhil.rao@xxxxxxxxx>
> > Tested-by: Tony Zhu <tony.zhu@xxxxxxxxx>

Since it has a "Fixes" tag and a "Tested-by" tag, I'd assume there has
been a test w/ and w/o this patch showing it can resolve a real issue
*consistently*? If so, I think x86 might be broken somewhere.

[Cc x86 maintainers]

Regards,
Boqun

> > Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx>
> > ---
> > drivers/dma/idxd/cdev.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
> > index 77f8885cf407..9b7388a23cbe 100644
> > --- a/drivers/dma/idxd/cdev.c
> > +++ b/drivers/dma/idxd/cdev.c
> > @@ -681,9 +681,10 @@ int idxd_copy_cr(struct idxd_wq *wq, ioasid_t pasid, unsigned long addr,
> > * Ensure that the completion record's status field is written
> > * after the rest of the completion record has been written.
> > * This ensures that the user receives the correct completion
> > - * record information once polling for a non-zero status.
> > + * record information on any CPU once polling for a non-zero
> > + * status.
> > */
> > - wmb();
> > + smp_wmb();
> > status = *(u8 *)cr;
> > if (put_user(status, (u8 __user *)addr))
> > left += status_size;
> > --
> > 2.37.1
> >
> >