Re: LKMM: Read dependencies of writes ordered by dma_wmb()?

From: Paul E. McKenney
Date: Mon Aug 16 2021 - 16:51:03 EST


On Mon, Aug 16, 2021 at 03:21:09PM -0400, Alan Stern wrote:
> On Mon, Aug 16, 2021 at 07:23:51PM +0200, Marco Elver wrote:
> > On Mon, Aug 16, 2021 at 10:59AM -0400, Alan Stern wrote:
> > [...]
> > > > One caveat is that the case I'm trying to understand involves not just 2
> > > > CPUs but also a device. And for now, I'm assuming that dma_wmb() is as
> > > > strong as smp_wmb() also wrt other CPUs (but my guess is this
> > > > assumption is already too strong).
> > >
> > > I'm not sure that is right. dma_wmb affects the visibility of writes to
> > > a DMA buffer from the point of view of the device, not necessarily from
> > > the point of view of other CPUs. At least, there doesn't seem to be any
> > > claim in memory-barriers.txt that it does so.
> >
> > Thanks, I thought so.
> >
> > While I could just not instrument dma_*mb() at all, because KCSAN
> > obviously can't instrument what devices do, I wonder if the resulting
> > reports are at all interesting.
> >
> > For example, if I do not make the assumption that dma_wmb==smp_wmb, and
> > don't instrument dma_*mb() at all, I also get racy UAF reordered writes:
> > I could imagine some architecture where dma_wmb() propagates the write
> > to devices from CPU 0; but CPU 1 then does the kfree(), reallocates,
> > reuses the data, but then gets its data overwritten by CPU 0.
>
> Access ordering of devices is difficult to describe. How do you tell a
> memory model (either a theoretical one or one embedded in code like
> KCSAN) that a particular interrupt handler routine can't be called until
> after a particular write has enabled the device to generate an IRQ?
>
> In the case you mention, how do you tell the memory model that the code
> on CPU 1 can't run until after CPU 0 has executed a particular write, one
> which is forced by some memory barrier to occur _after_ all the potential
> overwrites it's worried about?

What Alan said on the difficulty!

However, KCSAN has the advantage of not needing to specify the outcomes,
which is much of the complexity. For LKMM to do a good job of handling
devices, we would need a model of each device(!).

> > What would be more useful?
> >
> > 1. Let the architecture decide how they want KCSAN to instrument non-smp
> > barriers, given it's underspecified. This means KCSAN would report
> > different races on different architectures, but keep the noise down.
> >
> > 2. Assume the weakest possible model, where non-smp barriers just do
> > nothing wrt other CPUs.
>
> I don't think either of those would work out very well. The problem
> isn't how you handle the non-smp barriers; the problem is how you
> describe to the memory model the way devices behave.

There are some architecture-independent ordering guarantees for MMIO
which go something like this:

0. MMIO readX() and writeX() accesses to the same device are
implicitly ordered, whether relaxed or not.

1. Locking partitions non-relaxed MMIO accesses in the manner that
you would expect. For example, if CPU 0 does an MMIO write,
then releases a lock, and later CPU 1 acquires that same lock and
does an MMIO read, CPU 0's MMIO write is guaranteed to happen
before CPU 1's MMIO read (sketched below). PowerPC has to jump
through a few hoops to make this happen.

Relaxed MMIO accesses such as readb_relaxed() can be reordered
with locking primitives on some architectures.

2. smp_*() memory barriers are not guaranteed to affect MMIO
accesses, especially not in kernels built with CONFIG_SMP=n.

3. The mb() memory barrier is required to order prior MMIO
accesses against subsequent MMIO accesses. The wmb() and rmb()
memory barriers are required to order prior MMIO writes/reads
against later MMIO writes/reads, respectively.
These memory barriers also order normal memory accesses in
the same way as their smp_*() counterparts.

4. The mmiowb() memory barrier can be slightly weaker than wmb(),
as it is in ia64, but I have lost track of the details.

5. The dma_mb(), dma_rmb(), and dma_wmb() barriers appear to be specific
to ARMv8.

6. Non-relaxed MMIO writeX() accesses force ordering of prior
normal memory writes before any DMA initiated by the writeX().

7. Non-relaxed MMIO readX() accesses force ordering of later
normal memory reads after any DMA whose completion is reported
by the readX(). These readX() accesses are also ordered before
any subsequent delay loops. (A sketch of #6 and #7 follows this
list.)
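
To make #6 and #7 concrete, here is a minimal sketch of the usual
driver pattern.  The register offsets, bit names, and structures are
invented; only the ordering claims come from the list above:

#include <linux/io.h>
#include <linux/types.h>

#define REG_DOORBELL    0x00
#define REG_STATUS      0x04
#define STATUS_DONE     0x1

struct fake_dev {
        void __iomem *regs;
        u32 *desc;              /* descriptor in coherent memory */
        u32 *rx_buf;            /* buffer the device DMAs into */
};

static void fake_start_dma(struct fake_dev *d, u32 len)
{
        d->desc[0] = len;       /* plain writes setting up the DMA */
        d->desc[1] = 0x1;       /* "go" bit for the device */

        /*
         * #6: the non-relaxed writel() orders the descriptor writes
         * above before any DMA kicked off by this doorbell write.
         * With writel_relaxed(), an explicit wmb() would be needed
         * first.
         */
        writel(1, d->regs + REG_DOORBELL);
}

static bool fake_poll_done(struct fake_dev *d, u32 *out)
{
        /*
         * #7: the non-relaxed readl() orders this status read before
         * the subsequent read of the memory the device DMA'd into.
         */
        if (!(readl(d->regs + REG_STATUS) & STATUS_DONE))
                return false;

        *out = d->rx_buf[0];    /* safe: ordered after the readl() */
        return true;
}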

Some more detail is available in memory-barriers.txt and in this LWN
article: https://lwn.net/Articles/698014/

I wish I could promise you that these are both fully up to date, but
it is almost certain that updates are needed.
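
And while I am at it, a similarly made-up sketch of the lock-based
ordering in #1:

#include <linux/io.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static DEFINE_SPINLOCK(fake_lock);

/* CPU 0 */
static void cpu0_kick(void __iomem *regs)
{
        spin_lock(&fake_lock);
        writel(1, regs + 0x0);          /* non-relaxed MMIO write */
        spin_unlock(&fake_lock);
}

/* CPU 1, acquiring the lock after CPU 0 has released it. */
static u32 cpu1_check(void __iomem *regs)
{
        u32 val;

        spin_lock(&fake_lock);
        /* Per #1, CPU 0's writel() reaches the device before this readl(). */
        val = readl(regs + 0x4);
        spin_unlock(&fake_lock);
        return val;
}

As with the previous sketch, substituting the _relaxed() accessors
would void these guarantees on some architectures.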

> ...
>
> > > > In practice, my guess is no compiler and architecture combination would
> > > > allow this today; or is there an arch where it could?
> > >
> > > Probably not; reordering of reads tends to take place over time
> > > scales a lot shorter than lengthy I/O operations.
> >
> > Which might be an argument to make KCSAN's non-smp barrier
> > instrumentation arch-dependent, because some drivers might in fact be
> > written with some target architectures and their properties in mind. At
> > least it would help keep the noise down, and those architectures that
> > want to see such races certainly still could.
> >
> > Any preferences?
>
> I'm not a good person to ask; I have never used KCSAN. However...
>
> While some drivers are indeed written for particular architectures or
> systems, I doubt that they rely very heavily on the special properties of
> their target architectures/systems to avoid races. Rather, they rely on
> the hardware to behave correctly, just as non-arch-specific drivers do.
>
> Furthermore, the kernel tries pretty hard to factor out arch-specific
> synchronization mechanisms and related concepts into general-purpose
> abstractions (in the way that smp_mb() is generally available but is
> defined differently for different architectures, for example). Drivers
> tend to rely on these abstractions rather than on the arch-specific
> properties directly.
>
> In short, trying to make KCSAN's handling of device I/O into something
> arch-specific doesn't seem (to me) like a particularly advantageous
> approach. Other people are likely to have different opinions.

No preconceived notions here, at least not on this topic. ;-)

Thanx, Paul