Re: Memory barrier needed with wake_up_process()?

From: Peter Zijlstra
Date: Tue Sep 06 2016 - 07:37:04 EST


On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
>
> > > Actually, seeing it written out like this, one realizes that it really
> > > ought to be:
> > >
> > >	CPU 0				CPU 1
> > >	-----				-----
> > >	Start DMA			Handle DMA-complete irq
> > >	Sleep until bh->state		smp_mb()
> > >					set bh->state
> > >					Wake up CPU 0
> > >	smp_mb()
> > >	Compute rc based on contents of the DMA buffer
> > >
> > > (Bear in mind also that on some platforms, the I/O operation is carried
> > > out by PIO rather than DMA.)

> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
> > It's only defined to do CPU/CPU interactions.
>
> Suppose the DMA master finishes filling up an input buffer and issues a
> completion irq to CPU1. Presumably the data would then be visible to
> CPU1 if the interrupt handler looked at it. So if CPU1 executes smp_mb
> before setting bh->state and waking up CPU0, and if CPU0 executes
> smp_mb after testing bh->state and before reading the data buffer,
> wouldn't CPU0 then see the correct data in the buffer? Even if CPU0
> never did go to sleep?

A couple of notes here. I would expect the DMA master to make its stores
_globally_ visible on 'completion', because I'm not sure our smp_mb()
would help make them globally visible: it's only defined for CPU/CPU
interactions, not external agents.

Note that ARM, for example, makes the distinction explicit: smp_mb() uses
a DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST}, and mb() uses DSB-SY.

The ISH domain is Inner-SHareable (IIRC) and only includes CPUs. OSH is
Outer-SHareable and adds external agents like DMA (it still includes the
CPUs). The DSB-SY thing is heavier still and synchronizes the whole
system; I always forget the exact details.
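
From memory, the arm64 mapping looks something like this (paraphrasing
arch/arm64/include/asm/barrier.h; exact macro spellings may differ by
kernel version):

	/* arch/arm64/include/asm/barrier.h, roughly */
	#define mb()		dsb(sy)		/* full-system sync */
	#define dma_rmb()	dmb(oshld)	/* loads, outer-shareable domain */
	#define dma_wmb()	dmb(oshst)	/* stores, outer-shareable domain */
	#define __smp_mb()	dmb(ish)	/* inner-shareable: CPUs only */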

> Or would something more be needed?

The thing is, this is x86 (TSO). Almost everything is globally
ordered/visible, and most barriers are no-ops.

The only reordering TSO allows is that a later read can happen before a
prior store. That is, only:

	X = 1;
	smp_mb();
	r = Y;

actually needs a full barrier. All the other variants are no-ops.
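
That store-then-load case is the classic store-buffer litmus test;
sketched out (names made up, X and Y both initially zero):

	CPU 0				CPU 1
	-----				-----
	WRITE_ONCE(X, 1);		WRITE_ONCE(Y, 1);
	smp_mb();			smp_mb();
	r0 = READ_ONCE(Y);		r1 = READ_ONCE(X);

Without the two smp_mb()s, TSO allows r0 == 0 && r1 == 0, because each
store can sit in its CPU's store buffer while the subsequent load goes
to memory; with the barriers, at least one CPU must observe the other's
store.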

Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()),
which seems to imply DMA is coherent and globally visible.
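
Again from memory, the x86 side is roughly (arch/x86/include/asm/barrier.h,
modulo exact names and runtime alternatives):

	/* arch/x86/include/asm/barrier.h, roughly */
	#define mb()		asm volatile("mfence" ::: "memory")
	#define rmb()		asm volatile("lfence" ::: "memory")
	#define wmb()		asm volatile("sfence" ::: "memory")
	#define dma_rmb()	barrier()	/* compiler barrier only */
	#define dma_wmb()	barrier()
	#define __smp_rmb()	barrier()
	#define __smp_wmb()	barrier()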

> > I would very much expect the device IO stuff to order things for us in
> > this case. "Start DMA" should very much include sufficient fences to
> > ensure the data under DMA is visible to the DMA engine, this would very
> > much include things like flushing store buffers and maybe even writeback
> > caches, depending on platform needs.
> >
> > At the same time, I would expect "Handle DMA-complete irq", even if it
> > were done with a PIO polling loop, to guarantee ordering against later
> > operations such that 'complete' really means that.
>
> That's what I would expect too.
>
> Back in the original email thread where the problem was first reported,
> Felipe says that the problem appears to be something else. Here's what
> it looks like now, in schematic form:
>
>	CPU0
>	----
>	get_next_command():
>		while (bh->state != BUF_STATE_EMPTY)
>			sleep_thread();
>		start an input request for bh
>		while (bh->state != BUF_STATE_FULL)
>			sleep_thread();
>
> As mentioned above, the input involves DMA and is terminated by an irq.
> The request's completion handler is bulk_out_complete():
>
>	CPU1
>	----
>	bulk_out_complete():
>		bh->state = BUF_STATE_FULL;
>		wakeup_thread();
>
> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a
> value different from BUF_STATE_FULL. So it goes back to sleep again
> and doesn't make any forward progress.
>
> It's possible that something else is changing bh->state when it
> shouldn't. But if this were the explanation, why would Felipe see that
> the problem goes away when he changes the memory barriers in
> sleep_thread() and wakeup_thread() to smp_mb()?

Good question, let me stare at the code more.

Could you confirm that bulk_{in,out}_complete() work on different
usb_request structures, and that they cannot, at any time, get called on
the _same_ request?
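
FWIW, the lost-wakeup-safe pattern I would expect here is something like
the below (a sketch only; the bh/thread naming is assumed from the
f_mass_storage code, not verbatim):

	/* sleeper (CPU0) */
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (bh->state == BUF_STATE_FULL)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	/* waker (CPU1) */
	bh->state = BUF_STATE_FULL;
	wake_up_process(thread);

set_current_state() puts a full barrier between setting the task state
and testing the condition, which pairs with the ordering implied on the
wakeup side, so the ->state store and the state check cannot pass each
other and the wakeup cannot get lost.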