Re: [PATCH 4/9] dmaengine: dw-edma: HDMA: Add memory barrier before starting the DMA transfer in remote setup

From: Serge Semin
Date: Tue Jun 20 2023 - 07:46:16 EST


On Mon, Jun 19, 2023 at 08:32:07PM +0200, Köry Maincent wrote:
> On Mon, 19 Jun 2023 20:02:01 +0300
> Serge Semin <fancer.lancer@xxxxxxxxx> wrote:
>
> > On Fri, Jun 09, 2023 at 10:16:49AM +0200, Köry Maincent wrote:
> > > From: Kory Maincent <kory.maincent@xxxxxxxxxxx>
> > >
> >
> > > The linked list elements and pointer are not stored in the same memory as
> > > the HDMA controller registers. If the doorbell register is toggled before
> > > the linked list is fully written, a race condition can appear.
> > > In a remote setup we can only use a readl() to the memory to ensure the full
> > > write has occurred.
> > >
> > > Fixes: e74c39573d35 ("dmaengine: dw-edma: Add support for native HDMA")
> > > Signed-off-by: Kory Maincent <kory.maincent@xxxxxxxxxxx>
> >
> > Is this a hypothetical bug? Have you actually experienced the
> > described problem? If so are you sure that it's supposed to be fixed
> > as you suggest?
>

> I did experience this problem and this patch fixed it.

Could you give more details on how often it happens? Is it reliably
reproducible, or does it happen only on very rare occasions?

>
> >
> > I am asking because based on the kernel doc
> > (Documentation/memory-barriers.txt):
> >
> > * 1. All readX() and writeX() accesses to the same peripheral are ordered
> > * with respect to each other. This ensures that MMIO register accesses
> > * by the same CPU thread to a particular device will arrive in program
> > * order.
> > * ...
> > * The ordering properties of __iomem pointers obtained with non-default
> > * attributes (e.g. those returned by ioremap_wc()) are specific to the
> > * underlying architecture and therefore the guarantees listed above cannot
> > * generally be relied upon for accesses to these types of mappings.
> >
> > the IOs performed by the accessors are supposed to arrive in
> > program order. Thus SET_CH_32(..., HDMA_V0_DOORBELL_START) performed
> > after all the previous SET_CH_32(...) are finished looks correct, with
> > no need for additional barriers. The results of those earlier operations
> > are supposed to be seen by the device (in our case a remote DW
> > eDMA controller) before the doorbell update. From that
> > perspective your problem looks as if the IO operations preceding the
> > doorbell CSR update aren't finished yet. So are you sure that the LL
> > memory is mapped with no additional flags like Write-Combine or some
> > caching optimizations? Are you sure that the PCIe IOs are correctly
> > implemented in your platform?
>
> No, I don't know if there are extra flags or optimizations.

Well, I can't know that either.) The only one who can figure it out is
you, at least at this stage (I doubt Gustavo will ever get back to
reviewing and testing the driver on his remote eDMA device). I can
help if you provide some more details about the platform you are
using and about the low-level driver (is it
drivers/dma/dw-edma/dw-edma-pcie.o?) which detects the remote DW eDMA
device and probes it via the DW eDMA core.

* Though I don't have hardware with the remote DW eDMA setup to try to
reproduce and debug the problem you discovered.

>
> >
> > I do understand that the eDMA CSRs and the LL memory are mapped by
> > different BARs in the remote eDMA setup. But they still belong to the
> > same device. So the IO accessor semantics described in the kernel doc
> > imply no need for an additional barrier.
>
> Even if they are on the same device, they are two types of memory.

What do you mean by "two types of memory"? From the CPU perspective
they are the same. Both are mapped via MMIO by means of a PCIe Root
Port outbound memory window.

> I am not a PCIe expert, but I suppose the PCIe controller of the board is
> sending to both memories, and if one of them (LL memory here) is slower in the
> write process then we face this race issue. We cannot find out whether the write
> to LL memory has finished before the CSR write, even if the write command was
> sent earlier.

From your description there is no guarantee that reading from the
remote device solves the race for sure. If the writes have been collected
in a cache, then the intermediate read will return data from the
cache with nothing being flushed to the device memory. It is
possible that in your case the read just adds enough delay for
some independent activity to flush the cache. Thus the problem you
discovered may come back under other circumstances. Moreover, based on
the PCI Express specification, "A Posted Request must not pass another
Posted Request unless a TLP has RO (Relaxed ordering) or IDO (ID-based
ordering) flag set." So neither intermediate PCIe switches nor the
PCIe host controller is supposed to re-order simple writes unless the
Root Port outbound MW is configured to set the denoted flags. In any case
all of that is platform specific. So in order to figure it out
we need more details about the platform from you.

Meanwhile:

Q1: Are you sure that neither dma_wmb() nor io_stop_wc() helps to solve
the problem in your case?

Q2: Does specifying a delay instead of the dummy read before the
doorbell update solve the problem?
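
To make Q1 and Q2 concrete, here is a rough sketch of the experiment I
have in mind. It is a userspace mock only, not driver code: the
mock_writel() and log_op() helpers, the sync_variant enum and the
start_transfer() name are all made up for illustration, and the 0x1
doorbell value is a placeholder for HDMA_V0_DOORBELL_START. It only
records the order of the steps, so it shows where each candidate step
would sit relative to the LL write and the doorbell write. In the driver
the real writel(), readl(), dma_wmb(), io_stop_wc() and udelay() would
be used at those points.

```c
/*
 * Userspace mock, NOT driver code. The helpers below are hypothetical
 * stand-ins that only record the order of operations.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sync_variant {
	SYNC_NONE,		/* no extra step (code before the patch) */
	SYNC_DUMMY_READ,	/* readl() on the LL memory, as in the patch */
	SYNC_DMA_WMB,		/* Q1: dma_wmb() */
	SYNC_IO_STOP_WC,	/* Q1: io_stop_wc() */
	SYNC_DELAY,		/* Q2: a plain delay, e.g. udelay() */
};

#define LOG_MAX 8
const char *op_log[LOG_MAX];	/* names of the steps, in issue order */
int op_cnt;

static void log_op(const char *op)
{
	assert(op_cnt < LOG_MAX);
	op_log[op_cnt++] = op;
}

static uint32_t ll_mem;		/* stands in for the LL memory BAR */
static uint32_t doorbell_csr;	/* stands in for the doorbell CSR */

static void mock_writel(uint32_t val, volatile uint32_t *addr,
			const char *name)
{
	*addr = val;
	log_op(name);
}

/* Write one LL element, apply the chosen step, then ring the doorbell. */
void start_transfer(enum sync_variant v)
{
	op_cnt = 0;
	mock_writel(0x1, &ll_mem, "write LL");

	switch (v) {
	case SYNC_DUMMY_READ:
		(void)*(volatile uint32_t *)&ll_mem;
		log_op("readl LL");
		break;
	case SYNC_DMA_WMB:
		log_op("dma_wmb");
		break;
	case SYNC_IO_STOP_WC:
		log_op("io_stop_wc");
		break;
	case SYNC_DELAY:
		log_op("udelay");
		break;
	case SYNC_NONE:
		break;
	}

	/* 0x1 is a placeholder for HDMA_V0_DOORBELL_START */
	mock_writel(0x1, &doorbell_csr, "write doorbell");
}
```

If the SYNC_DELAY variant hides the problem as well as the dummy read
does, that would suggest the read only helps by adding latency.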

-Serge(y)

>
> Köry,