Re: [PATCH] drm: gem: add an option for supporting the dma-coherent hardware.

From: Sui Jingfeng
Date: Thu Jun 08 2023 - 04:18:09 EST


Hi,

On 2023/6/8 15:39, Maxime Ripard wrote:
On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
Hi,

On 2023/6/8 00:12, Paul Cercueil wrote:
Hi Sui,

On Wed, Jun 07, 2023 at 22:38 +0800, Sui Jingfeng wrote:
Hi, welcome to the discussion.


I have limited skill in English.

I may not be able to express what I really mean in a short time,

and some of the words in my sentences may not be as accurate as yours.

Please don't misunderstand; I don't mean to be rude.
No problem.

I will explain it in more detail.

See below:


On 2023/6/7 20:09, Paul Cercueil wrote:
Hi Sui,

On Wed, Jun 07, 2023 at 18:30 +0800, Sui Jingfeng wrote:
Hi,


On 2023/6/7 17:36, Paul Cercueil wrote:
Hi Sui,

On Wed, Jun 07, 2023 at 13:30 +0800, Sui Jingfeng wrote:
The single map_noncoherent member of struct drm_gem_dma_object may
not be sufficient for describing the backing memory of the GEM
buffer object.

Especially on dma-coherent systems, the backing memory is both cache
coherent for multi-core CPUs and dma-coherent for peripheral devices.
Think of architectures such as X86-64, LoongArch64, Loongson Mips64,
etc.

Whether a peripheral device is dma-coherent or not can be
implementation-dependent. The single map_noncoherent option is no
longer enough to reflect the real hardware. For example, on the
Loongson LS3A4000 CPU and the LS2K2000/LS2K1000 SoC, peripheral
devices always snoop the CPU's cache, so allocating with the
dma_alloc_coherent function is preferred. The returned buffer is
cached; it should not use the default write-combine mapping. Yet with
the current implementation, there is no way to tell the DRM core to
reflect this.

This patch adds cached and coherent members to struct
drm_gem_dma_object, which allow driver implementations to inform the
core, introducing new mappings while keeping the original default
behavior unchanged.
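
For reference, a sketch of the resulting structure. The existing
fields follow include/drm/drm_gem_dma_helper.h; the placement and the
comments on the new members are inferred from the description above,
so treat this as illustrative rather than as the patch itself:

  struct drm_gem_dma_object {
          struct drm_gem_object base;
          dma_addr_t dma_addr;
          struct sg_table *sgt;
          void *vaddr;           /* kernel virtual address of the backing memory */
          bool map_noncoherent;  /* existing: request a non-coherent mapping */
          /* proposed: describe the backing memory more precisely */
          bool cached;           /* the CPU mapping is cacheable */
          bool coherent;         /* the device snoops the CPU cache */
  };
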
Did you try to simply set the "dma-coherent" property on the device's
node?
But this approach can only be applied to device drivers with DT
support.

X86-64, Loongson LS3A4000 MIPS64, and Loongson LS3A5000 CPUs
typically do not have DT support.

They use ACPI to pass parameters from the firmware to the Linux
kernel.

Your approach loses its effectiveness in such cases.
Well, I don't really know how ACPI handles it - but it should just be
a matter of setting dev->dma_coherent. That's basically what the DT
code does.

Some MIPS boards set it in their setup code, for instance.
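
A minimal sketch of that idea (the helper name is hypothetical, and
dev->dma_coherent only exists on architectures with per-device
coherency support):

  #include <linux/device.h>
  #include <linux/dma-map-ops.h>

  /* Hypothetical platform setup hook: mark one device as coherent,
   * which is essentially what the DT code does when it sees the
   * "dma-coherent" property.  Afterwards dev_is_dma_coherent(dev)
   * reports true and the DMA API can skip cache maintenance. */
  static void example_mark_device_coherent(struct device *dev)
  {
          dev->dma_coherent = true;
  }
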

This is a *strategy*, not a *mechanism*.

In this case, DT is just used to describe the hardware.

(It is actually a hardware-description language, and its granularity
is coarse.)

It does not change the state of the hardware.

It is your platform firmware or kernel setup code that actually does
such things.

And just because it works on *one* platform does not guarantee it
will work on others.
If you add the "dma-coherent" property in a device node in DT, you
effectively specify that the device is DMA-coherent; so you describe
the hardware, which is what DT is for, and you are not changing the
state of the hardware.

Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
default to DMA-coherent mapping; I believe you could do something
similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.

The problem is that a device driver can have various demands.

It may want to create different kinds of buffers for different things
simultaneously.

Say, one allocated with dma_alloc_coherent() for a command buffer or
DMA descriptors, another allocated with dma_alloc_wc() for uploading
shaders, etc., and a third one allocated with dma_alloc_noncoherent()
for something else. A sketch of that pattern follows below.
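
A minimal sketch of such a driver, with hypothetical buffer sizes and
error unwinding trimmed for brevity:

  #include <linux/dma-mapping.h>
  #include <linux/sizes.h>

  static int example_alloc_buffers(struct device *dev)
  {
          dma_addr_t ring_dma, shader_dma, scratch_dma;
          void *ring, *shaders, *scratch;

          /* command ring / DMA descriptors: CPU and device both
           * touch it constantly, so ask for a coherent buffer */
          ring = dma_alloc_coherent(dev, SZ_16K, &ring_dma, GFP_KERNEL);

          /* shader upload: write-mostly from the CPU, so a
           * write-combined mapping is the usual choice */
          shaders = dma_alloc_wc(dev, SZ_64K, &shader_dma, GFP_KERNEL);

          /* scratch data: the driver does explicit cache
           * maintenance around each transfer */
          scratch = dma_alloc_noncoherent(dev, SZ_4K, &scratch_dma,
                                          DMA_BIDIRECTIONAL, GFP_KERNEL);

          if (!ring || !shaders || !scratch)
                  return -ENOMEM;
          return 0;
  }
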
And it will work just fine.

struct device's dma_coherent flag, or DT's dma-coherent property,
defines that the device doesn't need any kind of cache maintenance,
ever. If it's missing, we need to perform cache maintenance to keep
coherency.

dma_alloc_* functions provide guarantees to the driver. With
dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and thus
you don't need to perform cache maintenance operations by hand in the
driver.

A BO returned by dma_alloc_wc() doesn't work on some platforms.

The guarantee may hold only for the CPU side; there is no guarantee
for the GPU side.

For example, suppose the GPU always snoops the CPU's cache: the GPU
fetches data from the CPU's cache on a hit, and from system RAM on a
miss.

But when dma_alloc_wc() is called, the BO on the CPU side is mapped
with the write-combine property, and the write buffer within the CPU
gathers the CPU-side write accesses.

That is to say, some data may still reside (stall) in the write
buffer while the GPU fetches from system RAM or the CPU's cache,

so the GPU will fetch stale data.
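
A sketch of the pattern in question (all identifiers here are made
up): on most platforms a write barrier drains the CPU's
write-combine buffers before the device is kicked, but that only
helps if the device then reads from system RAM; the concern above is
hardware where the snoop path has no such guarantee:

  /* "shaders"/"shader_dma" as returned by dma_alloc_wc() */
  memcpy(shaders, shader_binary, shader_len); /* gathers in WC buffers */
  wmb();                  /* drain the CPU write buffers toward RAM */
  writel(lower_32_bits(shader_dma), mmio + GPU_SHADER_BASE); /* kick */
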


This is the situation on our hardware; I don't know how the ARM
platform guarantees coherency in this case.

If it relies on software to guarantee it, then it is still not
hardware-maintained coherency.


When it relies on software, I call it implementation-dependent.

There are some architectures without such an implementation, or where
nobody knows how to implement it.

If a device can't even snoop the CPU's cache, I don't believe it can
snoop the CPU's write buffer.

I am not sure the DMA API can provide that guarantee on every
architecture.


With dma_alloc_noncoherent, the buffer is non-coherent and the driver
needs to perform those cache maintenance operations itself when
relevant.

How those buffers are created is platform specific, but the guarantees
provided *to the driver* are always there.

A buffer allocated with dma_alloc_coherent might be provided by
different means (at the hardware level with a coherency unit, by mapping
it non-cacheable), but as far as the driver is concerned it's always
going to be coherent.

Similarly, a driver using dma_alloc_noncoherent will always require
cache maintenance operations to use the API properly, even if the
hardware provides coherency (in which case, those operations will be
nop).
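
For instance, a minimal sketch of the non-coherent pattern, where
start_device_dma() is a hypothetical stand-in for whatever starts
the transfer:

  #include <linux/dma-mapping.h>
  #include <linux/string.h>

  void start_device_dma(struct device *dev, dma_addr_t addr,
                        size_t len);  /* hypothetical */

  static void example_transfer(struct device *dev, void *buf,
                               dma_addr_t handle, size_t len)
  {
          /* the CPU fills the buffer through its cacheable mapping */
          memset(buf, 0xaa, len);

          /* hand ownership to the device: on non-coherent hardware
           * this writes dirty cache lines back; on coherent hardware
           * it is (nearly) a no-op.  The driver is correct either way. */
          dma_sync_single_for_device(dev, handle, len, DMA_BIDIRECTIONAL);
          start_device_dma(dev, handle, len);

          /* take ownership back before the CPU reads the results */
          dma_sync_single_for_cpu(dev, handle, len, DMA_BIDIRECTIONAL);
  }
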

So, yeah, like I was saying in the other mail, it looks like you're
confusing a bunch of things. dma_alloc_* functions are about the driver
expectations and guarantees. DT's dma-coherent property is about how we
can implement them on a given platform.

That is the ideal situation.

You haven't seen the actual bugs.

Yeah, I am a bit confused about the DMA API.

Maybe you and Paul can continue to work on this.


But DT's dma-coherent property is definitely not a system-level
solution.

drm/amdgpu, drm/radeon and drm/i915 don't support DT.

If an ARM platform is dma-noncoherent, I suspect drm/amdgpu and
drm/radeon will not work on it: there are no calls to
dma_sync_single_for_device(), dma_sync_single_for_cpu(), etc.

These drivers assume dma-coherent hardware.

They don't have to match, and that's precisely how we can have drivers
that run on any combination of platforms: the driver only cares about
the buffer guarantees, the platform description takes care of how they
are implemented.

Maxime

--
Jingfeng