Re: [PATCH] [RFC] drivers: dma-coherent: use MEMREMAP_WB instead of MEMREMAP_WC

From: Brian Starkey
Date: Thu Nov 10 2016 - 04:52:13 EST


Hi Jaewon,

On Thu, Nov 10, 2016 at 10:41:43AM +0900, Jaewon Kim wrote:
Hi

On 2016ë 11ì 09ì 19:23, Brian Starkey wrote:
Hi,

On Wed, Nov 09, 2016 at 06:47:26PM +0900, Jaewon Kim wrote:


On 2016ë 11ì 09ì 18:27, Brian Starkey wrote:
Hi Jaewon,

On Wed, Nov 09, 2016 at 06:10:09PM +0900, Jaewon Kim wrote:
Commit 6b03ae0d42bf (drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MA)
added MEMREMAP_WC for DMA_MEMORY_MAP. If, however, CPU cache can be used on
DMA_MEMORY_MAP, I think MEMREMAP_WC can be changed to MEMREMAP_WB. On my local
ARM device, memset in dma_alloc_from_coherent sometimes takes much longer with
MEMREMAP_WC compared to MEMREMAP_WB.

Test results on AArch64 by allocating 4MB with putting trace_printk right
before and after memset.
MEMREMAP_WC : 11.0ms, 5.7ms, 4.2ms, 4.9ms, 5.4ms, 4.3ms, 3.5ms
MEMREMAP_WB : 0.7ms, 0.6ms, 0.6ms, 0.6ms, 0.6ms, 0.5ms, 0.4 ms


This doesn't look like a good idea to me. The point of coherent memory
is to have it non-cached, however WB will make writes hit the cache.

Writing to the cache is of course faster than writing to RAM, but
that's not what we want to do here.

-Brian

Hi Brian

Thank you for your comment.
If allocated memory will be used by TZ side, however, I think cacheable
also can be used to be fast on memset in dma_alloc_from_coherent.

Are you trying to share the buffer between the secure and non-secure
worlds on the CPU? In that case, I don't think caching really helps
you. I'm not a TZ expert, but I believe the two worlds can never
share cached data.
I do not want memory sharing between the secure and non-secure worlds.
I just want faster allocation.

So now I'm confused. Why did you mention TZ?

Could you explain what your use-case for the buffer you are allocating
is?


I am not a TZ expert, either. I also think they cannot share cached data.
As far as I know secure world can decide its cache policy with secure
world page table regardless of non-secure world.
If you want the secure world to see the non-secure world's data, as
far as I know you will need to clean the cache in the non-secure world
to make sure the secure world can see it (and vice-versa). I'd expect
this to remove most of the speed advantage of using WB in the first
place, except for some possible speedup from more efficient bursting.
Yes I also think non-secure world need to clean the cache before secure world
access the memory region to avoid invalid data issue. But if other software
like Linux driver or hypervisor do the cache cleaning, or engineer confirm,
then we may be able to use MEMREMAP_WB or just skipping memset for faster
memory allocation in dma_alloc_from_coherent.

Skipping the memset doesn't sound like a good plan, you'd potentially
be leaking information to whatever device receives the memory. And
adding WB mappings to an API which is intended to be used for sharing
buffers with DMA masters doesn't sound like a good move either.


If you're sharing the buffer with other DMA masters, regardless of
secure/non-secure you're not going to want WB mappings.

If there is a scenario where another DMA master works on this memory,
an engineer, I think, need to consider cache clean if he/she uses WB.

The whole point of dma-coherent memory is for use by DMA masters.

How do you think to add another flag to distinguish this case?

You could look into the streaming DMA API. It will depend on the exact
implementation, but at some point you're still going to have to pay
the penalty of syncing the CPU and device.

-Brian

I cannot find DMA API and flag for WB. So I am considering additional flag
to meet my request. In my opinion the flag can be either WB or non-zeroing.

To me, it sounds like dma-coherent isn't the right tool to achieve
what you want, but I'm not clear on exactly what it is you're trying
to do (I know you want faster allocations, the point is what for?)

-Brian


For case #1 - DMA_MEMORY_MAP_WB
--- a/drivers/base/dma-coherent.c
+++ b/drivers/base/dma-coherent.c
@@ -32,7 +32,9 @@ static bool dma_init_coherent_memory(
if (!size)
goto out;

- if (flags & DMA_MEMORY_MAP)
+ if (flags & DMA_MEMORY_MAP_WB)
+ mem_base = memremap(phys_addr, size, MEMREMAP_WB);
+ else if (flags & DMA_MEMORY_MAP)
mem_base = memremap(phys_addr, size, MEMREMAP_WC);
else
mem_base = ioremap(phys_addr, size);

For case #2 - DMA_MEMORY_MAP_NOZEROING
--- a/drivers/base/dma-coherent.c
+++ b/drivers/base/dma-coherent.c
@@ -190,6 +190,8 @@ int dma_alloc_from_coherent(struct device *dev, ssize_t size,
*ret = mem->virt_base + (pageno << PAGE_SHIFT);
dma_memory_map = (mem->flags & DMA_MEMORY_MAP);
spin_unlock_irqrestore(&mem->spinlock, flags);
+ if (mem->flags & DMA_MEMORY_MAP_NOZEROING)
+ return 1;
if (dma_memory_map)
memset(*ret, 0, size);
else

Can I get your comment?

Thank you
Jaewon Kim

Signed-off-by: Jaewon Kim <jaewon31.kim@xxxxxxxxxxx>
---
drivers/base/dma-coherent.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/dma-coherent.c b/drivers/base/dma-coherent.c
index 640a7e6..0512a1d 100644
--- a/drivers/base/dma-coherent.c
+++ b/drivers/base/dma-coherent.c
@@ -33,7 +33,7 @@ static bool dma_init_coherent_memory(
goto out;

if (flags & DMA_MEMORY_MAP)
- mem_base = memremap(phys_addr, size, MEMREMAP_WC);
+ mem_base = memremap(phys_addr, size, MEMREMAP_WB);
else
mem_base = ioremap(phys_addr, size);
if (!mem_base)
--
1.9.1