Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?

From: Lucas Stach
Date: Mon Jan 29 2024 - 08:13:30 EST


On Monday, 29.01.2024 at 14:07 +0200, Laurent Pinchart wrote:
> On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> > On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > > That's right and a reality you have to deal with on those small ARM
> > > > systems. The ARM architecture allows for systems that don't enforce
> > > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > > variants make use of this architectural feature.
> > > >
> > > > What this means is that the CPU caches aren't coherent when it comes to
> > > > DMA from other masters like the video capture units. There are two ways
> > > > to enforce DMA coherency on such systems:
> > > > 1. map the DMA buffers uncached on the CPU
> > > > 2. require explicit cache maintenance when touching DMA buffers with
> > > > the CPU
> > > >
> > > > Option 1 is what you see is happening in your setup, as it is simple,
> > > > straight-forward and doesn't require any synchronization points.
> > >
> > > Yeah, and it also does not work :-).
> > >
> > > Userspace gets the buffers, and it is not really equipped to work with
> > > them. For example, on pinephone, memcpy() crashes on uncached
> > > memory. I'm pretty sure a user could have some kind of kernel-crashing
> > > fun if he passed the uncached memory to futex or something similar.
> >
> > Uncached buffers are ubiquitous on arm/arm64 so there must be something
> > else going on. And there's nothing to equip for, it's just a memory
> > array you can access in any way you want (but very slowly).
> >
> > How does it not work?
>
> I agree, this should just work (albeit possibly slowly). A crash is a
> sign something needs to be fixed.
>
Optimized memcpy implementations might use unaligned accesses at the
edges of the copy region, which will in fact not work with uncached
memory: hardware unaligned access support on ARM(64) requires the
bufferable memory attribute, so you might see alignment aborts in this
case.
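
Something along these lines would avoid the problem on the CPU side
(untested sketch, the helper name is made up): read the uncached
mapping only with naturally aligned accesses and let the cached
destination take any unaligned stores.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch only: copy out of an uncached mapping without relying on
 * unaligned loads. "src" is assumed to be the uncached dma-buf
 * mapping, "dst" an ordinary cached buffer.
 */
static void copy_from_uncached(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t v;

	/* Byte accesses until the uncached source is 8-byte aligned. */
	while (len && ((uintptr_t)s & 7)) {
		*d++ = *s++;
		len--;
	}

	/* Aligned 8-byte loads from the uncached side; the cached
	 * destination doesn't care about alignment, so a small
	 * memcpy() (compiled to a plain store) is fine there. */
	while (len >= 8) {
		v = *(const uint64_t *)s;
		memcpy(d, &v, sizeof(v));
		s += 8;
		d += 8;
		len -= 8;
	}

	/* Remaining tail bytes. */
	while (len--)
		*d++ = *s++;
}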

Write-combined mappings are bufferable and thus don't exhibit this
issue.
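
If the CPU-side mapping is to stay uncacheable, the allocation could
also be switched to write-combined, roughly like the sketch below (not
a drop-in patch, the my_* names are placeholders):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Sketch: hand out a write-combined buffer so the userspace mapping
 * is bufferable and unaligned accesses don't fault.
 */
static int my_buf_alloc(struct device *dev, size_t size,
			void **vaddr, dma_addr_t *dma_addr)
{
	*vaddr = dma_alloc_wc(dev, size, dma_addr, GFP_KERNEL);
	return *vaddr ? 0 : -ENOMEM;
}

/* Keep the same attributes when mapping the buffer to userspace. */
static int my_buf_mmap(struct device *dev, struct vm_area_struct *vma,
		       void *vaddr, dma_addr_t dma_addr, size_t size)
{
	return dma_mmap_wc(dev, vma, vaddr, dma_addr, size);
}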

> > > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > > V4L2 device and then executing the necessary cache synchronization in
> > > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > > master. However this isn't guaranteed to be any faster, as the cache
> > > > synchronization itself is a pretty heavy-weight operation when you are
> > > > dealing with buffers that are potentially multiple megabytes in size.
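
(For reference, option 2 at the DMA API level would look roughly like
the sketch below; the my_* names are made up and it isn't tied to any
particular driver.)

#include <linux/dma-mapping.h>

/*
 * Rough sketch of option 2: the buffer stays cached and ownership is
 * handed over explicitly at qbuf/dqbuf time.
 */
static void my_qbuf(struct device *dev, dma_addr_t dma_addr, size_t size)
{
	/* CPU is done with the buffer, give it to the capture engine. */
	dma_sync_single_for_device(dev, dma_addr, size, DMA_FROM_DEVICE);
	/* ... arm the DMA ... */
}

static void my_dqbuf(struct device *dev, dma_addr_t dma_addr, size_t size)
{
	/* DMA completed, hand the buffer back to the CPU. */
	dma_sync_single_for_cpu(dev, dma_addr, size, DMA_FROM_DEVICE);
	/* ... the frame may now be read through the cached mapping ... */
}
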
> > >
> > > Yes, cache synchronization can be slow, but IIRC it was on the order
> > > of a millisecond in the worst case... and copying megabyte images is
> > > still slower than that.
>
> Those numbers are platform-specific, you can't assume this to be true
> everywhere.
>
The last time I looked at this was on a pretty old platform (Cortex-A9).
There the TLB walks caused by cache maintenance by virtual address were
causing severe slowdowns, to the point where actually copying the data
performed the same as the cache maintenance within noise margins, with
the significant difference that copying leaves the data cache-hot for
the following operations.
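
For completeness: with cached dma-buf heap buffers, userspace is
expected to bracket its CPU accesses with the dma-buf sync ioctl, which
is exactly where that maintenance cost shows up. Roughly (sketch only,
buf_fd/consume are placeholders):

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Sketch: bracket CPU reads of a cached dma-buf with the sync ioctl. */
static int cpu_read_frame(int buf_fd, void (*consume)(void))
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
	};
	int ret;

	ret = ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	if (ret)
		return ret;

	consume();	/* read the mapped frame data */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
	return ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}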

Regards,
Lucas