Re: DMA mappings and crossing boundaries

From: Robin Murphy
Date: Wed Jul 04 2018 - 08:57:09 EST


On 02/07/18 14:37, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

> .../...

> Thanks Robin, I was starting to despair that anybody would reply ;-)

>>> AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
>>> HOWTO.txt) as always allocating to the next power-of-2 order, so we
>>> should never have the problem unless we allocate a single chunk larger
>>> than the IOMMU page size.

>> (and even then it's not *that* much of a problem, since it comes down to
>> just finding n > 1 consecutive unused IOMMU entries for exclusive use by
>> that new chunk)

> Yes, this case is not my biggest worry.
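To make the power-of-2 point concrete: a chunk of size 2^n handed out by the allocator is naturally aligned to 2^n, so it can never straddle any boundary of 2^m for m >= n. That falls out of the usual boundary-mask trick. A quick sketch (hypothetical userspace code with made-up names, not kernel API):

```c
#include <stdint.h>

/* True if [addr, addr + size) crosses a multiple of `boundary`,
 * where `boundary` is a power of two. The bits above the boundary
 * mask differ between the first and last byte exactly when the
 * region straddles a boundary line. */
static int crosses_boundary(uint64_t addr, uint64_t size, uint64_t boundary)
{
    return ((addr ^ (addr + size - 1)) & ~(boundary - 1)) != 0;
}
```

A naturally-aligned 1M chunk sitting just below a 256M line stays on one side of it; only a region that is misaligned relative to its own size can end up straddling the line.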

>>> For dma_map_sg() however, if a request has a single "entry"
>>> spanning such a boundary, we need to ensure that the resulting
>>> mapping is 2 contiguous "large" IOMMU pages as well.
>>>
>>> However, that doesn't fit well with us re-using existing mappings,
>>> since they may already exist and either not be contiguous, or
>>> partially exist with no free hole around them.
>>>
>>> Now, we *could* possibly contrive a way to solve this by detecting
>>> this case and just allocating another "pair" (or set, if we cross even
>>> more pages) of IOMMU pages elsewhere, thus partially breaking our
>>> re-use scheme.
>>>
>>> But while doable, this introduces some serious complexity into the
>>> implementation, which I would very much like to avoid.
>>>
>>> So I was wondering if you guys thought that was ever likely to happen.
>>> Do you see reasonable cases where dma_map_sg() would be called with a
>>> list in which a single entry crosses a 256M or 1G boundary?

>> For streaming mappings of buffers cobbled together out of any old CPU
>> pages (e.g. user memory), you may well happen to get two
>> physically-adjacent pages falling either side of an IOMMU boundary,
>> which comprise all or part of a single request - note that whilst it's
>> probably less likely than the scatterlist case, this could technically
>> happen for dma_map_{page,single}() calls too.

> Could it? I wouldn't think dma_map_page is allowed to cross page
> boundaries... what about single()? The main worry is people using
> these things on kmalloc'ed memory.

Oh, absolutely - the underlying operation is just "prepare for DMA to/from this physically-contiguous region"; the only real difference between map_page and map_single is for the sake of the usual "might be highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits on the size and offset parameters (in fact, if anyone asks I would say the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few months back is valid for dma_map_page too, however silly it may seem).

Of course, given that the allocators tend to give out size/order-aligned chunks, I think you'd have to be pretty tricksy to get two allocations to line up either side of a large power-of-two boundary *and* go out of your way to then make a single request spanning both, but it's certainly not illegal. Realistically, the kind of "scrape together a large buffer from smaller pieces" code which is liable to hit a boundary-crossing case by sheer chance is almost certainly going to be taking the sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather than implementing its own merging and piecemeal mapping.
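To illustrate that last point, here is a sketch (userspace, made-up names; not the actual scatterlist code) of the kind of blind coalescing of physically-adjacent pages that sg_alloc_table_from_pages() performs: two pages that happen to sit either side of a 256M line merge into one segment that straddles it, with no boundary check anywhere.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096ULL

struct seg { uint64_t phys; uint64_t len; };

/* Coalesce a list of physical page addresses into contiguous
 * segments, the way sg_alloc_table_from_pages() merges adjacent
 * pages - with no notion of any IOMMU boundary. Returns the
 * number of segments written to out[]. */
static size_t merge_pages(const uint64_t *pages, size_t n, struct seg *out)
{
    size_t nsegs = 0;

    for (size_t i = 0; i < n; i++) {
        if (nsegs && out[nsegs - 1].phys + out[nsegs - 1].len == pages[i]) {
            /* Physically adjacent: extend the previous segment. */
            out[nsegs - 1].len += PAGE_SIZE;
        } else {
            out[nsegs].phys = pages[i];
            out[nsegs].len = PAGE_SIZE;
            nsegs++;
        }
    }
    return nsegs;
}
```

Feed it the pages at 0x0FFFF000 and 0x10000000 and you get back a single 8K segment crossing the 256M line - exactly the "small but nonzero probability" case.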

>> Conceptually it looks pretty easy to extend the allocation constraints
>> to cope with that - even the pathological worst case would have an
>> absolute upper bound of 3 IOMMU entries for any one physical region -
>> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
>> DMA addresses having only 4 1GB slots to play with, I can't really see a
>> way to make that practical :(

> No, we are talking about 40-ish bits of address space, so there's a bit
> of leeway. Of course no scheme will work if the user app tries to map
> more than the GPU can possibly access.
>
> But with newer AMD adding a few more bits and nVidia being at 47 bits,
> I think we have some margin; it's just that they can't reach our
> discontiguous memory with a normal 'bypass' mapping, and I'd rather not
> teach Linux about every single way our HW can scatter memory across
> nodes, so an "on demand" mechanism is by far the most flexible way to
> deal with all configurations.
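The "3 entries" upper bound above is simple arithmetic, assuming a single physically contiguous region never exceeds two large IOMMU pages in size: such a region can start just before one boundary line and end just after the next, touching at most three slots. A hypothetical sketch (names made up):

```c
#include <stdint.h>

/* How many IOMMU pages of size (1ULL << shift) does the physically
 * contiguous region [addr, addr + size) touch? */
static uint64_t iommu_slots_spanned(uint64_t addr, uint64_t size, unsigned shift)
{
    return ((addr + size - 1) >> shift) - (addr >> shift) + 1;
}
```

With 256M IOMMU pages (shift 28): an aligned 256M region needs 1 slot, an 8K region straddling a line needs 2, and a 512M region starting 4K before a line hits the worst case of 3.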

>> Maybe the best compromise would be some sort of hybrid scheme which
>> makes sure that one of the IOMMU entries always covers the SWIOTLB
>> buffer, and invokes software bouncing for the awkward cases.

> Hrm... not too sure about that. I'm happy to limit that scheme to
> well-known GPU vendor/device IDs, and SW bouncing is pointless in these
> cases. It would be nice if we could have some kind of guarantee that a
> single mapping or sglist entry never crossed a specific boundary,
> though... We more or less have that for 4G already (well, we are
> supposed to at least). Who are the main potentially problematic
> subsystems here? I'm thinking network skb allocation pools... and the
> page cache, if it tries to coalesce entries before issuing the map
> request - does it?

I don't know of anything definite off-hand, but my hunch is to be most wary of anything wanting to do zero-copy access to large buffers in userspace pages. In particular, sg_alloc_table_from_pages() lacks any kind of boundary enforcement (and almost all users don't even use the segment-length-limiting variant either), so I'd say any caller of that currently has a very small, but nonzero, probability of spoiling your day.
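For what it's worth, the missing enforcement amounts to splitting a merged run at each boundary line. Something like this (a hypothetical userspace sketch, not a real kernel helper) is all a boundary-aware merge would need:

```c
#include <stdint.h>
#include <stddef.h>

struct piece { uint64_t phys; uint64_t len; };

/* Split the region [phys, phys + len) at every multiple of
 * `boundary` (a power of two), writing at most `max` pieces to
 * out[]. Returns the number of pieces produced. */
static size_t split_at_boundary(uint64_t phys, uint64_t len,
                                uint64_t boundary,
                                struct piece *out, size_t max)
{
    size_t n = 0;

    while (len && n < max) {
        /* Address of the next boundary line above phys. */
        uint64_t next = (phys | (boundary - 1)) + 1;
        uint64_t chunk = next - phys;

        if (chunk > len)
            chunk = len;
        out[n].phys = phys;
        out[n].len = chunk;
        n++;
        phys += chunk;
        len -= chunk;
    }
    return n;
}
```

An 8K run straddling the 256M line comes back as two 4K pieces, one ending exactly at the boundary and one starting on it - each of which then fits a single IOMMU entry.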

Robin.