Re: [PATCH v5 6/6] swiotlb: Remove pointless stride adjustment for allocations >= PAGE_SIZE

From: Robin Murphy
Date: Fri Mar 01 2024 - 11:41:56 EST


On 2024-03-01 3:39 pm, Petr Tesařík wrote:
On Thu, 29 Feb 2024 16:47:56 +0100
Christoph Hellwig <hch@xxxxxx> wrote:

On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
Any thoughts on how that historical behavior should apply if
the DMA min_align_mask is non-zero, or the alloc_align_mask
parameter to swiotbl_tbl_map_single() is non-zero? As currently
used, alloc_align_mask is page aligned if the IOMMU granule is
= PAGE_SIZE. But a non-zero min_align_mask could mandate
returning a buffer that is not page aligned. Perhaps do the
historical behavior only if alloc_align_mask and min_align_mask
are both zero?

I think the driver setting min_align_mask is a clear indicator
that the driver requested a specific alignment and the defaults
don't apply. For swiotbl_tbl_map_single as used by dma-iommu
I'd have to tak a closer look at how it is used.

I'm not sure it helps in this discussion, but let me dive into a bit
of ancient history to understand how we ended up here.

IIRC this behaviour was originally motivated by limitations of PC AT
hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow
usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
added a page register, but it was on a separate chip and it did not
increment when the 8237 address rolled over back to zero. Effectively,
the page register selected a 64K-aligned window of 64K buffers.
Consequently, DMA buffers could not cross a 64K physical boundary.

Thanks to how the buddy allocator works, the 64K-boundary constraint
was satisfied by allocation size, and drivers took advantage of it when
allocating device buffers. IMO software bounce buffers simply followed
the same logic that worked for buffers allocated by the buddy allocator.

OTOH min_align_mask was motivated by NVME which prescribes the value of
a certain number of low bits in the DMA address (for simplicity assumed
to be identical with the same bits in the physical address).

The only pre-existing user of alloc_align_mask is x86 IOMMU code, and
IIUC it is used to guarantee that unaligned transactions do not share
the IOMMU granule with another device. This whole thing is weird,
because swiotlb_tbl_map_single() is called like this:

aligned_size = iova_align(iovad, size);
phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
iova_mask(iovad), dir, attrs);

Here:

* alloc_size = iova_align(iovad, size)
* alloc_align_mask = iova_mask(iovad)

Now, iova_align() rounds up its argument to a multiple of iova granule
and iova_mask() is simply "granule - 1". This works, because granule
size must be a power of 2, and I assume it must also be >= PAGE_SIZE.

Not quite, the granule must *not* be larger than PAGE_SIZE (but can be smaller).
In that case, the alloc_align_mask argument is not even needed if you
adjust the code to match documentation---the resulting buffer will be
aligned to a granule boundary by virtue of having a size that is a
multiple of the granule size.

I think we've conflated two different notions of "allocation" here. The page-alignment which Christoph quoted applies for dma_alloc_coherent(), which nowadays *should* only be relevant to SWIOTLB code in the swiotlb_alloc() path for stealing coherent pages out of restricted DMA pools. Historically there was never any official alignment requirement for the DMA addresses returned by dma_map_{page,sg}(), until min_align_mask was introduced.

To sum it up:

1. min_align_mask is by far the most important constraint. Devices will
simply stop working if it is not met.
2. Alignment to the smallest PAGE_SIZE order which is greater than or
equal to the requested size has been documented, and some drivers
may rely on it.

Strictly it stopped being necessary since fafadcd16595 when the historical swiotlb_alloc() was removed, but 602d9858f07c implies that indeed the assumption of it for streaming mappings had already set in. Of course NVMe is now using min_align_mask, but I'm not sure if it's worth taking the risk to find out who else should also be.

3. alloc_align_mask is a misguided fix for a bug in the above.

No, alloc_align_mask is about describing the explicit requirement rather than relying on any implicit behaviour, and thus being able to do the optimal thing for, say, a 9KB mapping given a 4KB IOVA granule and 64KB PAGE_SIZE.

Thanks,
Robin.


Correct me if anything of the above is wrong.

HTH
Petr T