Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE

From: Robin Murphy
Date: Mon Aug 09 2021 - 14:37:33 EST


On 2021-08-07 09:41, Sven Peter wrote:
Hi,

Thanks a lot for quick reply!

On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
On 2021-08-06 16:55, Sven Peter via iommu wrote:
DMA IOMMU domains can support hardware where the IOMMU page size is
larger than the CPU page size.
Alignments need to be done with respect to both PAGE_SIZE and
iovad->granule. Additionally, the sg list optimization to use a single
IOVA allocation cannot be used in those cases since the physical
addresses will very likely not be aligned to the larger IOMMU page size.

Signed-off-by: Sven Peter <sven@xxxxxxxxxxxxx>
---
drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 6f0df629353f..e072d9030d9f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -8,6 +8,7 @@
* Copyright (C) 2000-2004 Russell King
*/
+#include <linux/align.h>
#include <linux/acpi_iort.h>
#include <linux/device.h>
#include <linux/dma-map-ops.h>
@@ -51,6 +52,15 @@ struct iommu_dma_cookie {
struct iommu_domain *fq_domain;
};
+/* aligns size to CPU and IOMMU page size */
+static inline size_t iommu_page_align(struct device *dev, size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+ return iova_align(&cookie->iovad, PAGE_ALIGN(size));
+}
+
static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
bool iommu_dma_forcedac __read_mostly;
@@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
/*
* If size is less than PAGE_SIZE, then a full CPU page will be allocated,
* but an IOMMU which supports smaller pages might not map the whole thing.
+ * If the IOMMU page size is larger than the CPU page size, then the size
+ * will be aligned to that granularity and some memory will be left unused.

Why do we need to increase the actual memory allocation? The point here
is that we allocate the smallest thing we can allocate and map the
smallest thing we can map - I think that still works the "wrong" way
round too, we should just need to start taking an IOVA offset into
account as in dma_map_page() if we can no longer assume it's 0 for a CPU
page. Sure we may expose some unrelated adjacent pages, but we'll
already be doing that to excess for streaming DMA so whoop de do.

I agree for trusted devices, but untrusted ones (Thunderbolt, and depending on your
risk tolerance possibly even the broadcom wifi) might also end up calling this.

Oh, right, I hadn't considered actual untrusted device support at this stage.

For streaming DMA swiotlb will make sure that these won't see memory
they're not supposed to access.

I was slightly surprised to see that that does appear to work out OK, but I guess SWIOTLB slots are already smaller than typical IOMMU pages, so it falls out of that. Neat.

But, at least as far as I understand it, no swiotlb is in the way to catch devices
who end up calling this function. That wasn't required because we used to get
PAGE_SIZE aligned allocation here and every IOMMU so far would be able to easily
map them without any spill overs.
But now we'll end up exposing three more unrelated pages if the allocation
is not increased.

Fair enough, but then why still waste memory in the (usual) cases where it logically *isn't* necessary?

*/
static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
@@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
out_unmap:
__iommu_dma_unmap(dev, *dma_handle, size);
- __iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+ __iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
return NULL;
}
@@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
struct dma_sgt_handle *sh = sgt_handle(sgt);
__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
- __iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
+ __iommu_dma_free_pages(sh->pages,
+ iommu_page_align(dev, size) >> PAGE_SHIFT);
sg_free_table(&sh->sgt);
kfree(sh);
}
@@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
if (dev_is_untrusted(dev))
return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
+ /*
+ * If the IOMMU pagesize is larger than the CPU pagesize we will
+ * very likely run into sgs with a physical address that is not aligned
+ * to an IOMMU page boundary. Fall back to just mapping every entry
+ * independently with __iommu_dma_map then.

Scatterlist segments often don't have nicely aligned ends, which is why
we already align things to IOVA granules in main loop here. I think in
principle we'd just need to move the non-IOVA-aligned part of the
address from sg->page to sg->offset in the temporary transformation for
the rest of the assumptions to hold. I don't blame you for being timid
about touching that, though - it took me 3 tries to get right when I
first wrote it...

I was a little bit afraid I'd get this exact reply :D
I'll try to modify the transformation again, but I'm sure it'll take me more than
3 tries to get it right :)


+ */
+ if (iovad->granule > PAGE_SIZE) {
+ for_each_sg(sg, s, nents, i) {
+ sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
+ s->length, prot, dma_get_mask(dev));
+ if (sg_dma_address(s) == DMA_MAPPING_ERROR)
+ break;
+ sg_dma_len(s) = s->length;
+ }
+
+ if (unlikely(i != nents)) {
+ nents = i;
+ for_each_sg(sg, s, nents, i)
+ __iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
+ return 0;
+ }
+
+ return nents;
+ }

Either way, NAK to having a *third* implementation of SG mapping in this
file which is fundamentally no different from the second one.

Good point. If for some reason I'm not able to modify the transformation correctly
I'll just fall back to iommu_dma_map_sg_swiotlb.


+
/*
* Work out how much IOVA space we need, and align the segments to
* IOVA granules for the IOMMU driver to handle. With some clever
@@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
dma_addr_t start, end;
struct scatterlist *tmp;
int i;
@@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
return;
}
+ /*
+ * If the IOMMU pagesize is larger than the CPU pagesize we mapped
+ * every entry indepedently with __iommu_dma_map then. Let's do the
+ * opposite here.
+ */
+ if (iovad->granule > PAGE_SIZE) {
+ for_each_sg(sg, tmp, nents, i)
+ __iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
+ return;
+ }

As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.

+
/*
* The scatterlist segments are mapped into a single
* contiguous IOVA allocation, so this is incredibly easy.
@@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
{
- size_t alloc_size = PAGE_ALIGN(size);
+ size_t alloc_size = iommu_page_align(dev, size);
int count = alloc_size >> PAGE_SHIFT;
struct page *page = NULL, **pages = NULL;
@@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
struct page **pagep, gfp_t gfp, unsigned long attrs)
{
bool coherent = dev_is_dma_coherent(dev);
- size_t alloc_size = PAGE_ALIGN(size);
+ size_t alloc_size = iommu_page_align(dev, size);
int node = dev_to_node(dev);
struct page *page = NULL;
void *cpu_addr;
@@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
!gfpflags_allow_blocking(gfp) && !coherent)
- page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
- gfp, NULL);
+ page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
+ &cpu_addr, gfp, NULL);
else
cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
if (!cpu_addr)
@@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs)

Can we just not bother trying to support this? TBH I don't know exactly
how the interface is supposed to work - what you're doing here looks
like it's probably either too much or not enough, depending on whether
the address and size arguments are supposed to allow representing
partial buffers - and frankly I can't imagine you'll be needing to
support dma-buf exports from the USB/ethernet/wifi drivers any time soon...

I'm not really sure how this is going to work even before my changes.
Just from reading the code it looks like even then it might be doing too much
or too little.
But this will also be used for Thunderbolt and who knows what kind of devices
will be connected there. I'm fine with just not supporting this interface unless
something actually breaks for some user though.

I would bet that the set of random Thunderbolt-attachable devices which participate in dma-buf exports and the set of media devices which expect vb2_dma_contig to work for arbitrary user buffers have a non-empty intersection. If you eat your cake you may subsequently discover that you no longer have your cake ;)

If we can't easily test it or reason about it, let's not pretend to implement it. I'd rather somebody discovered the lack of working support in a few years' time because it reliably and safely returned an error, rather than because it ate their page cache. Besides, it's not like people can't use a kernel built with the right PAGE_SIZE (which might perform better anyway) and not have the problem in the first place.

{
+ size_t aligned_size = iommu_page_align(dev, size);
struct page *page;
int ret;
@@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
if (pages) {
return sg_alloc_table_from_pages(sgt, pages,
- PAGE_ALIGN(size) >> PAGE_SHIFT,
+ aligned_size >> PAGE_SHIFT,
0, size, GFP_KERNEL);
}
@@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
if (!ret)
- sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
+ sg_set_page(sgt->sgl, page, aligned_size, 0);
return ret;
}
@@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
}
+static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
+{
+ size = iommu_page_align(dev, size);
+ return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
+}
+
+static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
+ dma_addr_t dma_handle, enum dma_data_direction dir)
+{
+ size = iommu_page_align(dev, size);
+ return dma_common_free_pages(dev, size, page, dma_handle, dir);
+}

Again, what's the point of these? iommu_dma_map_page() still has to cope
with whatever the caller provides, so there's no difference in the one
case when that caller happens to be dma_common_map_pages().

Same as above, untrusted devices.

Again fair enough, but in that case do it for untrusted devices. Not for the whole world for most of whom it still *is* a needless waste.

Robin.