Re: [PATCH 1/2] swiotlb: Fix allocation alignment requirement when searching slots

From: Petr Tesařík
Date: Fri Jan 26 2024 - 12:01:38 EST


On Fri, 26 Jan 2024 15:19:55 +0000
Will Deacon <will@xxxxxxxxxx> wrote:

> Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> checks"), causes a functional regression with vsock in a virtual machine
> using bouncing via a restricted DMA SWIOTLB pool.
>
> When virtio allocates the virtqueues for the vsock device using
> dma_alloc_coherent(), the SWIOTLB search fails to take into account the
> 8KiB buffer size and returns page-unaligned allocations if 'area->index'
> was left unaligned by a previous allocation from the buffer:
>
> # Final address in brackets is the SWIOTLB address returned to the caller
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
>
> This ends in tears (typically buffer corruption and/or a hang) because
> swiotlb_alloc() blindly returns the 'struct page' corresponding to the
> allocation and therefore the first half of the page ends up being
> allocated twice.
>
> Fix the problem by treating the allocation alignment separately to any
> additional alignment requirements from the device, using the maximum
> of the two as the stride to search the buffer slots.
>
> Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
> Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
> Cc: Robin Murphy <robin.murphy@xxxxxxx>
> Cc: Petr Tesarik <petr.tesarik1@xxxxxxxxxxxxxxxxxxx>
> Cc: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> ---
> kernel/dma/swiotlb.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index b079a9a8e087..25febb9e670c 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> unsigned long max_slots = get_max_slots(boundary_mask);
> unsigned int iotlb_align_mask =
> - dma_get_min_align_mask(dev) | alloc_align_mask;
> + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);

Good. So, iotlb_align_mask now specifies how many low bits of orig_addr
should be preserved in the bounce buffer address, ignoring the offset
within the TLB slot...

> unsigned int nslots = nr_slots(alloc_size), stride;
> unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> unsigned int index, slots_checked, count = 0, i;
> @@ -998,14 +998,13 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> * allocations.
> */
> if (alloc_size >= PAGE_SIZE)
> - iotlb_align_mask |= ~PAGE_MASK;
> - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> + alloc_align_mask |= ~PAGE_MASK;

..and alloc_align_mask specifies the desired TLB slot alignment.

>
> /*
> * For mappings with an alignment requirement don't bother looping to
> * unaligned slots once we found an aligned one.
> */
> - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> + stride = (max(alloc_align_mask, iotlb_align_mask) >> IO_TLB_SHIFT) + 1;

I'm not quite sure about this one.

And I'm not even sure all combinations make sense!

For example, take these values:

* TLB_SIZE == 0x800 (2K)
* alloc_align_mask == 0xffffffffffffc000 (16K alignment, could be page size)
* iotlb_align_mask == 0xffffffffffff0000 (64K alignment)
* orig_addr == 0x0000000000001234

Only the lowest 16 bits are relevant for the alignment check.
Device alignment requires 0x1000.
Alloc alignment requires one of 0x0000, 0x4000, 0x8000, or 0xc000.
Obviously, such allocation must always fail...

>
> spin_lock_irqsave(&area->lock, flags);
> if (unlikely(nslots > pool->area_nslabs - area->used))
> @@ -1015,15 +1014,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> index = area->index;
>
> for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> + phys_addr_t tlb_addr;
> +
> slot_index = slot_base + index;
> + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> +
> + if (tlb_addr & alloc_align_mask)
> + goto next_slot;

Awww, come on. So your code jumps to a label and then inserts an
unconditional continue just before that label? I'm sure we'll find a
cleaner way to convey the loop logic. What about this:

if ((tlb_addr & alloc_align_mask) != 0 ||
(orig_addr && (tlb_addr & io_tlb_align_mask !=
orig_addr & iotlb_align_mask))) {
index = wrap_area_index(pool, index + 1);
slots_checked++;
continue;
}

But yes, this patch looks like it should finally do the right thing.

Petr T

>
> if (orig_addr &&
> - (slot_addr(tbl_dma_addr, slot_index) &
> - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> - index = wrap_area_index(pool, index + 1);
> - slots_checked++;
> - continue;
> - }
> + (tlb_addr & iotlb_align_mask) !=
> + (orig_addr & iotlb_align_mask))
> + goto next_slot;
>
> if (!iommu_is_span_boundary(slot_index, nslots,
> nr_slots(tbl_dma_addr),
> @@ -1033,6 +1035,10 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> }
> index = wrap_area_index(pool, index + stride);
> slots_checked += stride;
> + continue;
> +next_slot:
> + index = wrap_area_index(pool, index + 1);
> + slots_checked++;
> }
>
> not_found: