RE: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to broken alignment handling

From: Michael Kelley
Date: Fri Feb 23 2024 - 12:05:15 EST


From: Will Deacon <will@xxxxxxxxxx> Sent: Friday, February 23, 2024 4:48 AM
> On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote:
> > From: Will Deacon <will@xxxxxxxxxx> Sent: Wednesday, February 21, 2024 3:35 AM

[snip]

> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index b079a9a8e087..2ec2cc81f1a2 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > > unsigned long max_slots = get_max_slots(boundary_mask);
> > > unsigned int iotlb_align_mask =
> > > - dma_get_min_align_mask(dev) | alloc_align_mask;
> > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > > unsigned int nslots = nr_slots(alloc_size), stride;
> > > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > > unsigned int index, slots_checked, count = 0, i;
> > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > BUG_ON(!nslots);
> > > BUG_ON(area_index >= pool->nareas);
> > >
> > > + /*
> > > + * For mappings with an alignment requirement don't bother looping to
> > > + * unaligned slots once we found an aligned one.
> > > + */
> > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> > > +
> > > /*
> > > * For allocations of PAGE_SIZE or larger only look for page aligned
> > > * allocations.
> > > */
> > > if (alloc_size >= PAGE_SIZE)
> > > - iotlb_align_mask |= ~PAGE_MASK;
> > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > > -
> > > - /*
> > > - * For mappings with an alignment requirement don't bother looping to
> > > - * unaligned slots once we found an aligned one.
> > > - */
> > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
> >
> > Is this special handling of alloc_size >= PAGE_SIZE really needed?
>
> I've been wondering that as well, but please note that this code (and the
> comment) are in the upstream code, so I was erring in favour of keeping
> that while fixing the bugs. We could have an extra patch dropping it if
> we can convince ourselves that it's not adding anything, though.
>
> > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
> > alloc_align_mask is zero, the requirement is for the alignment to match
> > the DMA min_align_mask bits in orig_addr, even if the allocation is
> > larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
> > case passes in alloc_align_mask to handle page size and larger requests.
> > So it seems like this doesn't do anything useful unless orig_addr and
> > alloc_align_mask are both zero, and there aren't any cases of that
> > after this patch series. If the caller wants alignment, specify
> > it with alloc_align_mask.
>
> It's an interesting observation. Presumably the intention here is to
> reduce the cost of the linear search, but the code originates from a
> time when we didn't have iotlb_align_mask or alloc_align_mask and so I
> tend to agree that it should probably just be dropped. I'm also not even
> convinced that it works properly if the initial search index ends up
> being 2KiB (i.e. slot) aligned -- we'll end up jumping over the
> page-aligned addresses!
>
> I'll add another patch to v5 which removes this check (and you've basically
> written the commit message for me, so thanks).

Works for me.

>
> > > spin_lock_irqsave(&area->lock, flags);
> > > if (unlikely(nslots > pool->area_nslabs - area->used))
> > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > index = area->index;
> > >
> > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> > > - slot_index = slot_base + index;
> > > + phys_addr_t tlb_addr;
> > >
> > > - if (orig_addr &&
> > > - (slot_addr(tbl_dma_addr, slot_index) &
> > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> > > + slot_index = slot_base + index;
> > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> > > +
> > > + if ((tlb_addr & alloc_align_mask) ||
> > > + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> > > + (orig_addr & iotlb_align_mask))) {
> >
> > It looks like these changes will cause a mapping failure in some
> > iommu_dma_map_page() cases that previously didn't fail.
>
> Hmm, it's really hard to tell. This code has been quite badly broken for
> some time, so I'm not sure how far back you have to go to find a kernel
> that would work properly (e.g. for Nicolin's case with 64KiB pages).
>
> > Everything is made right by Patch 4 of your series, but from a
> > bisect standpoint, there will be a gap where things are worse.
> > In [1], I think Nicolin reported a crash with just this patch applied.
>
> In Nicolin's case, I think it didn't work without the patch either; this
> just triggered the failure earlier.
>
> > While the iommu_dma_map_page() case can already fail due to
> > "too large" requests because of not setting a max mapping size,
> > this patch can cause smaller requests to fail as well until Patch 4
> > gets applied. That might be a problem to avoid, perhaps by
> > merging the Patch 4 changes into this patch.
>
> I'll leave this up to Christoph. Personally, I'm keen to avoid having
> a giant patch trying to fix all the SWIOTLB allocation issues in one go,
> as it will inevitably get reverted due to a corner case that we weren't
> able to test properly, breaking the common cases at the same time.
>

Yes, I agree there's a tradeoff against cramming all the changes into
one big patch, so I'm OK with whichever approach is taken.

FWIW, here is the case I'm concerned about being broken after this
patch, but before Patch 4 of the series:

* alloc_align_mask is 0xFFFF (e.g., due to 64K IOMMU granule)
* iotlb_align_mask is 0x800 (DMA min_align_mask is 4K - 1, as for NVMe)
* orig_addr is non-zero and has bit 0x800 set

In the new "if" statement, any tlb_addr that produces "false" for
the left half of the "||" operator has its low 16 bits clear, so bit
0x800 is clear as well. Since orig_addr has bit 0x800 set, the
comparison in the right half then produces "true". The entire "if"
statement therefore always evaluates to true, and the "for" loop
never finds any slots that can be used. In other words, for this
case there's no way for the returned swiotlb memory to be aligned to
alloc_align_mask and to orig_addr (modulo DMA min_align_mask) at
the same time, and the mapping fails.
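
As a rough illustration of the case above, here is a small userspace
sketch (not kernel code) that applies the same rejection test to every
2 KiB slot in a made-up pool and counts the slots that survive. The
helper names slot_rejected() and count_usable_slots(), the pool base
address, and the orig_addr values are all hypothetical and chosen only
for the demonstration; the condition itself is copied from the new "if"
statement in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1UL << IO_TLB_SHIFT)

/*
 * Same test as the new "if" statement in swiotlb_search_pool_area():
 * returns true when the slot at tlb_addr must be skipped.
 */
static bool slot_rejected(uint64_t tlb_addr, uint64_t orig_addr,
			  uint64_t alloc_align_mask,
			  uint64_t iotlb_align_mask)
{
	return (tlb_addr & alloc_align_mask) ||
	       (orig_addr && (tlb_addr & iotlb_align_mask) !=
			     (orig_addr & iotlb_align_mask));
}

/*
 * Hypothetical helper: walk nslots consecutive slots starting at base
 * and count how many pass the test above.
 */
static unsigned long count_usable_slots(uint64_t base, unsigned long nslots,
					uint64_t orig_addr,
					uint64_t alloc_align_mask,
					uint64_t iotlb_align_mask)
{
	unsigned long i, usable = 0;

	/* Slots are IO_TLB_SIZE (2 KiB) aligned by construction. */
	assert((base & (IO_TLB_SIZE - 1)) == 0);

	for (i = 0; i < nslots; i++) {
		uint64_t tlb_addr = base + (uint64_t)i * IO_TLB_SIZE;

		if (!slot_rejected(tlb_addr, orig_addr,
				   alloc_align_mask, iotlb_align_mask))
			usable++;
	}
	return usable;
}
```

With alloc_align_mask = 0xFFFF, iotlb_align_mask = 0x800 and an
orig_addr that has bit 0x800 set, count_usable_slots() returns 0 no
matter how large the pool is. Clearing bit 0x800 in orig_addr, or
passing alloc_align_mask = 0, makes usable slots appear again, which
is consistent with the two requirements being individually
satisfiable but mutually exclusive in this case.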

Michael