Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

From: Petr Tesařík
Date: Fri Nov 03 2023 - 15:00:10 EST


Hello Niklas,

thank you for your report. This is some great analysis!

On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:

> Hi Swiotlb Maintainers,
>
> With s390 slated to use dma-iommu.c I was experimenting with
> CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
> Debugging this I've managed to narrow things down to what I believe is
> a memory corruption issue caused by overrunning the entire transient
> memory pool allocated by swiotlb. In my test this happens on the
> iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
> untrusted PCI devices.
>
> I've seen this happen only for transient pools when:
> * allocation size >=PAGE_SIZE and,
> * the original address of the mapping is not page aligned.
>
> The overrun can be seen clearly by adding a simple "tlb_addr +
> alloc_size > pool->end' overrun check to swiotlb_tbl_map_single() and
> forcing PCI test devices to be untrusted. With that in place while an
> NVMe fio work load runs fine TCP/IP tests on a Mellanox ConnectX-4 VF
> reliably trigger the overrun check and with a debug print produce
> output like the following:
>
> software IO TLB:
> transient:1
> index:1
> dma_get_min_align_mask(dev):0
> orig_addr:ac0caebe
> tlb_addr=a4d0f800
> start:a4d0f000
> end:a4d10000
> pool_size:4096
> alloc_size:4096
> offset:0
>
> The complete code used for this test is available on my
> git.kernel.org[0] repository but it's bascially v6.6 + iommu/next (for
> s390 DMA API) + 2 trivial debug commits.
>
> For further analysis I've worked closely with Halil Pasic.
>
> The reason why we think this is happening seems twofold. Under a
> certain set of circumstances in the function swiotlb_find_slots():
> 1) the allocated transient pool can not fit the required bounce buffer,

I am aware that this can happen. It's in fact why the index returned by
swiotlb_search_pool_area() is checked in swiotlb_find_slots().

> and
> 2) the invocation of swiotlb_pool_find_slots() finds "a suitable
> slot" even though it should fail.

This needs fixing.

> The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
> suitable bounce buffer in the pool is that for transient pools pool-
> >slots[i].list end up nonsensical, because swiotlb_init_io_tlb_pool(),
> among other places in swiotlb, assumes that the nslabs of the pool is a
> multiple of IO_TLB_SEGSIZE

Ah... Yes, the transient pool size must also be a multiple of segment
size, but it is not enforced.

This reminds me of my debugging session that resulted in commit
aabd12609f91 ("swiotlb: always set the number of areas before
allocating the pool") and this conversation:

https://lore.kernel.org/linux-iommu/20230629074403.7f8ed9d6@xxxxxxxxxxxxxxxxxxxx/

> The reason for 1) is a bit more convoluted and not entirely understood
> by us. We are certain though that the function swiotlb_find_slots()
> allocates a pool with nr_slots(alloc_size), where this alloc_size is
> the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
> but for alignment reasons some slots may get "skipped over" in
> swiotlb_area_find_slots() causing the picked slots to overrun the
> allocation.

Yes, this can happen.

> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So would appreciate your
> input.

I don't think it's possible to improve the allocation logic without
modifying the page allocator and/or the DMA atomic pool allocator to
take additional constraints into account.

I had a wild idea back in March, but it would require some intrusive
changes in the mm subsystem. Among other things, it would make memory
zones obsolete. I mean, people may actually like to get rid of DMA,
DMA32 and NORMAL, but you see how many nasty bugs were introduced even
by a relatively small change in SWIOTLB. Replacing memory zones with a
system based on generic physical allocation constraints would probably
blow up the universe. ;-)

Petr T