Re: [PATCH v4] iommu: Optimise PCI SAC address trick

From: John Garry
Date: Thu Jun 15 2023 - 08:19:00 EST


On 15/06/2023 12:41, Robin Murphy wrote:

>> Sure, not the same problem.
>>
>> However, when we switched storage drivers to use dma_opt_mapping_size(), performance was similar to iommu.forcedac=1 - that's what I found, anyway.
>>
>> This tells me that even though IOVA allocator performance is poor once the 32-bit space fills, it was the large IOVAs which don't fit in the rcache that were the major contributor to hogging the CPU in the allocator.

> The root cause is that every time the last usable 32-bit IOVA is allocated, the *next* PCI caller to hit the rbtree for a SAC allocation is burdened with walking the whole 32-bit subtree to determine that it's full again and re-set max32_alloc_size. That's the overhead that forcedac avoids.


Sure

> In the storage case with larger buffers, dma_opt_mapping_size() also means you spend less time in the rbtree, but mainly because you're inherently hitting it less often, since most allocations can now hopefully be fulfilled by the caches.

Sure
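For reference, the switch on our side was roughly this pattern in the SCSI host setup path (a sketch - the exact field and rounding details vary per driver):

```c
	/* Cap the max transfer size so that streaming DMA mappings stay
	 * within the size the IOVA rcaches can serve. */
	shost->max_sectors = min_t(size_t, shost->max_sectors,
				   dma_opt_mapping_size(dev) >> SECTOR_SHIFT);
```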

> That's obviously moot when the mappings are already small enough to be cached and the only reason for hitting the rbtree is overflow/underflow in the depot because the working set is sufficiently large and the allocation pattern sufficiently "bursty".

After a bit of checking, this is the same issue as https://lore.kernel.org/linux-iommu/20230329181407.3eed7378@xxxxxxxxxx/, and indeed we would always be using rcache'able-sized mappings there.

So you think that we are hitting the depot-full case, where we start to free depot magazines in __iova_rcache_insert(), right? From my experience in storage testing, it takes a long time to reach that state.

Thanks,
John