Re: [PATCH v4] iommu: Optimise PCI SAC address trick

From: John Garry
Date: Tue Apr 18 2023 - 14:50:35 EST


On 18/04/2023 18:36, Linus Torvalds wrote:
>> JFYI, since you are using NVMe, you could also alternatively try
>> something like what I did for some SCSI storage controller drivers to
>> limit the request_queue max_sectors soft limit, like:
> That patch is not only whitespace-damaged, it's randomly missing one
> '+' character

My copy and paste error

> so it makes no sense even ignoring the whitespace
> problems. _And_ it has a nonsensical cast to 'unsigned int' which
> makes that 'min()' possibly do crazy and invalid things (ie imagine
> dma_opt_mapping_size() returning 4GB).
>
> You can't cast things to the smaller size just to get rid of a
> warning, for chrissake!

Yeah, sorry, I was just trying to show a very quick demo of how this can actually be done.

Indeed, I could have mentioned that it would have been easier to test by feeding a lower limit into /sys/block/<dev>/queue/max_sectors_kb.
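
For completeness, a corrected form of that quick demo could look something like the sketch below (untested and illustrative only; the helper name and its placement are mine, not the actual hunk). Comparing in sector units avoids the narrowing cast, and a zero return from dma_opt_mapping_size() is treated as "no preference" rather than clamping the queue to nothing:

/*
 * Untested sketch, not the original demo hunk: clamp the soft
 * max_sectors limit by dma_opt_mapping_size() without a narrowing
 * cast.  Comparing in sector units keeps the values well inside
 * unsigned int range, and a zero return (e.g. the HAS_DMA=n
 * fallback) leaves the existing limit alone.
 */
#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/minmax.h>

static void demo_clamp_max_sectors(struct queue_limits *lim,
				   struct device *dma_dev)
{
	size_t opt_bytes = dma_opt_mapping_size(dma_dev);

	if (!opt_bytes)
		return;

	lim->max_sectors = min_t(unsigned long, lim->max_sectors,
				 opt_bytes >> SECTOR_SHIFT);
}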


> In fact, even without the cast, it seems entirely broken, since the
> fallback for dma_opt_mapping_size() is to return 0 (admittedly _that_
> case only happens with HAS_DMA=n).
>
> Finally, doing this inside the
>
>     if (ctrl->max_hw_sectors) {
I think that this would be set for PCI NVMe controllers, which are what we were interested in here. But, indeed, I could check for a better place to set this.


> conditional seems entirely wrong, since any dma mapping limits would
> be entirely independent of any driver maximum hw size, and in fact
> *easier* to hit if the block device itself doesn't have any max
> limits.
>
> So please burn that patch in the darkest pits of hell and let's try to
> forget it ever existed. Ok?

Sure


> Also, shouldn't any possible dma mapping size affect not
> 'max_sectors', but 'max_segment_size'? At least the docs imply that
> dma_opt_mapping_size() is about the max size of a _single_ mapping,
> not of the whole thing?

It's meant to apply to the total mapping length and not to a single segment, so the doc would indeed be misleading there.
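
To illustrate the distinction, here is a hedged sketch (the helper is hypothetical, not how any current driver is wired up): dma_get_max_seg_size() is a hard per-segment constraint, while dma_opt_mapping_size() is a performance hint on the total length mapped for a request, which is why it fits max_sectors rather than max_segment_size:

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/minmax.h>

/* Hypothetical illustration of which DMA API feeds which queue limit. */
static void demo_apply_dma_limits(struct queue_limits *lim,
				  struct device *dma_dev)
{
	size_t opt_bytes = dma_opt_mapping_size(dma_dev);

	/* Hard correctness limit on a single segment's length. */
	lim->max_segment_size = min_t(unsigned int, lim->max_segment_size,
				      dma_get_max_seg_size(dma_dev));

	/* Performance hint on the total mapping length of a request. */
	if (opt_bytes)
		lim->max_sectors = min_t(unsigned long, lim->max_sectors,
					 opt_bytes >> SECTOR_SHIFT);
}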


> Anyway, if this is actually an issue, to the point that it's now being
> discussed for a _second_ block driver subsystem, then shouldn't the
> queue handling just do this all automatically, instead of adding
> random crap to random block driver architectures?

Other storage controllers may enjoy better performance with very large DMA mappings (whose total length exceeds the IOVA caching limit), so it was too risky to apply a performance-related change of this nature across the board when that API was introduced.

So far it has only been a single controller where we actually saw the issue of alloc'ing IOVAs giving very (very) poor performance.

However, as far as I am aware, there was nothing special about that controller, apart from the fact that it was often creating requests whose length exceeded the IOVA caching limit, and that it was also filling the 32-bit IOVA space quickly - possibly because the system had lots of CPUs.

Since there are now reports of poor performance from other storage controllers and also from networking adapters, I can only assume that people are testing more often on IOMMU-enabled systems with lots of CPUs. Having said that, I would still be cautious about applying that limit everywhere.


> And no, I don't know this code, so maybe I'm entirely missing
> something, but that patch just raised my hackles enough that I had to
> say something.

Sure.

Thanks,
John