Re: [PATCH -next] iommu/dma: avoid expensive indirect calls for sync operations

From: Eric Dumazet
Date: Tue Nov 22 2022 - 17:55:51 EST


On Tue, Nov 22, 2022 at 11:18 AM Michal Kubiak <michal.kubiak@xxxxxxxxx> wrote:
>
> On Sat, Nov 12, 2022 at 04:04:52AM +0000, Eric Dumazet wrote:
> > Quite often, NIC devices do not need dma_sync operations,
> > at least on x86_64.
> >
> > Indeed, when dev_is_dma_coherent(dev) is true and
> > dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
> > and friends do nothing.
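
For context, the early return at the top of iommu_dma_sync_single_for_cpu()
in drivers/iommu/dma-iommu.c looks roughly like this (simplified sketch,
body trimmed):

static void iommu_dma_sync_single_for_cpu(struct device *dev,
		dma_addr_t dma_handle, size_t size, enum dma_data_direction dir)
{
	/* Coherent device and no swiotlb bouncing: nothing to do. */
	if (dev_is_dma_coherent(dev) && !dev_use_swiotlb(dev))
		return;

	/* ... otherwise do the cache maintenance / bounce-buffer copy ... */
}
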
> >
> > However, indirectly calling them when CONFIG_RETPOLINE=y
> > consumes about 10% of cycles on a cpu receiving packets
> > from softirq at a ~100Gbit rate, as shown in [1].
> >
> > Even if/when CONFIG_RETPOLINE is not set, there
> > is a cost of about 3%.
> >
> > This patch adds a copy of iommu_dma_ops structure,
> > where sync_single_for_cpu, sync_single_for_device,
> > sync_sg_for_cpu and sync_sg_for_device are unset.
>
>
> Larysa from our team has found out that this patch also brings a
> functional improvement for batch allocation in AF_XDP when the iommu is
> turned on.
> In the 'xp_alloc_batch()' function there is a check whether DMA needs
> synchronization. If so, batch allocation is not supported and we can
> allocate only one buffer at a time.
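
That check, in net/xdp/xsk_buff_pool.c, is roughly the following
(simplified sketch of xp_alloc_batch(), batched fast path elided):

u32 xp_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max)
{
	/* If the device may need DMA syncs, fall back to allocating a
	 * single buffer so it can be synced individually.
	 */
	if (unlikely(pool->dma_need_sync)) {
		struct xdp_buff *buff = xp_alloc(pool);

		if (buff)
			*xdp = buff;
		return !!buff;
	}

	/* ... batched fast path ... */
}
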
>
> The 'dma_need_sync' flag is set according to the value returned by the
> function 'dma_need_sync()' (from 'kernel/dma/mapping.c').
> That function only checks whether at least one of two DMA ops is defined:
> 'ops->sync_single_for_cpu' or 'ops->sync_single_for_device'.
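
For reference, dma_need_sync() itself is short; a simplified sketch of the
helper in kernel/dma/mapping.c:

bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_map_direct(dev, ops))
		return dma_direct_need_sync(dev, dma_addr);
	/* With indirect ops, "needs sync" simply means the ops are set. */
	return ops->sync_single_for_cpu || ops->sync_single_for_device;
}
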
>
> > +static const struct dma_map_ops iommu_nosync_dma_ops = {
> > +	iommu_dma_ops_common_fields
> > +
> > +	.sync_single_for_cpu = NULL,
> > +	.sync_single_for_device = NULL,
> > +	.sync_sg_for_cpu = NULL,
> > +	.sync_sg_for_device = NULL,
> > +};
> > +#undef iommu_dma_ops_common_fields
> > +
> >  /*
> >   * The IOMMU core code allocates the default DMA domain, which the underlying
> >   * IOMMU driver needs to support via the dma-iommu layer.
> > @@ -1586,7 +1612,8 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
> >  	if (iommu_is_dma_domain(domain)) {
> >  		if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev))
> >  			goto out_err;
> > -		dev->dma_ops = &iommu_dma_ops;
> > +		dev->dma_ops = dev_is_dma_sync_needed(dev) ?
> > +				&iommu_dma_ops : &iommu_nosync_dma_ops;
> >  	}
> >
> >  	return;
>
> This code avoids installing the 'sync_*' DMA ops when they are not
> actually used. Thanks to that improvement, 'dma_need_sync()' will
> always return meaningful information about whether any DMA
> synchronization is actually needed for the iommu.
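
Putting the pieces together, for a coherent device that does not use
swiotlb the chain works out roughly like this with the patch applied:

/*
 * dev->dma_ops == &iommu_nosync_dma_ops
 *   => ops->sync_single_for_cpu == NULL && ops->sync_single_for_device == NULL
 *   => dma_need_sync(dev, dma) returns false when the pool is DMA-mapped
 *   => pool->dma_need_sync stays false
 *   => xp_alloc_batch() can take the batched fast path instead of
 *      falling back to one xp_alloc() call per buffer.
 */
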
>
> Together with Larysa we have applied that patch and we can confirm it
> helps batch buffer allocation in AF_XDP (the 'xsk_buff_alloc_batch()'
> call) when the iommu is enabled.

Thanks for testing!

I am quite busy relocating; I will address Christoph's feedback next week.