Re: [PATCH v9 5/6] iommu/dma: Allow a single FQ in addition to per-CPU FQs

From: Niklas Schnelle
Date: Tue May 23 2023 - 08:09:55 EST


On Mon, 2023-05-22 at 17:26 +0100, Robin Murphy wrote:
> On 2023-05-15 10:15, Niklas Schnelle wrote:
> > In some virtualized environments, including s390 paged memory guests,
> > IOTLB flushes are used to update IOMMU shadow tables. Due to this, they
> > are much more expensive than in typical bare metal environments or
> > non-paged s390 guests. In addition, they may parallelize more poorly in
> > virtualized environments. This changes the trade-off for flushing IOVAs
> > such that minimizing the number of IOTLB flushes trumps any benefit of
> > cheaper queuing operations or increased parallelism.
> >
> > In this scenario per-CPU flush queues pose several problems. Firstly,
> > per-CPU memory is often quite limited, prohibiting larger queues.
> > Secondly, collecting IOVAs per-CPU but flushing via a global timeout
> > reduces the number of IOVAs flushed per timeout, especially on s390
> > where PCI interrupts may not be bound to a specific CPU.
> >
> > Let's introduce a single flush queue mode that reuses the same queue
> > logic but only allocates a single global queue. This mode can be
> > selected as a flag bit in a new dma_iommu_options struct which can be
> > modified from its defaults by IOMMU drivers implementing a new
> > ops.tune_dma_iommu() callback. As a first user the s390 IOMMU driver
> > selects the single queue mode if IOTLB flushes are needed on map which
> > indicates shadow table use. With the unchanged small FQ size and
> > timeouts this setting is worse than per-CPU queues but a follow up patch
> > will make the FQ size and timeout variable. Together this allows the
> > common IOVA flushing code to more closely resemble the global flush
> > behavior used on s390's previous internal DMA API implementation.
> >
> > Link: https://lore.kernel.org/linux-iommu/3e402947-61f9-b7e8-1414-fde006257b6f@xxxxxxx/
> > Reviewed-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx> #s390
> > Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
> > ---
> > drivers/iommu/dma-iommu.c | 163 ++++++++++++++++++++++++++++++++++-----------
> > drivers/iommu/dma-iommu.h | 4 +-
> > drivers/iommu/iommu.c | 18 +++--
> > drivers/iommu/s390-iommu.c | 10 +++
> > include/linux/iommu.h | 21 ++++++
> > 5 files changed, 169 insertions(+), 47 deletions(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 7a9f0b0bddbd..be4cab6b4fe4 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -49,8 +49,11 @@ struct iommu_dma_cookie {
> > /* Full allocator for IOMMU_DMA_IOVA_COOKIE */
> > struct {
> > struct iova_domain iovad;
> > -
> > - struct iova_fq __percpu *fq; /* Flush queue */
> > + /* Flush queue */
> > + union {
> > + struct iova_fq *single_fq;
> > + struct iova_fq __percpu *percpu_fq;
> > + };
> > /* Number of TLB flushes that have been started */
> > atomic64_t fq_flush_start_cnt;
> > /* Number of TLB flushes that have been finished */
> > @@ -67,6 +70,8 @@ struct iommu_dma_cookie {
> >
> > /* Domain for flush queue callback; NULL if flush queue not in use */
> > struct iommu_domain *fq_domain;
> > + /* Options for dma-iommu use */
> > + struct dma_iommu_options options;
> > struct mutex mutex;
> > };
> >
> > @@ -152,25 +157,44 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
> > atomic64_inc(&cookie->fq_flush_finish_cnt);
> > }
> >
> > -static void fq_flush_timeout(struct timer_list *t)
> > +static void fq_flush_percpu(struct iommu_dma_cookie *cookie)
> > {
> > - struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> > int cpu;
> >
> > - atomic_set(&cookie->fq_timer_on, 0);
> > - fq_flush_iotlb(cookie);
> > -
> > for_each_possible_cpu(cpu) {
> > unsigned long flags;
> > struct iova_fq *fq;
> >
> > - fq = per_cpu_ptr(cookie->fq, cpu);
> > + fq = per_cpu_ptr(cookie->percpu_fq, cpu);
> > spin_lock_irqsave(&fq->lock, flags);
> > fq_ring_free(cookie, fq);
> > spin_unlock_irqrestore(&fq->lock, flags);
> > }
> > }
> >
> > +static void fq_flush_single(struct iommu_dma_cookie *cookie)
> > +{
> > + struct iova_fq *fq = cookie->single_fq;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&fq->lock, flags);
> > + fq_ring_free(cookie, fq);
> > + spin_unlock_irqrestore(&fq->lock, flags);
>
> Nit: this should clearly just be a self-locked version of fq_ring_free()
> that takes fq as an argument, then both the new case and the existing
> loop body become trivial one-line calls.

Sure, will do. Just one question about names. Taking
pci_reset_function_locked() as an example, the _locked suffix means the
relevant lock is already held, while pci_reset_function() does the
lock/unlock itself. In your wording the implied function names sound
the other way around. I can't find anything similar in drivers/iommu,
so would you mind going the PCI way and having:

fq_ring_free_locked(): called in queue_iova() with the lock held
fq_ring_free(): called in fq_flush_timeout(), takes the lock itself

Or maybe I'm just biased from having used the PCI *_locked() functions
before and there is a better convention.

>
> > +}
> > +
> > +static void fq_flush_timeout(struct timer_list *t)
> > +{
> > + struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> > +
> > + atomic_set(&cookie->fq_timer_on, 0);
> > + fq_flush_iotlb(cookie);
> > +
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + fq_flush_single(cookie);
> > + else
> > + fq_flush_percpu(cookie);
> > +}
> > +
> > static void queue_iova(struct iommu_dma_cookie *cookie,
> > unsigned long pfn, unsigned long pages,
> > struct list_head *freelist)
> > @@ -188,7 +212,11 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
> > */
> > smp_mb();
> >
> > - fq = raw_cpu_ptr(cookie->fq);
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + fq = cookie->single_fq;
> > + else
> > + fq = raw_cpu_ptr(cookie->percpu_fq);
> > +
> > spin_lock_irqsave(&fq->lock, flags);
> >
> > /*
> > @@ -219,58 +247,114 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
> > jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
> > }
> >
> > -static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
> > +static void iommu_dma_free_fq_single(struct iova_fq *fq)
> > +{
> > + int idx;
> > +
> > + if (!fq)
> > + return;
> > + fq_ring_for_each(idx, fq)
> > + put_pages_list(&fq->entries[idx].freelist);
> > + vfree(fq);
> > +}
> > +
> > +static void iommu_dma_free_fq_percpu(struct iova_fq __percpu *percpu_fq)
> > {
> > int cpu, idx;
> >
> > - if (!cookie->fq)
> > - return;
> > -
> > - del_timer_sync(&cookie->fq_timer);
> > /* The IOVAs will be torn down separately, so just free our queued pages */
> > for_each_possible_cpu(cpu) {
> > - struct iova_fq *fq = per_cpu_ptr(cookie->fq, cpu);
> > + struct iova_fq *fq = per_cpu_ptr(percpu_fq, cpu);
> >
> > fq_ring_for_each(idx, fq)
> > put_pages_list(&fq->entries[idx].freelist);
> > }
> >
> > - free_percpu(cookie->fq);
> > + free_percpu(percpu_fq);
> > +}
> > +
> > +static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
> > +{
> > + if (!cookie->fq_domain)
> > + return;
> > +
> > + del_timer_sync(&cookie->fq_timer);
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + iommu_dma_free_fq_single(cookie->single_fq);
> > + else
> > + iommu_dma_free_fq_percpu(cookie->percpu_fq);
> > +}
> > +
> > +
> > +static void iommu_dma_init_one_fq(struct iova_fq *fq)
> > +{
> > + int i;
> > +
> > + fq->head = 0;
> > + fq->tail = 0;
> > +
> > + spin_lock_init(&fq->lock);
> > +
> > + for (i = 0; i < IOVA_FQ_SIZE; i++)
> > + INIT_LIST_HEAD(&fq->entries[i].freelist);
> > +}
> > +
> > +static int iommu_dma_init_fq_single(struct iommu_dma_cookie *cookie)
> > +{
> > + struct iova_fq *queue;
> > +
> > + queue = vzalloc(sizeof(*queue));
> > + if (!queue)
> > + return -ENOMEM;
> > + iommu_dma_init_one_fq(queue);
> > + cookie->single_fq = queue;
> > +
> > + return 0;
> > +}
> > +
> > +static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie)
> > +{
> > + struct iova_fq __percpu *queue;
> > + int cpu;
> > +
> > + queue = alloc_percpu(struct iova_fq);
> > + if (!queue)
> > + return -ENOMEM;
> > +
> > + for_each_possible_cpu(cpu)
> > + iommu_dma_init_one_fq(per_cpu_ptr(queue, cpu));
> > + cookie->percpu_fq = queue;
> > + return 0;
> > }
> >
> > /* sysfs updates are serialised by the mutex of the group owning @domain */
> > -int iommu_dma_init_fq(struct iommu_domain *domain)
> > +int iommu_dma_init_fq(struct device *dev, struct iommu_domain *domain)
> > {
> > struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > - struct iova_fq __percpu *queue;
> > - int i, cpu;
> > + const struct iommu_ops *ops = dev_iommu_ops(dev);
> > + int rc;
> >
> > if (cookie->fq_domain)
> > return 0;
> >
> > + if (ops->tune_dma_iommu)
> > + ops->tune_dma_iommu(dev, &cookie->options);
> > +
> > atomic64_set(&cookie->fq_flush_start_cnt, 0);
> > atomic64_set(&cookie->fq_flush_finish_cnt, 0);
> >
> > - queue = alloc_percpu(struct iova_fq);
> > - if (!queue) {
> > + if (cookie->options.flags & IOMMU_DMA_OPTS_SINGLE_QUEUE)
> > + rc = iommu_dma_init_fq_single(cookie);
> > + else
> > + rc = iommu_dma_init_fq_percpu(cookie);
> > +
> > + if (rc) {
> > pr_warn("iova flush queue initialization failed\n");
> > - return -ENOMEM;
> > + /* fall back to strict mode */
> > + domain->type = IOMMU_DOMAIN_DMA;
>
> Why move this? It doesn't logically belong to FQ initialisation itself.

Ah yes, this is not needed anymore. Previously, when I had a new domain
type, I think I needed to set domain->type in here and moved the
fallback for consistency. Will remove that change.

>
> > + return rc;
> > }
> >
> > - for_each_possible_cpu(cpu) {
> > - struct iova_fq *fq = per_cpu_ptr(queue, cpu);
> > -
> > - fq->head = 0;
> > - fq->tail = 0;
> > -
> > - spin_lock_init(&fq->lock);
> > -
> > - for (i = 0; i < IOVA_FQ_SIZE; i++)
> > - INIT_LIST_HEAD(&fq->entries[i].freelist);
> > - }
> > -
> > - cookie->fq = queue;
> > -
> > timer_setup(&cookie->fq_timer, fq_flush_timeout, 0);
> > atomic_set(&cookie->fq_timer_on, 0);
> > /*
> >
---8<---
> > static struct iommu_device *s390_iommu_probe_device(struct device *dev)
> > {
> > struct zpci_dev *zdev;
> > @@ -793,6 +802,7 @@ static const struct iommu_ops s390_iommu_ops = {
> > .device_group = generic_device_group,
> > .pgsize_bitmap = SZ_4K,
> > .get_resv_regions = s390_iommu_get_resv_regions,
> > + .tune_dma_iommu = s390_iommu_tune_dma_iommu,
> > .default_domain_ops = &(const struct iommu_domain_ops) {
> > .attach_dev = s390_iommu_attach_device,
> > .map_pages = s390_iommu_map_pages,
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 58891eddc2c4..3649a17256a5 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -219,6 +219,21 @@ struct iommu_iotlb_gather {
> > bool queued;
> > };
> >
> > +/**
> > + * struct dma_iommu_options - Options for dma-iommu
> > + *
> > + * @flags: Flag bits for enabling/disabling dma-iommu settings
> > + *
> > + * This structure is intended to provide IOMMU drivers a way to influence the
> > + * behavior of the dma-iommu DMA API implementation. This allows optimizing,
> > + * for example, for a virtualized environment with slow IOTLB flushes.
> > + */
> > +struct dma_iommu_options {
> > +#define IOMMU_DMA_OPTS_PER_CPU_QUEUE (0L << 0)
> > +#define IOMMU_DMA_OPTS_SINGLE_QUEUE (1L << 0)
> > + u64 flags;
> > +};
>
> I think for now this can just use a bit in dev_iommu to indicate that
> the device will prefer a global flush queue; s390 can set that in
> .probe_device, then iommu_dma_init_domain() can propagate it to an
> equivalent flag in the cookie (possibly even a new cookie type?) that
> iommu_dma_init_fq() can then consume. Then just make the s390 parameters
> from patch #6 the standard parameters for a global queue.
>
> Thanks,
> Robin.

Sounds good.

>
> > +
> > /**
> > * struct iommu_ops - iommu ops and capabilities
> > * @capable: check capability
> > @@ -242,6 +257,9 @@ struct iommu_iotlb_gather {
> > * - IOMMU_DOMAIN_IDENTITY: must use an identity domain
> > * - IOMMU_DOMAIN_DMA: must use a dma domain
> > * - 0: use the default setting
> > + * @tune_dma_iommu: Allows the IOMMU driver to modify the default
> > + * options of the dma-iommu layer for a specific
> > + * device.
> > * @default_domain_ops: the default ops for domains
> > * @remove_dev_pasid: Remove any translation configurations of a specific
> > * pasid, so that any DMA transactions with this pasid
> > @@ -278,6 +296,9 @@ struct iommu_ops {
> > int (*def_domain_type)(struct device *dev);
> > void (*remove_dev_pasid)(struct device *dev, ioasid_t pasid);
> >
> > + void (*tune_dma_iommu)(struct device *dev,
> > + struct dma_iommu_options *options);
> > +
> > const struct iommu_domain_ops *default_domain_ops;
> > unsigned long pgsize_bitmap;
> > struct module *owner;
> >
>