Re: [PATCH V1] vfio: add attach_group_by_node to control behavior of attaching group to domain

From: Zhenguo Yao
Date: Tue Nov 21 2023 - 20:02:33 EST


Understood, thanks!

Alex Williamson <alex.williamson@xxxxxxxxxx> 于2023年11月18日周六 06:27写道:
>
> On Wed, 15 Nov 2023 10:02:09 +0800
> yaozhenguo <yaozhenguo1@xxxxxxxxx> wrote:
>
> > From: Zhenguo Yao <yaozhenguo1@xxxxxxxxx>
> >
> > Groups will attach to one iommu_domain if ops and enforce_cache_coherency
> > are equal. And all the iommu hardware share one pagetable by default.
> > There are performance issue in some scenarios. For example:
> > Host hardware topopy:
> >
> > node0 + PCIe RP0 ---+ GPU A100
> > | |---+ GPU A100
> > | |---+ NIC Mellanox CX6
> > | |---+ NIC Mellanox CX6
> > + PCIe RP1 ---+ GPU A100
> > |---+ GPU A100
> > |---+ NIC Mellanox CX6
> > |---+ NIC Mellanox CX6
> > node1 + PCIe RP0 ---+ GPU A100
> > | |---+ GPU A100
> > | |---+ NIC Mellanox CX6
> > | |---+ NIC Mellanox CX6
> > + PCIe RP1 ---+ GPU A100
> > |---+ GPU A100
> > |---+ NIC Mellanox CX6
> > |---+ NIC Mellanox CX6
> >
> > We passthrough all NICs and GPU to VM, and emulate host hardware topopy.
> > Mellanox CX6 ATS feature is enabled, GPU direct RDMA enabled.
> > We test NCCL allreduce in VM at different cases.
> >
> > Case1: allreduce test use 4nic and 4GPU in numa0.
> > Case2:allreduce test use 4nic and 4GPU in numa1.
> > case3: allreduce test use 8nic and 8GPU.
> >
> > the result are below:
> >
> > | | algbw (GB/S) |
> > | ------ | -------------|
> > | case1 | 24 |
> > | case2 | 32 |
> > | case3 | 45 |
> >
> > We checked that IOMMU pagetable is allocated in numa1 when VM boot up.
> > So, if IOTLB miss happan, IOMMU hardware in numa0 will access remote
> > pagetable in numa1. This will drop performance. After apply this patch and
> > attach_group_by_node is 1. Group in same node will attach to one domain.
> > IOMMU will access there local pagetable. Performance is improved:
> >
> > | | algbw (GB/S) |
> > | ------ | -------------|
> > | case1 | 32 |
> > | case2 | 32 |
> > | case3 | 63 |
> >
> > Signed-off-by: Zhenguo Yao <yaozhenguo1@xxxxxxxxx>
> > Co-developed-by: Wenchao Yao <yaowenchao@xxxxxx>
> > Signed-off-by: Wenchao Yao <yaowenchao@xxxxxx>
> > Co-developed-by: ZiHan Zhou <zhouzihan30@xxxxxx>
> > Signed-off-by: ZiHan Zhou <zhouzihan30@xxxxxx>
> > ---
> > drivers/iommu/intel/iommu.c | 8 +++++++-
> > drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++------------
> > include/linux/iommu.h | 1 +
> > 3 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 3531b95..2c6d8f0 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -569,8 +569,10 @@ void domain_update_iommu_cap(struct dmar_domain *domain)
> > * If RHSA is missing, we should default to the device numa domain
> > * as fall back.
> > */
> > - if (domain->nid == NUMA_NO_NODE)
> > + if (domain->nid == NUMA_NO_NODE) {
> > domain->nid = domain_update_device_node(domain);
> > + domain->domain.nid = domain->nid;
> > + }
> >
> > /*
> > * First-level translation restricts the input-address to a
> > @@ -1767,6 +1769,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
> > return NULL;
> >
> > domain->nid = NUMA_NO_NODE;
> > + domain->domain.nid = NUMA_NO_NODE;
> > if (first_level_by_default(type))
> > domain->use_first_level = true;
> > domain->has_iotlb_device = false;
> > @@ -1808,6 +1811,8 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> > info->refcnt = 1;
> > info->did = num;
> > info->iommu = iommu;
> > + domain->nid = iommu->node;
> > + domain->domain.nid = iommu->node;
> > curr = xa_cmpxchg(&domain->iommu_array, iommu->seq_id,
> > NULL, info, GFP_ATOMIC);
> > if (curr) {
> > @@ -1837,6 +1842,7 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> > clear_bit(info->did, iommu->domain_ids);
> > xa_erase(&domain->iommu_array, iommu->seq_id);
> > domain->nid = NUMA_NO_NODE;
> > + domain->domain.nid = NUMA_NO_NODE;
> > domain_update_iommu_cap(domain);
> > kfree(info);
> > }
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index eacd6ec..6a5641e 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -59,6 +59,11 @@
> > module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
> > MODULE_PARM_DESC(dma_entry_limit,
> > "Maximum number of user DMA mappings per container (65535).");
> > +static uint attach_group_by_node;
> > +module_param_named(attach_group_by_node,
> > + attach_group_by_node, uint, 0644);
> > +MODULE_PARM_DESC(attach_group_by_node,
> > + "Attach group to domain when it's in same node");
> >
> > struct vfio_iommu {
> > struct list_head domain_list;
> > @@ -2287,19 +2292,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > if (d->domain->ops == domain->domain->ops &&
> > d->enforce_cache_coherency ==
> > domain->enforce_cache_coherency) {
> > - iommu_detach_group(domain->domain, group->iommu_group);
> > - if (!iommu_attach_group(d->domain,
> > - group->iommu_group)) {
> > - list_add(&group->next, &d->group_list);
> > - iommu_domain_free(domain->domain);
> > - kfree(domain);
> > - goto done;
> > - }
> > + if ((attach_group_by_node == 1 &&
> > + d->domain->nid == domain->domain->nid) ||
> > + attach_group_by_node == 0) {
> > + iommu_detach_group(domain->domain, group->iommu_group);
> > + if (!iommu_attach_group(d->domain,
> > + group->iommu_group)) {
> > + list_add(&group->next, &d->group_list);
> > + iommu_domain_free(domain->domain);
> > + kfree(domain);
> > + goto done;
> > + }
> >
> > - ret = iommu_attach_group(domain->domain,
> > - group->iommu_group);
> > - if (ret)
> > - goto out_domain;
> > + ret = iommu_attach_group(domain->domain,
> > + group->iommu_group);
> > + if (ret)
> > + goto out_domain;
> > + }
> > }
> > }
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index ec289c1..c1330ed 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -123,6 +123,7 @@ struct iommu_domain {
> > int users;
> > };
> > };
> > + int nid;
> > };
> >
> > static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>
> As I understand what's being done here, we're duplicating
> dmar_domain.nid to iommu_domain.nid, then when enabled by this new
> module option, we'll use this node id as part of the match to determine
> whether to create a new domain within the same container context or
> re-use an existing domain, which may have non-favorably locality.
>
> If we're going to implement a node id on the iommu_domain, it should
> replace the existing use of node id in the device specific structure
> and not simply duplicate it. This should also account for non-VT-d use
> cases as well, for example AMD IOMMU also has a nid field on their
> protection_domain structure. Alternatively this might be implemented
> through iommu_domain_ops so we could query the node association for a
> domain.
>
> I question whether we need this solution at all though. AIUI the
> initial domain is allocated in proximity to the initial group. The
> problem comes when the user asks to add an additional group into the
> same container. Another valid solution would be that the user
> recognizes that these groups are not within the same locality and
> creates a separate container for this group. In fact, if we're using
> QEMU here and created a q35 VM with vIOMMU, each device would have a
> separate address space and therefore a separate container and we'd
> already avoid the issue this patch tries to solve.
>
> Separate containers per QEMU AddressSpace are a requirement, but QEMU
> might also implement a policy to not re-use vfio containers between
> virtual nodes such that if each locality were mapped to separate PXBs
> with unique proximities, then simply reflecting the physical locality
> into the VM would be sufficient to avoid this non-optimal domain
> allocation placement.
>
> In any case, the type1 vfio IOMMU backend is in the early stages of
> deprecation, so any choices we make here would also need to be reflected
> in IOMMUFD, both in the compatibility and native interfaces. Thanks,
>
> Alex
>