Re: [RFC PATCH 6/6] iommu/amd: Introduce nested translation support

From: Jason Gunthorpe
Date: Fri Jan 05 2024 - 09:32:14 EST


On Fri, Jan 05, 2024 at 08:38:47PM +0700, Suthikulpanit, Suravee wrote:
> > > + if (!pdev)
> > > + return -EINVAL;
> > > +
> > > + /* Note: Currently only support GCR3TRPMode with nested translation */
> > > + if (!check_feature2(FEATURE_GCR3TRPMODE))
> > > + return -EOPNOTSUPP;
> > > +
> > > + ret = amd_iommu_set_gcr3tbl_trp(iommu, pdev, hwpt->gcr3, hwpt->glx,
> > > + hwpt->guest_paging_mode);
> >
> > Waah?
> >
> > This is touching the dev table? That is not right, allocation is only
> > *ALLOCATION*. The dev table can't be changed until you do attachment.
>
> My understanding is QEMU should call:
>
> 1. iommufd_backend_get_ioas()
> -> ioctl(IOMMU_IOAS_ALLOC)
>
> 2. For parent domain
> iommufd_backend_alloc_hwpt(IOMMU_HWPT_ALLOC_NEST_PARENT)
> -> ioctl( IOMMU_HWPT_ALLOC)
> --- in Linux ---
> ....
> -> iommufd_hwpt_paging_alloc()
> -> struct iommu_ops.domain_alloc_user(IOMMU_HWPT_ALLOC_NEST_PARENT)
>
> 3. For parent domain
> iommufd_device_attach_hwpt()
> -> vfio_device_attach_container()
> -> ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT)
> --- in Linux ---
> vfio_iommufd_physical_attach_ioas()
> -> iommufd_device_attach()
> -> iommufd_device_do_attach()
> -> iommufd_hw_pagetable_attach()
>
> 4. Same as (2) but for child domain w/ stage 1 table.
>
> 5. Same as (3) but for child domain w/ stage 1 table.

Yes, but understand it is not API that the driver can depend on that
order. #3 can be skipped and there can be multiple parents and
everything must still work.

> You want the gcr3 table root point in the DTE to be update at step 5 only,
> right? If so, this should be okay.

#3 sets the DTE to point to just the parent domain as a V1 page table

#4/5 sets the DTE to point to both the V1 page table and the GCR3 table

The DTE is calculated based only on the *currently* attached
iommu_domain's for the struct device.

> > Please look at the smmuv3 patches and try to be structurally
> > similar. AMD and SMMUv3 are *very similar* in how their HW works
> > excluding the viommu stuff.
>
> Could you please point me to the series? I found one but it was really old.
> I might have missed the latest stuff.

https://lore.kernel.org/linux-iommu/0-v1-e289ca9121be+2be-smmuv3_newapi_p1_jgg@xxxxxxxxxx/
https://lore.kernel.org/linux-iommu/0-v2-16665a652079+5947-smmuv3_newapi_p2_jgg@xxxxxxxxxx/

(and the github link has the other parts)

> > You also can't assume your parent is currently attached to anything.
> >
> > The construction of the DTE has to be from-scratch based on the parent
> > domain and the provided values in the "hwpt". Again see how smmuv3
> > does this where there is one function that builds the entire DTE
> > (called STE)
>
> Ok. I'll program fields of the DTE, which are related to DMA-remapping with
> v1 and v2 table using the information in the parent and child domains only.

The smmu code looks like:

static void arm_smmu_make_nested_domain_ste(
struct arm_smmu_ste *target, struct arm_smmu_master *master,
struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
{
[..]
// Incorporate the "v1 fields" into the "DTE"
arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
ats_enabled);

// Incorporate the "GCR3 fields" into the "DTE"
target->data[0] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_0_CFG,
STRTAB_STE_0_CFG_NESTED)) |
nested_domain->ste[0];

Where "arm_smmu_ste *" is stack memory that holds the "DTE" being
constructed. Once the make* family of function figure out the exact
STE that the struct device needs, another function writes the STE to
the HW visible location(s).

The goal is to be able to calculate the required DTE for the #3 and
#4/5 cases you list above directly without making assumptions about
the current state of the DTE or anything else.

> A device can be attached to a domain, which could be either one-level or
> nested domain. In case of nesting, the parent domain contains information
> for the stage2 (v1) table, while the child domain contains information for
> the stage1 (v2) table. For each domain, we need to keep track of the parent
> domain.

The nesting domain, and only the nesting domain, stores a pointer to
it's parent domain - they are a unit together.

Jason