Re: [RFC PATCH] kvm: nv: Optimize the unmapping of shadow S2-MMU tables.

From: Marc Zyngier
Date: Tue Mar 05 2024 - 10:03:34 EST


On Tue, 05 Mar 2024 13:29:08 +0000,
Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 05-03-2024 04:43 pm, Marc Zyngier wrote:
> > [re-sending with kvmarm@ fixed]
> >
> > On Tue, 05 Mar 2024 05:46:06 +0000,
> > Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> As per 'commit 178a6915434c ("KVM: arm64: nv: Unmap/flush shadow stage 2
> >
> > $ git describe --contains 178a6915434c --match=v\*
> > fatal: cannot describe '178a6915434c141edefd116b8da3d55555ea3e63'
> >
>
> My bad (I should have been more verbose); I missed mentioning that
> this patch is on top of the NV-V11 patch series.
>
> > This commit simply doesn't exist upstream. It only lives in a
> > now deprecated branch that will never be merged.
> >
> >> page tables")', when ever there is unmap of pages that
> >> are mapped to L1, they are invalidated from both L1 S2-MMU and from
> >> all the active shadow/L2 S2-MMU tables. Since there is no mapping
> >> to invalidate the IPAs of Shadow S2 to a page, there is a complete
> >> S2-MMU page table walk and invalidation is done covering complete
> >> address space allocated to a L2. This has performance impacts and
> >> even soft lockup for NV(L1 and L2) boots with higher number of
> >> CPUs and large Memory.
> >>
> >> Add a lookup table mapping each Shadow IPA to its Canonical IPA
> >> whenever a page is mapped into any of the L2s. When a page is
> >> unmapped, this lookup lets us unmap it only if it is actually mapped
> >> in one of the shadow S2-MMU tables, avoiding unnecessarily long
> >> iterations of S2-MMU table walk-through and invalidation over the
> >> complete address space.
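
For reference, the reverse map being described amounts to something
like the sketch below (illustrative names; an rbtree keyed on the
canonical IPA is one possible structure, not necessarily what the
posted series uses):

struct shadow_ipa_map {
	struct rb_node	node;
	phys_addr_t	ipa;		/* canonical (L1) IPA, the lookup key */
	phys_addr_t	shadow_ipa;	/* corresponding IPA in the shadow S2 */
	u64		size;
};

/* One tree per shadow S2 MMU, protected by the MMU lock */
static struct shadow_ipa_map *lookup_shadow_ipa(struct rb_root *root,
						phys_addr_t ipa)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct shadow_ipa_map *e =
			rb_entry(n, struct shadow_ipa_map, node);

		if (ipa < e->ipa)
			n = n->rb_left;
		else if (ipa >= e->ipa + e->size)
			n = n->rb_right;
		else
			return e;	/* ipa falls inside this entry */
	}

	return NULL;
}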
> >
> > All of this falls in the "premature optimisation" bucket. Why should
> > we bother with any of this when not even 'AT S1' works correctly,
>
> Hmm, I am not aware of this; is this a new issue in V11?

It's been there since v0. All we have is a trivial implementation that
doesn't survive the S1 page tables being swapped out. A full software
S1 PTW needs to be written.
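
Roughly, what needs writing is something like the sketch below (a very
crude single-regime walk assuming a 4KB granule and a 48-bit IA; real
code must honour TCR_EL1/SCTLR_EL1, TTBR1, permissions, and the fact
that every table access must itself be translated; all names here are
illustrative):

static int sketch_walk_s1(struct kvm_vcpu *vcpu, u64 va, phys_addr_t *ipa)
{
	u64 desc, table = __vcpu_sys_reg(vcpu, TTBR0_EL1) & GENMASK_ULL(47, 12);
	int level;

	for (level = 0; level <= 3; level++) {
		int shift = 39 - 9 * level;
		gpa_t gpa = table + ((va >> shift) & GENMASK_ULL(8, 0)) * 8;
		__le64 val;

		/* The S1 tables themselves live in guest memory */
		if (kvm_read_guest(vcpu->kvm, gpa, &val, sizeof(val)))
			return -EFAULT;
		desc = le64_to_cpu(val);

		if (!(desc & BIT(0)))
			return -ENOENT;		/* invalid descriptor */

		if (level == 3 && !(desc & BIT(1)))
			return -ENOENT;		/* reserved encoding */

		if (level == 3 || !(desc & BIT(1))) {
			/* Leaf: a page (level 3) or a block (level 1/2) */
			*ipa = (desc & GENMASK_ULL(47, shift)) |
			       (va & GENMASK_ULL(shift - 1, 0));
			return 0;
		}

		table = desc & GENMASK_ULL(47, 12);
	}

	return -ENOENT;
}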

>
> > making it trivial to prevent a guest from making forward progress? You
> > also show no numbers that would hint at a measurable improvement under
> > any particular workload.
>
> This patch avoids the long unmap iterations that were resulting in
> soft-lockups when L1 and L2 were tried with 192 cores.
> Isn't fixing a soft lockup a required fix for feature enablement?

No. All we care about is correctness, not performance. Addressing
soft-lockups is *definitely* a performance issue, which I'm 100% happy
to ignore.

[...]

> >> +static inline bool kvm_is_l1_using_shadow_s2(struct kvm_vcpu
> >> *vcpu)
> >> +{
> >> + return (vcpu->arch.hw_mmu != &vcpu->kvm->arch.mmu);
> >> +}
> >
> > Isn't that the very definition of "!is_hyp_ctxt()"? You are abusing
>
> "!in_hyp_ctxt()" isn't true for non-NV case also?

Surely you don't try to use this in non-NV contexts, right? Why would
you try to populate a shadow reverse-map outside of an NV context?

> This function was added to indicate that L1 is NV-enabled and using a shadow S2.
>
> > the hw_mmu pointer to derive something, but the source of truth is the
> > translation regime, as defined by HCR_EL2.{E2H,TGE} and PSTATE.M.
> >
>
> OK, I can try HCR_EL2.{E2H,TGE} and PSTATE.M instead of hw_mmu in the
> next version.

No. Use is_hyp_ctxt().
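
For reference, that check derives the translation regime from
PSTATE.M and HCR_EL2.{E2H,TGE}, never from which hw_mmu happens to be
loaded. A sketch (not the exact kernel code):

static bool sketch_is_hyp_ctxt(struct kvm_vcpu *vcpu)
{
	u64 hcr = __vcpu_sys_reg(vcpu, HCR_EL2);
	u64 mode = *vcpu_cpsr(vcpu) & PSR_MODE_MASK;

	/* Running at vEL2? */
	if (mode == PSR_MODE_EL2h || mode == PSR_MODE_EL2t)
		return true;

	/* vEL0 with E2H+TGE set is part of the EL2(&0) regime too */
	return mode == PSR_MODE_EL0t &&
	       (hcr & HCR_E2H) && (hcr & HCR_TGE);
}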

[...]

> >> index 61bdd8798f83..3948681426a0 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -1695,6 +1695,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> memcache,
> >> KVM_PGTABLE_WALK_HANDLE_FAULT |
> >> KVM_PGTABLE_WALK_SHARED);
> >> + if ((nested || kvm_is_l1_using_shadow_s2(vcpu)) && !ret) {
> >
> > I don't understand this condition. If nested is non-NULL, it's because
> > we're using a shadow S2. So why the additional condition?
>
> No, nested is set only for L2; it is not set for L1.
> I added this condition to handle the L1 shadow-S2 case.

But there is *no shadow* for L1 at all. The only way to get a shadow
is to be outside of the EL2(&0) translation regime. EL2(&0) itself is
always backed by the canonical S2. By definition, L1 does not run with
an S2 it is in control of. No S2, no shadow.
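
Schematically, the S2 selection looks like this (a sketch, with
lookup_shadow_s2() standing in for the series' actual lookup helper):

static struct kvm_s2_mmu *sketch_pick_s2(struct kvm_vcpu *vcpu)
{
	/* EL2(&0) regime: always the canonical S2 */
	if (is_hyp_ctxt(vcpu))
		return &vcpu->kvm->arch.mmu;

	/* vEL1/vEL0 under a vEL2: a shadow S2 */
	return lookup_shadow_s2(vcpu);
}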

[...]

> > What guarantees that the mapping you have for L1 has the same starting
> > address as the one you have for L2? L1 could have a 2MB mapping and L2
> > only 4kB *in the middle*.
>
> IIUC, when a page is mapped as a 2MB block in L1, it won't be mapped
> to L2 at that size; we iterate in steps of PAGE_SIZE, so we should
> hit the L2's IPA in the lookup table, provided the L2 page falls in
> the unmap range.

But then how do you handle the reverse (4kB at L1, 2MB at L2)? Without
tracking the *intersection*, this case fails to be detected correctly.
This is TLB matching 101.
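
What is missing is something along these lines (illustrative names,
not from the posted patch): given a recorded shadow mapping and a
canonical range being unmapped, only the overlap may be invalidated,
whatever the respective mapping sizes.

static bool shadow_overlap(phys_addr_t c_ipa, u64 c_size,
			   phys_addr_t start, phys_addr_t end,
			   phys_addr_t s_ipa,
			   phys_addr_t *out_ipa, u64 *out_size)
{
	phys_addr_t lo = max(c_ipa, start);
	phys_addr_t hi = min(c_ipa + c_size, end);

	if (lo >= hi)
		return false;		/* no intersection */

	/* Translate the overlapping part into the shadow IPA space */
	*out_ipa  = s_ipa + (lo - c_ipa);
	*out_size = hi - lo;

	return true;
}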

[...]

> >> + while (start < end) {
> >> + size = PAGE_SIZE;
> >> + /*
> >> + * Get the shadow IPA if the page is mapped
> >> + * to L1 and also mapped to any of the active L2s.
> >> + */
> >
> > Why is L1 relevant here?
>
> We do map pages into the shadow S2 while L1 boots (early stage); if
> an L1-mapped page is unmapped/migrated at that moment, we need to
> unmap it from L1's S2 table as well.

Sure. But you can also get a page that is mapped in L2 and not mapped
in the canonical S2, which is L1's. I increasingly feel that you have
a misconception of how L1 gets its pages mapped.

>
> >
> >> + ret = get_shadow_ipa(mmu, start, &shadow_ipa, &size);
> >> + if (ret)
> >> + kvm_unmap_stage2_range(mmu, shadow_ipa, size);
> >> + start += size;
> >> + }
> >> + }
> >> + }
> >> +}
> >> +
> >> /* expects kvm->mmu_lock to be held */
> >> void kvm_nested_s2_flush(struct kvm *kvm)
> >> {
> >
> > There are a bunch of worrying issues with this patch. But more
> > importantly, this looks like a waste of effort until the core issues
> > that NV still has are solved, and I will not consider anything of the
> > sort until then.
>
> OK, thanks for letting us know. I will pause work on V2 of this
> patch until then.
>
> >
> > I get the ugly feeling that you are trying to make it look as if it
> > was "production ready", which it won't be for another few years,
> > especially if the few interested people (such as you) are ignoring the
> > core issues in favour of marketing-driven features ("make it fast").
> >
>
> What are the core issues (please forgive me if you mentioned them
> already)? We will certainly prioritise them over this.

AT is a big one. Maintenance interrupts are more or less broken. I'm
slowly plugging PAuth, but there's no testing whatsoever (running
Linux doesn't count). Lack of SVE support is also definitely a
blocker.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.