Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

From: Isaku Yamahata
Date: Thu Mar 28 2024 - 01:34:44 EST


On Thu, Mar 28, 2024 at 02:49:56PM +1300,
"Huang, Kai" <kai.huang@xxxxxxxxx> wrote:

>
>
> On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> > On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> > "Huang, Kai" <kai.huang@xxxxxxxxx> wrote:
> >
> > > ... continue the previous review ...
> > >
> > > > +
> > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > > > +{
> > > > + WARN_ON_ONCE(!td_page_pa);
> > >
> > > From the name 'td_page_pa' we cannot tell whether it is a control page, but
> > > this function is only intended for control page AFAICT, so perhaps a more
> > > specific name.
> > >
> > > > +
> > > > + /*
> > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > >
> > > "are" -> "is".
> > >
> > > Are you sure it is TDCX, but not TDCS?
> > >
> > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> > > the control structure for the VM.
> >
> > TDCS, TDVPR, and TDCX. Will update the comment.
>
> But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?

So I'll have the patch that frees TDVPR and TDCX update this comment.


> Otherwise you will have to explain them.
>
> [...]
>
> > > > +
> > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > +{
> > > > + bool packages_allocated, targets_allocated;
> > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > + cpumask_var_t packages, targets;
> > > > + u64 err;
> > > > + int i;
> > > > +
> > > > + if (!is_hkid_assigned(kvm_tdx))
> > > > + return;
> > > > +
> > > > + if (!is_td_created(kvm_tdx)) {
> > > > + tdx_hkid_free(kvm_tdx);
> > > > + return;
> > > > + }
> > >
> > > I lost tracking what does "td_created()" mean.
> > >
> > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
> > > programmed/configured.
> > >
> > > Perhaps add a comment to remind the reviewer?
> >
> > As Chao suggested, will introduce state machine for vm and vcpu.
> >
> > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
>
> Could you elaborate what will the state machine look like?
>
> I need to understand it.

Not yet. Chao only proposed introducing a state machine. Right now it's just an
idea.


> > How about this?
> >
> > /*
> > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > * TDH.MNG.KEY.FREEID() to free the HKID.
> > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > * present transient state of HKID.
> > */
>
> Could you elaborate why it is still possible to have other thread removing
> pages from TD?
>
> I am probably missing something, but the thing I don't understand is why
> this function is triggered by MMU release? All the things done in this
> function don't seem to be related to MMU at all.

KVM releases EPT pages on MMU notifier release; kvm_mmu_zap_all() does that. If
we follow that path, kvm_mmu_zap_all() zaps all the Secure-EPT entries with
TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free the HKID before
kvm_mmu_zap_all() so that TDH.PHYMEM.PAGE.RECLAIM() can be used instead.
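
To illustrate the idea, here is a rough userspace model, not the patch code.
struct td_model, td_page_teardown_op() and td_release_hkid() are names made up
for this sketch, and the SEAMCALLs are only represented as strings. It shows
that the operation needed to tear down a private page depends on whether the
HKID is still assigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a TD; this is not the kernel's struct kvm_tdx. */
struct td_model {
	bool hkid_assigned;
};

/* Return the SEAMCALL a private page teardown has to use in this state. */
const char *td_page_teardown_op(const struct td_model *td)
{
	assert(td != NULL);
	if (td->hkid_assigned)
		return "TDH.MEM.PAGE.REMOVE";	/* slow path, HKID still assigned */
	return "TDH.PHYMEM.PAGE.RECLAIM";	/* HKID already freed */
}

/*
 * Model of freeing the HKID. In the real flow this stands for
 * TDH.MNG.VPFLUSHDONE, TDH.PHYMEM.CACHE.WB and TDH.MNG.KEY.FREEID.
 */
void td_release_hkid(struct td_model *td)
{
	assert(td != NULL);
	if (!td->hkid_assigned)
		return;
	td->hkid_assigned = false;
}
```

If td_release_hkid() runs before the zap, every later teardown in this model
takes the RECLAIM branch, which is the ordering we would like to have.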


> IIUC, by reaching here, you must already have done VPFLUSHDONE, which should
> be called when you free vcpu?

Not necessarily.


> Freeing vcpus is done in
> kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
> this tdx_mmu_release_keyid() is called?

guest_memfd complicates things. The race is between guest_memfd release and MMU
notifier release. kvm_arch_destroy_vm() is called only after all kvm fds,
including guest_memfd, have been closed.

Here is an example. Let's say we have fds for vhost, guest_memfd, kvm vcpu,
and kvm vm, and the process is exiting. Note that vhost takes a reference on
the mm to access guest (shared) memory.

exit_mmap():
Usually the mmu notifier release fires here, but not yet in this case, because
vhost still holds its reference.

exit_files()
Close the vhost fd. vhost starts a timer to issue mmput().

Close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range(),
which eventually calls TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE().
This takes time because it processes the whole guest memory. It calls
kvm_put_kvm() at the end.

While the unmapping on behalf of guest_memfd is still running, the vhost timer
fires and calls mmput(). That triggers the mmu notifier release.

Close the kvm vcpu and vm fds. They call kvm_put_kvm(). The last one calls
kvm_destroy_vm().

Ideally the HKID is freed first for efficiency, but KVM has no control over the
order in which the fds are closed.
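
The effect of the ordering can be sketched with a small userspace model. This
is only an illustration under my assumptions, not kernel code:
simulate_teardown(), struct teardown_stats and the release_after parameter are
hypothetical. release_after stands for the point at which the vhost timer
fires and the mmu notifier release frees the HKID while guest_memfd is still
unmapping pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters: how many pages went through each path. */
struct teardown_stats {
	unsigned long removed;		/* HKID assigned: REMOVE path */
	unsigned long reclaimed;	/* HKID freed: RECLAIM path */
};

/*
 * Tear down nr_pages pages one by one. The HKID is freed just before
 * page number release_after is processed. If release_after >= nr_pages,
 * the HKID stays assigned for the whole unmap.
 */
struct teardown_stats simulate_teardown(unsigned long nr_pages,
					unsigned long release_after)
{
	struct teardown_stats stats = { 0, 0 };
	bool hkid_assigned = true;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		if (i == release_after)
			hkid_assigned = false;	/* mmu notifier release ran */

		if (hkid_assigned)
			stats.removed++;
		else
			stats.reclaimed++;
	}

	/* Every page is accounted for exactly once. */
	assert(stats.removed + stats.reclaimed == nr_pages);
	return stats;
}
```

With release_after == 0 the HKID is freed first and all pages take the RECLAIM
path. With a late release, the pages already processed have gone through the
slower REMOVE path, and in this model the rest switch to RECLAIM.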


> But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?

Not necessarily.
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>