Re: [PATCH v18 060/121] KVM: TDX: TDP MMU TDX support

From: Binbin Wu
Date: Tue Jan 30 2024 - 10:34:06 EST




On 1/23/2024 7:53 AM, isaku.yamahata@xxxxxxxxx wrote:
From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>

Implement hooks of TDP MMU for TDX backend. TLB flush, TLB shootdown,
propagating the change private EPT entry to Secure EPT and freeing Secure
EPT page. TLB flush handles both shared EPT and private EPT. It flushes
shared EPT same as VMX. It also waits for the TDX TLB shootdown. For the
hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
EPT so that the page can be freed to OS.

Propagate the entry change to Secure EPT. The possible entry changes are
present -> non-present(zapping) and non-present -> present(population). On
population just link the Secure EPT page or the private guest page to the
Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
zapping/population, zapping requires synchronous TLB shoot down with the
frozen EPT entry. It zaps the secure entry, increments TLB counter, sends
IPI to remote vcpus to trigger TLB flush, and then unlinks the private
guest page from the Secure EPT. For simplicity, batched zapping with
exclude lock is handled as concurrent zapping. Although it's inefficient,
it can be optimized in the future.

For MMIO SPTE, the spte value changes as follows.
initial value (suppress VE bit is set)
-> Guest issues MMIO and triggers EPT violation
-> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
-> Guest MMIO resumes. It triggers VE exception in guest TD
-> Guest VE handler issues TDG.VP.VMCALL<MMIO>
-> KVM handles MMIO
-> Guest VE handler resumes its execution after MMIO instruction

Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>

---
v18:
- rename tdx_sept_page_aug() -> tdx_mem_page_aug()
- checkpatch: space => tab

v15 -> v16:
- Add the handling of TD_ATTR_SEPT_VE_DISABLE case.

v14 -> v15:
- Implemented tdx_flush_tlb_current()
- Removed unnecessary invept in tdx_flush_tlb(). It was carry over
from the very old code base.
---
arch/x86/kvm/mmu/spte.c | 3 +-
arch/x86/kvm/vmx/main.c | 71 +++++++-
arch/x86/kvm/vmx/tdx.c | 342 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx.h | 2 +-
arch/x86/kvm/vmx/tdx_ops.h | 6 +
arch/x86/kvm/vmx/x86_ops.h | 6 +
6 files changed, 424 insertions(+), 6 deletions(-)
[...]
+
+/*
+ * TLB shoot down procedure:
+ * There is a global epoch counter and each vcpu has local epoch counter.
+ * - TDH.MEM.RANGE.BLOCK(TDR. level, range) on one vcpu
+ * This blocks the subsequenct creation of TLB translation on that range.
+ * This corresponds to clear the present bit(all RXW) in EPT entry
+ * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
+ * - IPI to remote vcpus
+ * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
+ * - On re-entry, TDX module compares the local epoch counter with the global
+ * epoch counter. If the local epoch counter is older than the global epoch
+ * counter, update the local epoch counter and flushes TLB.
+ */
+static void tdx_track(struct kvm *kvm)
+{
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ u64 err;
+
+ KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
+ /* If TD isn't finalized, it's before any vcpu running. */
+ if (unlikely(!is_td_finalized(kvm_tdx)))
+ return;
+
+ /*
+ * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
+ * the counter. The counter is used instead of bool because multiple
+ * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
+ */
+ atomic_inc(&kvm_tdx->tdh_mem_track);
+ /*
+ * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
+ * KVM_REQUEST_WAIT.
+ */
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+ do {
+ /*
+ * kvm_flush_remote_tlbs() doesn't allow to return error and
+ * retry.
+ */
+ err = tdh_mem_track(kvm_tdx->tdr_pa);
+ } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));

Why the sequence of the code is different from the description of the function.
In the description, do the TDH.MEM.TRACK before IPIs.
But in the code, do TDH.MEM.TRACK after IPIs?


+
+ /* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
+ atomic_dec(&kvm_tdx->tdh_mem_track);
+
+ if (KVM_BUG_ON(err, kvm))
+ pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+}
+
[...]