Re: [RFC PATCH] KVM: x86: inhibit APICv upon detecting direct APIC access from L2

From: Ake Koomsin
Date: Tue Aug 08 2023 - 12:11:32 EST


On Mon, 07 Aug 2023 17:00:58 +0300
Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:

> Is there a good reason why KVM doesn't expose APIC memslot to a
> nested guest? While nested guest runs, the L1's APICv is "inhibited"
> effectively anyway, so writes to this memslot should update APIC
> registers and be picked up by APICv hardware when L1 resumes
> execution.
>
> Since APICv allows itself to be inhibited due to other reasons, it
> means that just like AVIC, it should be able to pick up arbitrary
> changes to APIC registers which happened while it was inhibited, just
> like AVIC does.
>
> I'll take a look at the code to see if APICv does this (I know AVIC's
> code much better than APICv's)
>
> Is there a reproducer for this bug?
>
> Best regards,
> Maxim Levitsky

From reading old commits (3a2936dedd20 and 1313cc2bd8f6), my interpretation is
that the current KVM implementation does not expect direct APIC access from L2
guests. I assume there are some challenging implementation issues.

To reproduce the problem, we need to run a micro hypervisor named BitVisor on
KVM. This hypervisor, when running on a real machine, lets its guest access the
physical APIC directly. As BitVisor is intended to run on real machines, when
running under KVM it conceals all KVM-related features reported through CPUID.
The L2 guest will initialize and run as if it were on a physical machine. We
also need an Intel machine that supports APICv. (I tested on a 13th-generation
Intel machine. The problem should also be reproducible on a 12th-generation
one.) BitVisor's current SVM implementation always monitors MMIO access, so the
problem cannot be reproduced with it.

The BitVisor VMX implementation, when running under UEFI, hooks APIC access by
default during initialization. The purpose of this hook is to bootstrap the AP
processors during UEFI ExitBootServices. When booting a guest OS, the firmware
sends an INIT signal during ExitBootServices. BitVisor then bootstraps the AP
processors, puts them into guest mode, and unhooks APIC access. After this, the
guest can access APIC memory directly.

As far as I understand the KVM implementation, while BitVisor still hooks APIC
access, an EPT_VIOLATION occurs when the L2 guest accesses the APIC page. The
EPT_VIOLATION is then forwarded to BitVisor, which eventually accesses the APIC
on behalf of the L2 guest. In this case, APICv works properly because the access
comes from L1. After BitVisor unhooks the APIC page, the first access to the
APIC from the L2 guest goes to the EPT_VIOLATION handling path. This path marks
the APIC page with a reserved flag and eventually causes the access to be
retried. Subsequent accesses are then handled in the EPT_MISCONFIG path, which
emulates the MMIO access. Interrupts seem to disappear after this.
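
To make the sequence above concrete, here is a toy Python model of it. It only
mirrors my understanding of the flow. The class and function names are made up
for illustration and do not correspond to any KVM or BitVisor code.

```python
from enum import Enum


class Path(Enum):
    """Where an L2 access to the APIC page ends up (toy labels only)."""
    FORWARDED_TO_L1 = "EPT_VIOLATION forwarded to BitVisor (L1)"
    EPT_VIOLATION = "EPT_VIOLATION handled by KVM, APIC page marked reserved"
    EPT_MISCONFIG = "EPT_MISCONFIG handled by KVM, MMIO access emulated"


class ApicPageModel:
    """Hypothetical model of the APIC page state as seen from the L2 guest."""

    def __init__(self):
        self.hooked_by_l1 = True      # BitVisor hooks the page during init
        self.marked_reserved = False  # set by the first KVM-handled violation

    def unhook(self):
        """BitVisor stops intercepting the APIC page."""
        self.hooked_by_l1 = False

    def l2_access(self):
        """Return the path taken by one L2 access to the APIC page."""
        if self.hooked_by_l1:
            return Path.FORWARDED_TO_L1
        if not self.marked_reserved:
            self.marked_reserved = True
            return Path.EPT_VIOLATION
        return Path.EPT_MISCONFIG


def trace(before_unhook, after_unhook):
    """Collect the paths taken by accesses before and after the unhook."""
    model = ApicPageModel()
    paths = [model.l2_access() for _ in range(before_unhook)]
    model.unhook()
    paths += [model.l2_access() for _ in range(after_unhook)]
    return paths


if __name__ == "__main__":
    for path in trace(before_unhook=1, after_unhook=3):
        print(path.value)
```

In this model, the missing interrupts show up only once accesses reach the last
state, which is the part I would like to understand better.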

Here are the steps to reproduce the problem.

1) hg clone http://hg.code.sf.net/p/bitvisor/code bitvisor-code

2) Enter the cloned directory and type 'make' (No need to worry about warnings
at the moment. The default configuration is good enough to reproduce the
problem). We now have bitvisor.elf after the compilation.

3) Enter boot/uefi-boot, and type 'make' to compile the UEFI bootloader. We
need mingw for this. We now have loadvmm.efi after the compilation.

4) Put bitvisor.elf and loadvmm.efi together in a folder. The folder
is going to look like the following:
~/x86_test
├── bitvisor.elf
└── loadvmm.efi

5) Run the following qemu command. Replace UEFI firmware path and other
parameters as you prefer. Make sure -smp 2 is there. Otherwise, there will be
no INIT signal during UEFI ExitBootServices. (I use QEMU 8.0.3)

qemu-system-x86_64 -cpu host -enable-kvm -bios /usr/share/edk2-ovmf/OVMF_CODE.fd \
-drive file=fat:rw:~/x86_test/,format=raw \
-cdrom ~/Downloads/Fedora-Workstation-Live-x86_64-38-1.6.iso \
-M q35 -m 8192 -smp 2 -serial stdio

6) During the launch, enter the BIOS configuration by hitting the Esc key
repeatedly. Then, select 'Boot Manager' and choose 'EFI Internal Shell' to enter
the UEFI shell.

7) The directory we specify in the command should be at fs0. Type 'fs0:' in
the shell.

8) Type 'loadvmm.efi' to load BitVisor. We should see the following messages:

Loading ...............................................................
Starting BitVisor...
Copyright (c) 2007, 2008 University of Tsukuba
All rights reserved.
ACPI DMAR not found.
FACS address 0x7FBDD000
Module not found.
Processor 0 (BSP)
ooooooooooooooooooooooooooooooooooooooooooooooooooo
...
MCFG [0] 0000:00-FF (B0000000,10000000)
Starting a virtual machine.

9) We should now return to the shell. Right now we are running as an L2 guest.

10) Next, boot Linux from the live CD or by your preferred method. We can see
the panic "panic - not syncing: IO-APIC + timer doesn't work!". The panic can
be reproduced quite easily. Even if the boot happens to pass the timer check, or
you specify the 'no_timer_check' boot parameter, it will stall during SMP
bringup.

The idea of steps 6 to 10 is to start BitVisor first, and then start Linux on
top of it. You can adjust the steps as you like. Feel free to ask me anything
about reproducing the problem with BitVisor if the given steps are not
sufficient.
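
If it helps, steps 4 and 5 can be sketched as a small Python helper that checks
the folder and assembles the QEMU arguments. The function names and defaults are
my own for illustration. The arguments are the same as in the command from
step 5, and the firmware and ISO paths are whatever you use locally.

```python
import os


def missing_files(stage_dir, names=("bitvisor.elf", "loadvmm.efi")):
    """Return the expected files that are absent from stage_dir."""
    stage_dir = os.path.expanduser(stage_dir)
    return [n for n in names
            if not os.path.isfile(os.path.join(stage_dir, n))]


def qemu_argv(stage_dir, iso, firmware, smp=2, mem_mib=8192):
    """Build the argument list for the QEMU command from step 5."""
    if smp < 2:
        # With a single vCPU there is no INIT signal during ExitBootServices.
        raise ValueError("need at least 2 vCPUs to reproduce the problem")
    stage_dir = os.path.expanduser(stage_dir).rstrip("/")
    return [
        "qemu-system-x86_64", "-cpu", "host", "-enable-kvm",
        "-bios", firmware,
        "-drive", f"file=fat:rw:{stage_dir}/,format=raw",
        "-cdrom", os.path.expanduser(iso),
        "-M", "q35", "-m", str(mem_mib), "-smp", str(smp),
        "-serial", "stdio",
    ]


if __name__ == "__main__":
    print("missing:", missing_files("~/x86_test"))
    print(" ".join(qemu_argv(
        "~/x86_test",
        "~/Downloads/Fedora-Workstation-Live-x86_64-38-1.6.iso",
        "/usr/share/edk2-ovmf/OVMF_CODE.fd",
    )))
```

The resulting list can be passed to subprocess.run() once missing_files()
returns an empty list.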

The problem does not happen when enable_apicv=N. Note that SMP bringup with
enable_apicv=N can still fail. That is a separate problem, and we don't have to
worry about it for now. Linux seems to have no delay between INIT DEASSERT and
SIPI during its SMP bringup. This can easily make INIT and SIPI pending at the
same time, resulting in a lost signal.
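
To confirm which setting is active on the host, the kvm_intel module parameter
can be read from sysfs. Below is a minimal Python sketch. The helper names are
mine, and it assumes the parameter is exposed at the usual
/sys/module/kvm_intel/parameters/enable_apicv path.

```python
APICV_PARAM = "/sys/module/kvm_intel/parameters/enable_apicv"


def parse_bool_param(text):
    """Parse a boolean module parameter as shown in sysfs ('Y' or 'N')."""
    value = text.strip().upper()
    if value in ("Y", "1"):
        return True
    if value in ("N", "0"):
        return False
    raise ValueError(f"unexpected parameter value: {text!r}")


def read_enable_apicv(path=APICV_PARAM):
    """Return True/False for enable_apicv, or None if the file is absent."""
    try:
        with open(path) as f:
            return parse_bool_param(f.read())
    except FileNotFoundError:
        # kvm_intel is not loaded, or this is not an Intel host.
        return None


if __name__ == "__main__":
    print("enable_apicv:", read_enable_apicv())
```

The parameter is set when loading the module, for example with
'modprobe kvm_intel enable_apicv=N' after unloading kvm_intel.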

I admit that my knowledge of KVM and APICv is very limited, and I may have
misunderstood the problem. If you don't mind, could you point me to the code
paths I should pay attention to? I would love to learn how to find the actual
cause of the problem.


Best Regards
Ake Koomsin