Re: [PATCH 0/6] efi/x86: Avoid legacy decompressor during EFI boot

From: Ard Biesheuvel
Date: Wed May 03 2023 - 14:17:47 EST


On Wed, 3 May 2023 at 19:58, Tom Lendacky <thomas.lendacky@xxxxxxx> wrote:
>
> On 5/2/23 11:08, Tom Lendacky wrote:
> > On 5/2/23 08:39, Ard Biesheuvel wrote:
> >> On Tue, 2 May 2023 at 15:37, Tom Lendacky <thomas.lendacky@xxxxxxx> wrote:
> >>>
> >>> On 4/24/23 11:57, Ard Biesheuvel wrote:
> >>>> This series is conceptually a combination of Evgeny's series [0] and
> >>>> mine [1], both of which attempt to make the early decompressor code more
> >>>> amenable to executing in the EFI environment with stricter handling of
> >>>> memory permissions.
> >>>>
> >>>> My series [1] implemented zboot for x86, by getting rid of the entire
> >>>> x86 decompressor, and replacing it with existing EFI code that does the
> >>>> same but in a generic way. The downside of this is that only EFI boot is
> >>>> supported, making it unviable for distros, which need to support BIOS
> >>>> boot and hybrid EFI boot modes that omit the EFI stub.
> >>>>
> >>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
> >>>> it to execute in the EFI context as well as the bare metal context, and
> >>>> this involves changes to the 1:1 mapping code and the page fault
> >>>> handlers etc, none of which are really needed when doing EFI boot in the
> >>>> first place.
> >>>>
> >>>> So this series attempts to occupy the middle ground here: it makes
> >>>> minimal changes to the existing decompressor so some of it can be called
> >>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> >>>> the kernel and boot it directly, without relying on the trampoline code,
> >>>> page table code or page fault handling code. This allows us to get rid
> >>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
> >>>> invocations of the EFI firmware APIs to clear NX restrictions from
> >>>> allocations that have been populated with executable code.
> >>>>
> >>>> The only code that is being reused is the decompression library itself,
> >>>> along with the minimal ELF parsing that is required to copy the ELF
> >>>> segments in place, and the relocation processing that fixes up absolute
> >>>> symbol references to refer to the correct virtual addresses.
> >>>>
> >>>> Note that some of Evgeny's changes to clean up the PE/COFF header
> >>>> generation will still be needed, but I've omitted those here for
> >>>> brevity.
> >>>
> >>> I tried booting an SEV and an SEV-ES guest using this and both failed
> >>> to boot:
> >>>
> >>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> >>> z_stream
> >>>
> >>> I'll have to take a closer look as to why, but it might be a couple of
> >>> days before I can get to it.
> >>>
> >>
> >> Thanks Tom.
> >>
> >> The internal malloc() seems to be failing, which is often caused by
> >> BSS clearing problems. Could you elaborate a little bit on the boot
> >> environment you are using here?
> >
> > I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> > host/hypervisor and guest kernel and the current OVMF tree built using
> > OvmfPkgX64.dsc.
> >
> > I was originally using the current merge window Linux, but moved to the
> > release version just to . With the release version SEV and SEV-ES still
> > fail to
> > boot, but SEV actually #GPs now. And some of the register contents look
> > like encrypted data:
> >
> > ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> > !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
> > 00000000 !!!!
> > ExceptionData - 0000000000000000
> > RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
> > RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> > RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> > RSI - 0000000001000000, RDI - 1FBA02A45943DE68
> > R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
> > R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> > R14 - 0000000001000000, R15 - 000000007E0A5198
> > DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
> > GS - 0000000000000030, SS - 0000000000000030
> > CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> > CR4 - 0000000000000668, CR8 - 0000000000000000
> > DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> > DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> > GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> > IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
> > FXSAVE_STATE - 000000007FD96F80
> > !!!! Find image based on IP(0x597E71C1)
> > /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> > RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
> >
> > So, yes, probably an area of memory that was zeroes when mapped
> > unencrypted, but wasn't cleared after changing the mapping to
> > encrypted.
>
> Yes, looks like a bss clearing issue. If I add __section(".data") to
> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
> boot an SEV guest.
>
> However, an SEV-ES guest is triple faulting. This looks to be because
> we're still on the EFI CS of 0x38 after switching GDTs in
> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
> and things blow up on the iretq. Moving the block headed by the comment
> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
>
> This worked before because I believe we switched off the EFI CS as part of
> the kernel decompressor support and so this bug wasn't exposed. But this
> needs to be fixed regardless of this series.
>

Very interesting. I was under the assumption that everything that goes
on in sev_enable() in the decompressor would be rather indispensable
to boot in SEV mode (which I only spotted today) so I am quite
surprised that things just appear to work. (There is some 32-bit SEV
code in the decompressor as well that obviously never gets called when
booting in long mode, but I hadn't quite grasped how much other SEV
code there actually is)