Re: [RFC PATCH 0/3] efi: Implement generic zboot support

From: Ard Biesheuvel
Date: Wed May 03 2023 - 14:13:55 EST


On Wed, 3 May 2023 at 19:55, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Fri, Apr 21, 2023, at 6:41 AM, Ard Biesheuvel wrote:
> > On Fri, 21 Apr 2023 at 15:30, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On Sun, Apr 16, 2023, at 5:07 AM, Ard Biesheuvel wrote:
> >> > This series is a proof-of-concept that implements support for the EFI
> >> > zboot decompressor for x86. It replaces the ordinary decompressor, and
> >> > instead, performs the decompression, KASLR randomization and the 4/5
> >> > level paging switch while running in the execution context of EFI.
> >>
> >> I like the concept. A couple high-level questions, since I haven’t dug into the code:
> >>
> >> Could zboot and bzImage be built into the same kernel image? That would get this into distros, and eventually someone could modify the legacy path to switch to long mode and invoke zboot (because zboot surely doesn’t need actual UEFI — just a sensible environment like what UEFI provides.)
> >>
> >
> > That's an interesting question, and to some extent, that is actually
> > what Evgeny's patch does: execute more of what the decompressor does
> > from inside the EFI runtime context.
> >
> > The main win with zboot imho is that we get rid of all the funky
> > heuristics that look for usable memory for trampolines and
> > decompression buffers in funky ways, and instead, just use the EFI
> > APIs for allocating pages and remapping them executable as needed
> > (which is the important piece here) I'd have to think about whether
> > there is any middle ground between this approach and Evgeny's - I'll
> > have to get back to you on that.
> >
>
> Hmm. I dug the tiniest bit into the history. The x86/boot/compressed stuff has an allocator! It's this:
>
> free_mem_ptr = heap; /* Heap */
> free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
>
> plus a trivial and horrible malloc() implementation in include/linux/decompress/mm.h. There's one caller in x86/boot/compressed.
>
> And, once upon a time, the idea of allocating enough memory to store the kernel from the decompressor would have been a problem. I'm willing to claim that we should not even try to support x86 systems that have that little memory (at least not once they've gotten long mode or at least flat 32-bit protected mode working). We should not try to allocate below 1MB (my laptop will cry), but there's no need for that.
>
> So maybe the middle ground is to build a modern, simple malloc(), and back it by EFI when EFI is there and by just finding some free memory when EFI is not there?
>
> This would be risky -- someone might have a horrible machine that has trouble with a simple allocator.

The malloc() is the least or our concerns, tbh. It is only used by the
decompression library itself, and it is backed by a statically
allocated block of BSS.

Having just gone through this again, the main issues are:

1) The 4/5 level switching trampoline, which runs in the page tables
of the loader/EFI stub, and assumes that it is fine to grab a random
chunk of low memory, stash its contents somewhere, use it for some
code, a stack and a root level page table so we can do the x86 long
mode paging off/paging on salsa, and then copy back the contents and
carry on as if nothing happened. We currently have some code in the
stub that strips all NX restrictions from a generously overdimensioned
block of low memory so copying and running code like this actually
works.

2) We need an accurate description in the PE/COFF header of what needs
to be executable and what needs to be writable, so we can splt the
regions. This only matters for code that runs under EFI's mappings, so
not a lot.

3) The payload relocates itself to the end of the decompression buffer
so it doesn't overwrite itself before completing. This is fragile and
also unnecessary when there is a page allocator and plenty of memory,
but afaict, this all executes under the decompressor's own page tables
so the RO/NX attributes that EFI sets are not a concern here. It
would, of course, be nice if we could avoid relying on RWX mappings
here.

4) I think it was you who pointed out that the demand paging 1:1 map
should really only get triggered for data accesses and not code
accesses, so it would be nice if we could create such mappings with NX
attributes.

I've had another go at running the decompressed kernel directly,
without going through the decompressor logic at all, but I missed the
fact that SEV does a substantial amount of work in the decompressor
too, so I'm no longer convinced that this is a viable approach. But
I'm looking into this.

I just finished some patches [0] that only address 1), based on the
work I posted earlier. I'll send those out once -rc1 comes around.


[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-x86-cleanup-la57