Re: [RFC 11/14] x86: add support for Dynamic Kernel Stacks

From: Andy Lutomirski
Date: Mon Mar 11 2024 - 18:18:05 EST




On Mon, Mar 11, 2024, at 9:46 AM, Pasha Tatashin wrote:
> Add dynamic_stack_fault() calls to the kernel faults, and also declare
> HAVE_ARCH_DYNAMIC_STACK = y, so that dynamic kernel stacks can be
> enabled on x86 architecture.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/traps.c | 3 +++
> arch/x86/mm/fault.c | 3 +++
> 3 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..9bb0da3110fa 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -197,6 +197,7 @@ config X86
> select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
> select HAVE_ARCH_USERFAULTFD_MINOR if X86_64 && USERFAULTFD
> select HAVE_ARCH_VMAP_STACK if X86_64
> + select HAVE_ARCH_DYNAMIC_STACK if X86_64
> select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> select HAVE_ARCH_WITHIN_STACK_FRAMES
> select HAVE_ASM_MODVERSIONS
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c3b2f863acf0..cc05401e729f 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> }
> #endif
>
> + if (dynamic_stack_fault(current, address))
> + return;
> +

Sorry, but no, you can't necessarily do this. I say this as the person who wrote this code, and I justified my code on the basis that we are not recovering -- we're jumping out to a different context, and we won't crash if the origin context for the fault is corrupt. The SDM is really quite unambiguous about it: we're in an "abort" context, and returning is not allowed. And I think this may well be the real deal -- the microcode does not promise to have the return frame and the actual faulting context matched up here, and there is no architectural guarantee that returning will do the right thing.

Now we do have some history of getting a special exception, e.g. for espfix64. But espfix64 is a very special case, and the situation you're looking at is very general. So unless Intel and AMD are both willing to publicly document that it's okay to handle stack overflow like this, where any instruction in the ISA may have caused the overflow, then we're not going to do it.

There are some other options: you could pre-map

Also, I think the whole memory allocation concept in this series is a bit odd. Fundamentally, we *can't* block on these stack faults -- we may be in a context where blocking will deadlock. We may be in the page allocator. Panicking due to kernel stack allocation would be very unpleasant. But perhaps we could have a rule that a task can only be scheduled in if there is sufficient memory available for its stack. And perhaps we could avoid ever page-faulting by filling in the PTEs for the potential stack pages but leaving them un-accessed. I *think* that no x86 implementation will fill the TLB for a non-accessed page without also setting the accessed bit, so the performance hit of filling the PTEs, running the task, and then, on schedule-out, doing the appropriate synchronization to clear the PTEs and read the accessed bit to release the pages may not be too bad. But you would need to do this cautiously in the scheduler, possibly in the *next* task but before the prev task is actually released enough to be run on a different CPU. It's going to be messy.
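
To make the accessed-bit idea concrete, here is a toy userspace model of it. This is only a sketch: every name in it (sim_stack, sim_sched_in, sim_touch, sim_sched_out) is made up for illustration and none of it is kernel API. It assumes a fixed four-page stack and a simple counter standing in for the page allocator, and it leaves out the hard part entirely, namely the TLB synchronization and the ordering against the prev task running on another CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model only. Bit positions mirror the x86 PTE layout. */
#define SIM_PTE_PRESENT  (1u << 0)
#define SIM_PTE_ACCESSED (1u << 5)
#define SIM_STACK_PAGES  4

struct sim_stack {
    uint32_t pte[SIM_STACK_PAGES];  /* one simulated PTE per stack page */
};

/*
 * Schedule-in: back every unmapped stack page up front, with the accessed
 * bit clear. If the pool cannot cover all of them, refuse to run the task
 * rather than risk a stack fault later.
 */
static bool sim_sched_in(struct sim_stack *s, size_t *free_pages)
{
    size_t needed = 0;

    for (size_t i = 0; i < SIM_STACK_PAGES; i++)
        if (!(s->pte[i] & SIM_PTE_PRESENT))
            needed++;

    if (*free_pages < needed)
        return false;

    *free_pages -= needed;
    for (size_t i = 0; i < SIM_STACK_PAGES; i++)
        if (!(s->pte[i] & SIM_PTE_PRESENT))
            s->pte[i] = SIM_PTE_PRESENT;
    return true;
}

/* Stand-in for the CPU setting the accessed bit on first use of a page. */
static void sim_touch(struct sim_stack *s, size_t page)
{
    assert(page < SIM_STACK_PAGES);
    assert(s->pte[page] & SIM_PTE_PRESENT);
    s->pte[page] |= SIM_PTE_ACCESSED;
}

/*
 * Schedule-out: pages that were mapped but never accessed hold no data, so
 * unmap them and hand them back. Accessed pages stay mapped. Returns the
 * number of pages released.
 */
static size_t sim_sched_out(struct sim_stack *s, size_t *free_pages)
{
    size_t released = 0;

    for (size_t i = 0; i < SIM_STACK_PAGES; i++) {
        if ((s->pte[i] & SIM_PTE_PRESENT) &&
            !(s->pte[i] & SIM_PTE_ACCESSED)) {
            s->pte[i] = 0;
            released++;
        }
    }
    *free_pages += released;
    return released;
}
```

In this model a task that touches only one stack page gives three pages back at schedule-out and needs only those three again at the next schedule-in. A task whose stack cannot be fully backed is simply not run, which is the "only schedule in if there is sufficient memory" rule.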