[RFC PATCH 00/47] Address Space Isolation for KVM

From: Junaid Shahid
Date: Wed Feb 23 2022 - 00:23:04 EST


This patch series is a proof-of-concept RFC for an end-to-end implementation of
Address Space Isolation (ASI) for KVM. It has similar goals to the original ASI
patches from Alexandre Chartre ([1],[2],[3],[4]) and a broadly similar
high-level design, but a different underlying implementation. The series also
includes several memory management changes that help differentiate between
sensitive and non-sensitive memory and map non-sensitive memory into the ASI
restricted address spaces.

This RFC is intended as a demonstration of what a full ASI implementation for
KVM could look like, not necessarily as a direct proposal for what might
eventually be merged. In particular, these patches do not yet implement KPTI on
top of ASI, although the framework is generic enough to be able to support it.
Similarly, these patches do not include non-sensitive annotations for data
structures that were not frequently accessed during execution of our test
workloads, but the framework is designed such that new non-sensitive memory
annotations can be added trivially.

The patches apply on top of Linux v5.16. These patches are also available via
gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

Background
==========
Address Space Isolation is a comprehensive security mitigation for several types
of speculative execution attacks. The kernel already has several mitigations for
speculative execution vulnerabilities, but some of them can be quite expensive
when fully enabled. For example, fully mitigating L1TF with the existing
mechanisms requires an L1 cache flush on every single VM entry as well as
disabling hyperthreading altogether. (Although core scheduling can provide some
protection when hyperthreading is enabled, it is not sufficient by itself to
protect against all leaks unless sibling hyperthread stunning is also performed
on every VM exit.) ASI provides a much less expensive mitigation for such
vulnerabilities while still providing nearly the same level of protection.

There are a couple of basic insights/assumptions behind ASI:

1. Most execution paths in the kernel (especially during virtual machine
execution) access only memory that is not particularly sensitive even if it were
to get leaked to the executing process/VM (setting aside for a moment what
exactly should be considered sensitive or non-sensitive).
2. Even when executing speculatively, the CPU can generally only bring memory
that is mapped in the current page tables into its various caches and internal
buffers.

Given these, the idea of using ASI to thwart speculative attacks is that we can
execute the kernel using a restricted set of page tables most of the time and
switch to the full unrestricted kernel address space only when the kernel needs
to access something that is not mapped in the restricted address space. We keep
track of when a switch to the full kernel address space is done, so that we can
switch back to the restricted address space before returning to the process/VM.
In the paths where the kernel is able to execute entirely while remaining in the
restricted address space, we can skip other mitigations for speculative
execution attacks (such as L1 cache / micro-arch buffer flushes, sibling
hyperthread stunning etc.). Only when we do end up switching the page tables do
we perform these more expensive mitigations. Assuming that happens relatively
infrequently, performance can be significantly better than when these
mitigations are performed all the time.
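
As a rough user-space illustration of this accounting (not code from these
patches), the model below counts how often the expensive mitigations would run
over a sequence of VM exits. The names sim_count_mitigations and
touches_sensitive are hypothetical. The model assumes one mitigation pass per
VM entry without ASI, and one pass per switch into the restricted address
space with ASI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model, not the kernel implementation.
 * touches_sensitive[i] says whether handling VM exit i needs memory that is
 * not mapped in the restricted address space. Returns how many times the
 * expensive mitigations (L1 flush, stunning etc.) run across n VM entries.
 */
unsigned long sim_count_mitigations(const bool *touches_sensitive, size_t n,
				    bool asi_enabled)
{
	bool restricted = false;	/* the kernel starts out unrestricted */
	unsigned long mitigations = 0;
	size_t i;

	assert(touches_sensitive != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* VM entry */
		if (!asi_enabled) {
			mitigations++;		/* paid on every entry */
		} else if (!restricted) {
			restricted = true;	/* switch to restricted tables */
			mitigations++;		/* paid only on this switch */
		}

		/* VM exit handling */
		if (asi_enabled && touches_sensitive[i])
			restricted = false;	/* switch to the full tables */
	}
	return mitigations;
}
```

For four VM exits where only the third one touches sensitive memory, this model
gives 2 mitigation passes with ASI (the initial entry plus one re-entry) and 4
without it.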

Please note that although we do have a sibling hyperthread stunning
implementation internally, which is fully integrated with KVM-ASI, it is not
included in this RFC for the time being. The earlier upstream proposal for
sibling stunning [6] could potentially be integrated into an upstream ASI
implementation.

Basic concepts
==============
Different types of restricted address spaces are represented by different ASI
classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
would be another ASI class. An ASI instance (struct asi) represents a single
restricted address space. There is a separate ASI instance for each untrusted
context (e.g. a userspace process, a VM, or even a single VCPU). Note that
there can be multiple untrusted security contexts (and thus multiple restricted
address spaces) within a single process. For example, in the case of VMs, the
userspace process is a different security context than the guest VM, and in
principle, even each VCPU could be considered a separate security context (that
would be primarily useful for securing nested virtualization).

In this RFC, a process can have at most one ASI instance of each class, though
this is not an inherent limitation and multiple instances of the same class
should eventually be supported. (A process can still have ASI instances of
different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
entirely necessary to tie an ASI instance to a process. That is just a
simplification for the initial implementation.

An asi_enter operation switches into the restricted address space represented by
the given ASI instance. An asi_exit operation switches to the full unrestricted
kernel address space. Each ASI class can provide hooks to be executed during
these operations, which can be used to perform speculative attack mitigations
relevant to that class. For instance, the KVM-ASI hooks would perform a
sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
hand, the hooks for the KPTI class would be no-ops, since the switching of the
page tables is enough mitigation in that case.
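
The relationship between classes, instances and hooks can be sketched in
user-space C as follows. This is only an illustrative model, not the API from
these patches: every sim_* name, the hook names post_enter/pre_exit, and the
ordering of each hook relative to the page table switch are assumptions. The
counters merely stand in for the real L1-flush/MDS-clear and stun/unstun
operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-class hooks; either pointer may be NULL (no-op). */
struct sim_asi_class {
	const char *name;
	void (*post_enter)(void);	/* after switching to restricted tables */
	void (*pre_exit)(void);		/* before switching to the full tables */
};

/* One restricted address space, belonging to one class. */
struct sim_asi {
	const struct sim_asi_class *cls;
};

/* Currently active instance; NULL means the full kernel address space. */
const struct sim_asi *sim_current;

/* Counters standing in for the real mitigation operations. */
int sim_l1_flushes, sim_stuns, sim_unstuns;

void sim_kvm_post_enter(void) { sim_l1_flushes++; sim_unstuns++; }
void sim_kvm_pre_exit(void) { sim_stuns++; }

const struct sim_asi_class sim_kvm_class = {
	"KVM-ASI", sim_kvm_post_enter, sim_kvm_pre_exit
};
const struct sim_asi_class sim_kpti_class = { "KPTI", NULL, NULL };

void sim_asi_enter(const struct sim_asi *asi)
{
	assert(asi != NULL && asi->cls != NULL);
	if (sim_current == asi)
		return;			/* already inside this address space */
	sim_current = asi;		/* stands in for the page table switch */
	if (asi->cls->post_enter)
		asi->cls->post_enter();
}

void sim_asi_exit(void)
{
	if (sim_current == NULL)
		return;			/* already unrestricted */
	if (sim_current->cls->pre_exit)
		sim_current->cls->pre_exit();
	sim_current = NULL;		/* back to the full address space */
}
```

In this model, repeated sim_asi_enter or sim_asi_exit calls are harmless: the
hooks only run when the address space actually changes, so the mitigation cost
is tied to real switches.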

If the kernel attempts to access memory that is not mapped in the currently
active ASI instance, the page fault handler automatically performs an asi_exit
operation. This means that except for a few critical pieces of memory, leaving
something out of a restricted address space will result in only a performance
hit, rather than a catastrophic failure. The kernel can also perform explicit
asi_exit operations in some paths as needed.
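
A minimal user-space model of this fault-driven fallback is shown below. It is
a sketch under stated assumptions, not the fault handler from these patches:
struct pf_cpu, pf_access and the fixed-size restricted_map array are all
hypothetical, and the full kernel address space is assumed to map every page
index below PF_NPAGES.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_NPAGES 8	/* pages mapped by the full address space (assumed) */

struct pf_cpu {
	bool restricted;		/* running on restricted page tables? */
	bool restricted_map[PF_NPAGES];	/* pages present in restricted tables */
	unsigned int implicit_exits;	/* asi_exit calls made on faults */
};

/*
 * Model one kernel access to a page. Returns false only if the page is not
 * mapped even in the full address space, i.e. a genuine fault.
 */
bool pf_access(struct pf_cpu *cpu, size_t page)
{
	assert(cpu != NULL);

	if (page >= PF_NPAGES)
		return false;		/* genuine fault */

	if (cpu->restricted && !cpu->restricted_map[page]) {
		/* Fault in restricted mode: do an asi_exit and retry. */
		cpu->restricted = false;
		cpu->implicit_exits++;
	}
	return true;			/* access succeeds */
}
```

Accessing a page that was left out of restricted_map still succeeds in this
model; the only visible effect is that implicit_exits grows and the CPU stays
unrestricted until the next explicit enter.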

Apart from the page fault handler, other exceptions and interrupts (even NMIs)
do not automatically cause an asi_exit and could potentially be executed
completely within a restricted address space if they don't end up accessing any
sensitive piece of memory.

The mappings within a restricted address space are always a subset of the full
kernel address space and each mapping is always the same as the corresponding
mapping in the full kernel address space. This is necessary because we could
potentially end up performing an asi_exit at any point.
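
The subset requirement can be stated as a simple check over two page-table
models. The sketch below is purely illustrative: it represents each address
space as a hypothetical array mapping a virtual page index to a physical frame
number, with SIM_UNMAPPED marking a hole. It is not how the kernel verifies or
stores its page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_UNMAPPED (-1L)

/*
 * full[i] and restricted[i] hold the physical frame that virtual page i maps
 * to, or SIM_UNMAPPED. Returns true if every restricted mapping also exists,
 * unchanged, in the full address space.
 */
bool sim_asi_mappings_consistent(const long *full, const long *restricted,
				 size_t npages)
{
	size_t i;

	assert(full != NULL && restricted != NULL);

	for (i = 0; i < npages; i++) {
		if (restricted[i] == SIM_UNMAPPED)
			continue;	/* left out: costs only an asi_exit */
		if (restricted[i] != full[i])
			return false;	/* absent from, or differs in, full map */
	}
	return true;
}
```

A restricted table that merely omits pages passes the check, while one that
maps a page to a different frame, or maps a page the full table lacks, does
not.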

Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
class could also be implemented on top of the same infrastructure. Furthermore,
in the future we could also implement a KPTI-Next class that actually uses the
ASI model for userspace processes i.e. mapping non-sensitive kernel memory in
the restricted address space and trying to execute most syscalls/interrupts
without switching to the full kernel address space, as opposed to the current
KPTI which requires an address space switch on every kernel/user mode
transition.

Memory classification
=====================
We divide memory into three categories.

1. Sensitive memory
This is memory that should never get leaked to any process or VM. Sensitive
memory is only mapped in the unrestricted kernel page tables. By default, all
memory is considered sensitive unless specifically categorized otherwise.

2. Globally non-sensitive memory
This is memory that does not present a substantial security threat even if it
were to get leaked to any process or VM in the system. Globally non-sensitive
memory is mapped in the restricted address spaces for all processes.

3. Locally non-sensitive memory
This is memory that does not present a substantial security threat if it were to
get leaked to the currently running process or VM, but would present a security
issue if it were to get leaked to any other process or VM in the system.
Examples include userspace memory (or guest memory in the case of VMs) or kernel
structures containing userspace/guest register context etc. Locally
non-sensitive memory is mapped only in the restricted address space of a single
process.
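
The three categories reduce to a small decision about which restricted address
spaces a region is mapped into. The following sketch captures that rule with
hypothetical names (enum sim_sensitivity, struct sim_region,
sim_mapped_in_restricted) and identifies processes by a plain integer. It
illustrates the classification only and is not the annotation mechanism used
by the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sim_sensitivity {
	SIM_SENSITIVE = 0,		/* the default for unannotated memory */
	SIM_GLOBAL_NONSENSITIVE,
	SIM_LOCAL_NONSENSITIVE,
};

struct sim_region {
	enum sim_sensitivity kind;
	int owner;	/* owning process; used only for local non-sensitive */
};

/* Is region r mapped in the restricted address space of the given process? */
bool sim_mapped_in_restricted(const struct sim_region *r, int process)
{
	assert(r != NULL);

	switch (r->kind) {
	case SIM_GLOBAL_NONSENSITIVE:
		return true;			/* every restricted space */
	case SIM_LOCAL_NONSENSITIVE:
		return r->owner == process;	/* the owner's space only */
	case SIM_SENSITIVE:
	default:
		return false;			/* full address space only */
	}
}
```

Because SIM_SENSITIVE is the zero value, a zero-initialized region is treated
as sensitive, mirroring the rule that memory is sensitive unless categorized
otherwise.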

Various mechanisms are provided to annotate different types of memory (static,
buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
addition, the ASI infrastructure takes care to ensure that different classes of
memory do not share the same physical page. This includes separation of
sensitive, globally non-sensitive and locally non-sensitive memory into
different pages and also separation of locally non-sensitive memory for
different processes into different pages as well.
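
The page-separation rule can be illustrated with a toy allocator that hands out
fixed-size slots and never places objects of different classes, or locally
non-sensitive objects of different processes, in the same page. Everything
here (struct pg_pool, pg_alloc, the slot and page limits) is a hypothetical
model and does not reflect how the buddy, slab or vmalloc changes in this
series are implemented.

```c
#include <assert.h>
#include <stddef.h>

enum pg_class { PG_SENSITIVE, PG_GLOBAL_NONSENSITIVE, PG_LOCAL_NONSENSITIVE };

#define PG_MAX_PAGES		16
#define PG_SLOTS_PER_PAGE	4

struct pg_page {
	enum pg_class cls;
	int owner;	/* meaningful only for PG_LOCAL_NONSENSITIVE */
	int used;	/* slots handed out so far */
};

struct pg_pool {
	struct pg_page pages[PG_MAX_PAGES];
	int npages;
};

/* Returns the index of the page backing the new object, or -1 if full. */
int pg_alloc(struct pg_pool *pool, enum pg_class cls, int owner)
{
	int i;

	assert(pool != NULL);

	if (cls != PG_LOCAL_NONSENSITIVE)
		owner = 0;	/* owner is irrelevant for the other classes */

	/* Reuse a page only if both class and owner match. */
	for (i = 0; i < pool->npages; i++) {
		struct pg_page *p = &pool->pages[i];

		if (p->cls == cls && p->owner == owner &&
		    p->used < PG_SLOTS_PER_PAGE) {
			p->used++;
			return i;
		}
	}

	if (pool->npages == PG_MAX_PAGES)
		return -1;

	/* Otherwise open a fresh page dedicated to this class and owner. */
	i = pool->npages++;
	pool->pages[i].cls = cls;
	pool->pages[i].owner = owner;
	pool->pages[i].used = 1;
	return i;
}
```

With this rule, mapping a whole page into a restricted address space never
exposes an object of a different class or owner that happens to share the
page.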

What exactly should be considered non-sensitive (either globally or locally) is
somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
many things also fall into a gray area, depending on how paranoid one wants to
be. For this proof of concept, we have generally treated such things as
non-sensitive, though that may not necessarily be the ideal classification in
each case. Similarly, there is also a gray area between globally and locally
non-sensitive classifications in some cases, and in those cases this RFC has
mostly erred on the side of marking them as locally non-sensitive, even though
many of those cases could likely be safely classified as globally non-sensitive.

Although this implementation includes fairly extensive support for marking most
types of dynamically allocated memory as locally non-sensitive, it may be
feasible, at least for KVM-ASI, to get away with a simpler implementation (such
as [5]), if we are very selective about what memory we treat as locally
non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
general mechanism is included in this proof of concept as an illustration of
what could be done if we really needed to treat any arbitrary kernel memory as
locally non-sensitive.

It is also possible to have ASI classes that do not use the infrastructure
described above and instead manage all the memory mappings inside the restricted
address space on their own.


References
==========
[1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@xxxxxxxxxx
[2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@xxxxxxxxxx
[3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@xxxxxxxxxx
[4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@xxxxxxxxxx
[5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@xxxxxxxxx
[6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@xxxxxxxxxxxxxxxxx

Cc: Paul Turner <pjt@xxxxxxxxxx>
Cc: Jim Mattson <jmattson@xxxxxxxxxx>
Cc: Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx>
Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>


Junaid Shahid (32):
mm: asi: Introduce ASI core API
mm: asi: Add command-line parameter to enable/disable ASI
mm: asi: Switch to unrestricted address space when entering scheduler
mm: asi: ASI support in interrupts/exceptions
mm: asi: Make __get_current_cr3_fast() ASI-aware
mm: asi: ASI page table allocation and free functions
mm: asi: Functions to map/unmap a memory range into ASI page tables
mm: asi: Add basic infrastructure for global non-sensitive mappings
mm: Add __PAGEFLAG_FALSE
mm: asi: Support for global non-sensitive direct map allocations
mm: asi: Global non-sensitive vmalloc/vmap support
mm: asi: Support for global non-sensitive slab caches
mm: asi: Disable ASI API when ASI is not enabled for a process
kvm: asi: Restricted address space for VM execution
mm: asi: Support for mapping non-sensitive pcpu chunks
mm: asi: Aliased direct map for local non-sensitive allocations
mm: asi: Support for pre-ASI-init local non-sensitive allocations
mm: asi: Support for locally nonsensitive page allocations
mm: asi: Support for locally non-sensitive vmalloc allocations
mm: asi: Add support for locally non-sensitive VM_USERMAP pages
mm: asi: Add support for mapping all userspace memory into ASI
mm: asi: Support for local non-sensitive slab caches
mm: asi: Avoid warning from NMI userspace accesses in ASI context
mm: asi: Use separate PCIDs for restricted address spaces
mm: asi: Avoid TLB flushes during ASI CR3 switches when possible
mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context
mm: asi: Reduce TLB flushes when freeing pages asynchronously
mm: asi: Add API for mapping userspace address ranges
mm: asi: Support for non-sensitive SLUB caches
x86: asi: Allocate FPU state separately when ASI is enabled.
kvm: asi: Map guest memory into restricted ASI address space
kvm: asi: Unmap guest memory from ASI address space when using nested virt

Ofir Weisse (15):
asi: Added ASI memory cgroup flag
mm: asi: Added refcounting when initilizing an asi
mm: asi: asi_exit() on PF, skip handling if address is accessible
mm: asi: Adding support for dynamic percpu ASI allocations
mm: asi: ASI annotation support for static variables.
mm: asi: ASI annotation support for dynamic modules.
mm: asi: Skip conventional L1TF/MDS mitigations
mm: asi: support for static percpu DEFINE_PER_CPU*_ASI
mm: asi: Annotation of static variables to be nonsensitive
mm: asi: Annotation of PERCPU variables to be nonsensitive
mm: asi: Annotation of dynamic variables to be nonsensitive
kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts
mm: asi: Mapping global nonsensitive areas in asi_global_init
kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace
mm: asi: Properly un/mapping task stack from ASI + tlb flush

arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/csky/include/asm/Kbuild | 1 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nds32/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/events/core.c | 6 +-
arch/x86/events/intel/bts.c | 2 +-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/msr.c | 2 +-
arch/x86/events/perf_event.h | 4 +-
arch/x86/include/asm/asi.h | 215 ++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/current.h | 2 +-
arch/x86/include/asm/debugreg.h | 2 +-
arch/x86/include/asm/desc.h | 2 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 3 +-
arch/x86/include/asm/hardirq.h | 2 +-
arch/x86/include/asm/hw_irq.h | 2 +-
arch/x86/include/asm/idtentry.h | 25 +-
arch/x86/include/asm/kvm_host.h | 124 +-
arch/x86/include/asm/page.h | 19 +-
arch/x86/include/asm/page_64.h | 27 +-
arch/x86/include/asm/page_64_types.h | 20 +
arch/x86/include/asm/percpu.h | 2 +-
arch/x86/include/asm/pgtable_64_types.h | 10 +
arch/x86/include/asm/preempt.h | 2 +-
arch/x86/include/asm/processor.h | 17 +-
arch/x86/include/asm/smp.h | 2 +-
arch/x86/include/asm/tlbflush.h | 49 +-
arch/x86/include/asm/topology.h | 2 +-
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/apic/apic.c | 2 +-
arch/x86/kernel/apic/x2apic_cluster.c | 8 +-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/cpu/common.c | 12 +-
arch/x86/kernel/e820.c | 7 +-
arch/x86/kernel/fpu/core.c | 47 +-
arch/x86/kernel/fpu/init.c | 7 +-
arch/x86/kernel/fpu/internal.h | 1 +
arch/x86/kernel/fpu/xstate.c | 21 +-
arch/x86/kernel/head_64.S | 12 +
arch/x86/kernel/hw_breakpoint.c | 2 +-
arch/x86/kernel/irq.c | 2 +-
arch/x86/kernel/irqinit.c | 2 +-
arch/x86/kernel/nmi.c | 6 +-
arch/x86/kernel/process.c | 13 +-
arch/x86/kernel/setup.c | 4 +-
arch/x86/kernel/setup_percpu.c | 4 +-
arch/x86/kernel/smp.c | 2 +-
arch/x86/kernel/smpboot.c | 3 +-
arch/x86/kernel/traps.c | 2 +
arch/x86/kernel/tsc.c | 10 +-
arch/x86/kernel/vmlinux.lds.S | 2 +-
arch/x86/kvm/cpuid.c | 18 +-
arch/x86/kvm/kvm_cache_regs.h | 22 +-
arch/x86/kvm/lapic.c | 11 +-
arch/x86/kvm/mmu.h | 16 +-
arch/x86/kvm/mmu/mmu.c | 209 ++--
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 40 +-
arch/x86/kvm/mmu/spte.c | 6 +-
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 14 +-
arch/x86/kvm/mtrr.c | 2 +-
arch/x86/kvm/svm/nested.c | 34 +-
arch/x86/kvm/svm/sev.c | 70 +-
arch/x86/kvm/svm/svm.c | 52 +-
arch/x86/kvm/trace.h | 10 +-
arch/x86/kvm/vmx/capabilities.h | 14 +-
arch/x86/kvm/vmx/nested.c | 90 +-
arch/x86/kvm/vmx/vmx.c | 152 ++-
arch/x86/kvm/x86.c | 315 +++--
arch/x86/kvm/x86.h | 4 +-
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 1397 ++++++++++++++++++++++
arch/x86/mm/fault.c | 67 +-
arch/x86/mm/init.c | 7 +-
arch/x86/mm/init_64.c | 26 +-
arch/x86/mm/kaslr.c | 34 +-
arch/x86/mm/mm_internal.h | 5 +
arch/x86/mm/physaddr.c | 8 +
arch/x86/mm/tlb.c | 419 ++++++-
arch/xtensa/include/asm/Kbuild | 1 +
fs/binfmt_elf.c | 2 +-
fs/eventfd.c | 2 +-
fs/eventpoll.c | 10 +-
fs/exec.c | 7 +
fs/file.c | 3 +-
fs/timerfd.c | 2 +-
include/asm-generic/asi.h | 149 +++
include/asm-generic/irq_regs.h | 2 +-
include/asm-generic/percpu.h | 6 +
include/asm-generic/vmlinux.lds.h | 36 +-
include/linux/arch_topology.h | 2 +-
include/linux/debug_locks.h | 4 +-
include/linux/gfp.h | 13 +-
include/linux/hrtimer.h | 2 +-
include/linux/interrupt.h | 2 +-
include/linux/jiffies.h | 4 +-
include/linux/kernel_stat.h | 4 +-
include/linux/kvm_host.h | 7 +-
include/linux/kvm_types.h | 3 +
include/linux/memcontrol.h | 3 +
include/linux/mm_types.h | 59 +
include/linux/module.h | 15 +
include/linux/notifier.h | 2 +-
include/linux/page-flags.h | 19 +
include/linux/percpu-defs.h | 39 +
include/linux/percpu.h | 8 +-
include/linux/pgtable.h | 3 +
include/linux/prandom.h | 2 +-
include/linux/profile.h | 2 +-
include/linux/rcupdate.h | 4 +-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 5 +
include/linux/sched/mm.h | 12 +
include/linux/sched/sysctl.h | 1 +
include/linux/slab.h | 68 +-
include/linux/slab_def.h | 4 +
include/linux/slub_def.h | 6 +
include/linux/vmalloc.h | 16 +-
include/trace/events/mmflags.h | 14 +-
init/main.c | 2 +-
kernel/cgroup/cgroup.c | 9 +-
kernel/cpu.c | 14 +-
kernel/entry/common.c | 6 +
kernel/events/core.c | 25 +-
kernel/exit.c | 2 +
kernel/fork.c | 69 +-
kernel/freezer.c | 2 +-
kernel/irq_work.c | 6 +-
kernel/locking/lockdep.c | 14 +-
kernel/module-internal.h | 1 +
kernel/module.c | 210 +++-
kernel/panic.c | 2 +-
kernel/printk/printk.c | 4 +-
kernel/profile.c | 4 +-
kernel/rcu/srcutree.c | 3 +-
kernel/rcu/tree.c | 12 +-
kernel/rcu/update.c | 4 +-
kernel/sched/clock.c | 2 +-
kernel/sched/core.c | 23 +-
kernel/sched/cpuacct.c | 10 +-
kernel/sched/cpufreq.c | 3 +-
kernel/sched/cputime.c | 4 +-
kernel/sched/fair.c | 7 +-
kernel/sched/loadavg.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 25 +-
kernel/sched/topology.c | 28 +-
kernel/smp.c | 26 +-
kernel/softirq.c | 5 +-
kernel/time/hrtimer.c | 4 +-
kernel/time/jiffies.c | 8 +-
kernel/time/ntp.c | 30 +-
kernel/time/tick-common.c | 6 +-
kernel/time/tick-internal.h | 6 +-
kernel/time/tick-sched.c | 4 +-
kernel/time/timekeeping.c | 10 +-
kernel/time/timekeeping.h | 2 +-
kernel/time/timer.c | 4 +-
kernel/trace/ring_buffer.c | 5 +-
kernel/trace/trace.c | 4 +-
kernel/trace/trace_preemptirq.c | 2 +-
kernel/trace/trace_sched_switch.c | 4 +-
kernel/tracepoint.c | 2 +-
kernel/watchdog.c | 12 +-
lib/debug_locks.c | 5 +-
lib/irq_regs.c | 2 +-
lib/radix-tree.c | 6 +-
lib/random32.c | 3 +-
mm/init-mm.c | 2 +
mm/internal.h | 3 +
mm/memcontrol.c | 37 +-
mm/memory.c | 4 +-
mm/page_alloc.c | 204 +++-
mm/percpu-internal.h | 23 +-
mm/percpu-km.c | 5 +-
mm/percpu-vm.c | 57 +-
mm/percpu.c | 273 ++++-
mm/slab.c | 42 +-
mm/slab.h | 166 ++-
mm/slab_common.c | 461 ++++++-
mm/slub.c | 140 ++-
mm/sparse.c | 4 +-
mm/util.c | 3 +-
mm/vmalloc.c | 193 ++-
net/core/skbuff.c | 2 +-
net/core/sock.c | 2 +-
security/Kconfig | 12 +
tools/perf/builtin-kmem.c | 2 +
virt/kvm/coalesced_mmio.c | 2 +-
virt/kvm/eventfd.c | 5 +-
virt/kvm/kvm_main.c | 61 +-
211 files changed, 5727 insertions(+), 959 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/mm/asi.c
create mode 100644 include/asm-generic/asi.h

--
2.35.1.473.g83b2b277ed-goog