Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables

From: Dave Hansen
Date: Wed Nov 01 2017 - 13:31:57 EST


On 11/01/2017 09:08 AM, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 4:44 PM, Dave Hansen
> <dave.hansen@xxxxxxxxxxxxxxx> wrote:
>> On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>>> (c) am I reading the code correctly, and the shadow page tables are
>>> *completely* duplicated?
>>>
>>> That seems insane. Why isn't only the top level shadowed, and
>>> then lower levels are shared between the shadowed and the "kernel"
>>> page tables?
>>
>> There are obviously two PGDs. The userspace half of the PGD is an exact
>> copy so all the lower levels are shared. The userspace copying is
>> done via the code we add to native_set_pgd().
>
> So the thing that made me think you do all levels was that confusing
> kaiser_pagetable_walk() code (and to a lesser degree
> get_pa_from_mapping()).
>
> That code definitely walks and allocates all levels.
>
> So it really doesn't seem to be just sharing the top page table entry.

Yeah, they're quite lightly commented and badly named now that I go look
at them.

get_pa_from_mapping() should be called something like
get_pa_from_kernel_map(). Its job is to walk the main (kernel) page
tables and fetch the physical address backing a given virtual address.
It's only ever called on kernel addresses.
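
Condensed, that walk looks something like this (a sketch only: it
skips the huge-page and error handling the real function needs, and
the function name here is just for illustration):

static unsigned long sketch_get_pa_from_kernel_map(unsigned long vaddr)
{
	/* Walk the *kernel* page tables, level by level. */
	pgd_t *pgd = pgd_offset_k(vaddr);
	p4d_t *p4d = p4d_offset(pgd, vaddr);
	pud_t *pud = pud_offset(p4d, vaddr);
	pmd_t *pmd = pmd_offset(pud, vaddr);
	pte_t *pte = pte_offset_kernel(pmd, vaddr);

	/* Physical page frame plus the offset within the page. */
	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
}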

kaiser_pagetable_walk() should probably be
kaiser_shadow_pagetable_walk(). Its job is to walk the shadow copy and
find the location of a 4k PTE. You can then populate that PTE with the
physical address you got from get_pa_from_kernel_map() (or clear it in
the remove-mapping case).
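
Put together, the add path boils down to roughly this (a sketch only;
the real loop in kaiser_add_user_map() in the patch below carries the
full error handling):

static int sketch_shadow_map_one(unsigned long address, unsigned long flags)
{
	/* Physical address from the main kernel page tables... */
	unsigned long target = get_pa_from_kernel_map(address);
	pte_t *pte;

	if (target == (unsigned long)-1)
		return -EIO;

	/* ...goes into a PTE found (or allocated) in the shadow copy. */
	pte = kaiser_shadow_pagetable_walk(address, 0);
	if (!pte)
		return -ENOMEM;

	set_pte(pte, __pte(flags | target));
	return 0;
}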

I've attached an update to the core patch and Documentation that should
help clear this up.

> And that worries me because that seems to be a very fundamental coherency issue.
>
> I'm assuming that this is about mapping only the individual kernel
> parts, but I'd like to get comments and clarification about that.

I assume that you're really worried about having to go to two places
to do one thing, like clearing a dirty bit or unmapping a PTE,
especially when we have to do that for userspace. Thankfully, the
sharing of the page tables (under the PGD) for userspace gets rid of
most of this nastiness.
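
That sharing comes from the PGD-entry copy in native_set_pgd()
mentioned above, which amounts to something like this (a minimal
sketch; pgdp_maps_userspace() and kernel_to_shadow_pgdp() are
hypothetical helper names for illustration, not the patch's actual
code):

static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
	/*
	 * Mirror userspace PGD entries into the shadow PGD.  Both
	 * PGDs then point at the same lower-level tables, so there
	 * is only one set of PTEs, accessed bits, and dirty bits.
	 */
	if (pgdp_maps_userspace(pgdp))		/* hypothetical helper */
		kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
	*pgdp = pgd;
}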

I hope that's more clear now.
diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
index 67a70d2..5b5e9c4 100644
--- a/Documentation/x86/kaiser.txt
+++ b/Documentation/x86/kaiser.txt
@@ -1,3 +1,6 @@
+Overview
+========
+
KAISER is a countermeasure against attacks on kernel address
information. There are at least three existing, published,
approaches using the shared user/kernel mapping and hardware features
@@ -18,6 +21,35 @@ This helps ensure that side-channel attacks that leverage the
paging structures do not function when KAISER is enabled. It
can be enabled by setting CONFIG_KAISER=y

+Page Table Management
+=====================
+
+KAISER logically keeps a "copy" of the page tables that unmaps
+the kernel while in userspace. The kernel manages the page
+tables as normal, but the "copying" is done with a few tricks
+that mean we do not have to manage two full copies.
+
+The first trick is that for any new kernel mapping, we
+presume that we do not want it mapped to userspace. That means
+we normally have no copying to do. We only copy the kernel
+entries over to the shadow in response to a kaiser_add_*()
+call, which is rare.
+
+For a new userspace mapping, the kernel makes the entries in
+its page tables as normal. The only difference is when the
+kernel makes entries in the top (PGD) level. In addition to
+setting the entry in the main kernel PGD, a copy of the entry
+is made in the shadow PGD.
+
+PGD entries always point to another page table. Two PGD
+entries pointing to the same thing gives us shared page tables
+for all the lower entries. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set
+of accessed bits, dirty bits, etc...
+
+Overhead
+========
+
Protection against side-channel attacks is important. But,
this protection comes at a cost:

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 57f7637..cde9014 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -49,9 +49,21 @@
static DEFINE_SPINLOCK(shadow_table_allocation_lock);

/*
+ * This is a generic page table walker used only for walking kernel
+ * addresses. We use it to help recreate the "shadow" page tables
+ * which are used while we are in userspace.
+ *
+ * This can be called on any kernel memory address and will work
+ * with any page size and mapping type: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables. It is never used for userspace
+ * addresses.
+ *
* Returns -1 on error.
*/
-static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -59,6 +71,8 @@ static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
pmd_t *pmd;
pte_t *pte;

+ WARN_ON_ONCE(vaddr < PAGE_OFFSET);
+
pgd = pgd_offset_k(vaddr);
/*
* We made all the kernel PGDs present in kaiser_init().
@@ -111,13 +125,19 @@ static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
}

/*
- * This is a relatively normal page table walk, except that it
- * also tries to allocate page tables pages along the way.
+ * Walk the shadow copy of the page tables, optionally allocating
+ * page table pages on the way down. Does not support large pages
+ * since the data we are mapping is (generally) not large enough
+ * or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables. It is never used for userspace data.
*
* Returns a pointer to a PTE on success, or NULL on failure.
*/
#define KAISER_WALK_ATOMIC 0x1
-static pte_t *kaiser_pagetable_walk(unsigned long address, unsigned long flags)
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+ unsigned long flags)
{
pmd_t *pmd;
pud_t *pud;
@@ -207,11 +227,11 @@ int kaiser_add_user_map(const void *__start_addr, unsigned long size,
unsigned long target_address;

for (; address < end_addr; address += PAGE_SIZE) {
- target_address = get_pa_from_mapping(address);
+ target_address = get_pa_from_kernel_map(address);
if (target_address == -1)
return -EIO;

- pte = kaiser_pagetable_walk(address, false);
+ pte = kaiser_shadow_pagetable_walk(address, false);
/*
* Errors come from either -ENOMEM for a page
* table page, or something screwy that did a
@@ -348,7 +368,7 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size)
* context. This should not do any allocations because we
* should only be walking things that are known to be mapped.
*/
- pte_t *pte = kaiser_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+ pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);

/*
* We are removing a mapping that should