Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2.

From: Linus Torvalds
Date: Thu Nov 13 2014 - 18:50:14 EST


On Mon, Nov 10, 2014 at 3:53 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> So I am fine with that, it's the details that confuse me. The thing
> doesn't seem to be generic enough to be used for arbitrary page
> tables, with (for example) the shifts fixed by the VM page size and
> the size of the pte entry type. Also, the levels seem to be very
> infexible, with the page table entries being the simple case, but then
> you have that "pdep" thing that seems to be just _one_ level of page
> directory.

Ok, so let me just put my money where my mouth is, and show some
example code of a tree walker that I think is actually more generic.
Sorry for the delay, I got distracted by other things, and I wanted to
write something to show what I think might be a better approach.

NOTE NOTE NOTE! I'm not saying you have to do it this way. But before
I even show the patch, let me show you the "tree descriptor" from my
stupid test-program that uses it, and hopefully that will show what
I'm really aiming for:

struct tree_walker_definition x86_64_def = {
.total_bits = 48,
.start = 0,
.end = 0x7fffffffffff,
.levels = {
{ .level_bits = 9, .lookup = pgd_lookup },
{ .level_bits = 9, .lookup = pud_lookup },
{ .level_bits = 9, .lookup = pmd_lookup },
{ .level_bits = 9, .walker = pte_walker }
}
};

so basically, the *concept* is that you can describe a real page table
by actually *describing* it. What the above does is tell you:

- the amount of bits the tables can cover (48 is four levels of 9
bits each, leaving 12 bits - 4096 bytes - for the actual pages)

- limit the range that can be walked (this isn't really all that
important, but it does, for example, mean that the walker will
fundamentally refuse to give access to the kernel mapping)

- show how the different levels work, and what their sizes are and
how you look them up or walk them, starting from the top-most.

Anyway, I think a descriptor like the above looks *understandable*. It
kind of stands on its own, even without showing the actual code.

Now, the code to actually *walk* the above tree looks like this:

struct tree_walker walk = {
.first = 4096,
.last = 4096*512*3,
.walk = show_walk,
.hole = show_hole,
.pre_walk = show_pre_walk,
.post_walk = show_post_walk,
};

walk_tree((struct tree_entry *)pgd, &x86_64_def, &walk);

ie you use the "walk_tree()" function to walk a particular tree (in
this case it's a fake page table directory in "pgd", see the details
in the stupid test-application), giving it the tree definition and the
"walk" parameters that show what should happen for particular details
(quite often hole/pre-walk/post-walk may be NULL, my test app just
shows them being called).

Now,. in addition to that, each tree description obviously needs the
functions to show how to look up the different levels ("lookup" for
moving from one level to another, and "walker" for actually walking
the last level page table knowing how "present" bits etc work.

Now, your code had a "uint64_t mask" for the present bits, which
probably works reasonably well in practice, but I really prefer to
just have that "walker" callback instead. That way the page tables can
look like *anything*, and you can walk them, without having magic
rules that there has to be a particular bit pattern that says it's
"present".

Also, my walker actually does super-pages - ie one of the mid-level
page tables could map one big area. I didn't much test it, but the
code is actually fairly straightforward, the way it's all been set up.
So it might be buggy, but it's *close*.

Now, one place we differ is on locking. I actually think that the
person who asks to walk the tree should just do the locking
themselves. You can't really walk the tree without knowing what kind
of tree it is, and so I think the caller should just do the locking.
Obviously, the tree walker itself may have some locking in the
"pre_walk/post_walk" thing and in its lookup routines, so the
description of the tree can contain some locking of its own, but I did
*not* want to make the infrastructure itself force any particular
locking strategy.

So this does something quite different from what your patch actually
did, and does that different thing very differently. It may not really
match what you are aiming for, but I'd *really* like the first
implementation of HMM that gets merged to not over-design the locking
(which I think yours did), and I want it to make *sense* (which I
don't think your patch did).

Also, please note that this *is* just an example. It has an example
user (that is just a stupid user-level toy app to show how it all is
put together), but it's not necessarily all that featureful, and it's
definitely not very tested.

But the code is actually fairly simple. But judge for yourself.

Linus
diff --git a/include/linux/walk_tables.h b/include/linux/walk_tables.h
new file mode 100644
index 000000000000..398be60e854a
--- /dev/null
+++ b/include/linux/walk_tables.h
@@ -0,0 +1,88 @@
+/*
+ * Copyright 2014 Linus Torvalds
+ *
+ * This code is distrubuted under the GPLv2
+ */
+#ifndef __LINUX_WALK_TABLE_H
+#define __LINUX_WALK_TABLE_H
+
+struct tree_entry;
+struct tree_walker_definition;
+
+/*
+ * The 'tree_level' data only describes one particular level
+ * of the tree. The upper levels are totally invisible to the
+ * user of the tree walker, since the tree walker will walk
+ * those using the tree definitions.
+ *
+ * NOTE! "struct tree_entry" is an opaque type, and is just a
+ * used as a pointer to the particular level. You can figure
+ * out which level you are at by looking at the "tree_level",
+ * but even better is to just use different "lookup()"
+ * functions for different levels, at which point the
+ * function is inherent to the level.
+ *
+ * NOTE 2! Some trees are fixed-depth, others are not. Each level
+ * has a lookup function, and can specify whether they are a
+ * terminal level. It should also fill in the "start" field of
+ * the tree_level information to point to the next level (or
+ * to the data). A NULL start is considered to be a hole.
+ *
+ * You won't see this hole in the "walk()" callback, but holes do get
+ * the pre-walk callback so that you can track holes too.
+ *
+ * If a "lookup()" function returns that it's a terminal entry and
+ * has a non-NULL "start", we'll call the "walk()" function with that
+ * tree-entry _once_, assuming it's a "superpage" that looks like
+ * a normal final tree-entry but is just much larger. The walk function
+ * can tell from the size we give it.
+ *
+ * NOTE 3! The last level normally doesn't have a (*lookup)()
+ * function at all, just a "walk" function. For that case, we'll
+ * call the tree definition "walker()" function instead of
+ * trying to look anything up, and it is supposed to call the
+ * "walk()' callback for each entry.
+ */
+struct tree_level {
+ unsigned int entry_level;
+ unsigned int nr_entries;
+ unsigned int entry_coverage;
+ unsigned long start, end;
+ struct tree_entry *entry;
+};
+
+struct tree_walker {
+ unsigned long first, last;
+ const void *data;
+ const struct tree_walker_definition *def;
+ int (*walk)(const struct tree_walker *, struct tree_entry *, unsigned long, unsigned int);
+ int (*hole)(const struct tree_walker *, const struct tree_level *);
+ int (*pre_walk)(const struct tree_walker *, const struct tree_level *);
+ int (*post_walk)(const struct tree_walker *, const struct tree_level *);
+};
+
+/*
+ * The "lookup()" function needs to return whether this is a terminal
+ * level or not.
+ */
+enum walk_tree_lookup {
+ WALK_TREE_DESCEND,
+ WALK_TREE_HOLE,
+ WALK_TREE_SUPERENTRY,
+};
+
+struct tree_walker_level_definition {
+ unsigned int level_bits;
+ enum walk_tree_lookup (*lookup)(struct tree_entry *, unsigned int, struct tree_level *);
+ int (*walker)(const struct tree_walker *, const struct tree_level *);
+};
+
+struct tree_walker_definition {
+ unsigned int total_bits;
+ unsigned long start, end;
+ struct tree_walker_level_definition levels[];
+};
+
+int walk_tree(struct tree_entry *root, const struct tree_walker_definition *, struct tree_walker *);
+
+#endif /* __LINUX_WALK_TABLE_H */
diff --git a/lib/walk_tables.c b/lib/walk_tables.c
new file mode 100644
index 000000000000..f63ac83f91d7
--- /dev/null
+++ b/lib/walk_tables.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright 2014 Linus Torvalds
+ *
+ * This code is distrubuted under the GPLv2
+ */
+#include <linux/walk_tables.h>
+
+int walk_tree(struct tree_entry *root, const struct tree_walker_definition *def, struct tree_walker *walk)
+{
+ walk->def = def;
+ if (walk->first < def->start)
+ walk->first = def->start;
+ if (walk->last > def->end)
+ walk->last = def->end;
+ while (walk->first < walk->last) {
+ const struct tree_walker_level_definition *ldef = def->levels;
+ unsigned int shift = def->total_bits;
+ struct tree_level level;
+ struct tree_entry *tree = root;
+
+ for (level.entry_level = 0; ; level.entry_level++) {
+ unsigned long mask = (1ul << shift)-1;
+
+ /* Fill in the level description */
+ level.nr_entries = 1u << ldef->level_bits;
+ level.entry_coverage = 1ul << (shift - ldef->level_bits);
+ level.start = walk->first;
+ level.end = (level.start | mask)+1;
+ if (level.end > walk->last)
+ level.end = walk->last;
+
+ if (ldef->lookup) {
+ unsigned int index = level.start >> (shift - ldef->level_bits);
+ index &= level.nr_entries-1;
+
+ switch (ldef->lookup(tree, index, &level)) {
+ case WALK_TREE_DESCEND:
+ tree = level.entry;
+ shift -= ldef->level_bits;
+ ldef++;
+ continue;
+
+ case WALK_TREE_HOLE:
+ if (walk->hole)
+ walk->hole(walk, &level);
+ break;
+
+ case WALK_TREE_SUPERENTRY:
+ if (walk->walk)
+ walk->walk(walk, level.entry, level.start, level.end - level.start);
+ break;
+ }
+ } else {
+ if (walk->pre_walk)
+ walk->pre_walk(walk, &level);
+ if (walk->walk)
+ ldef->walker(walk, &level);
+ if (walk->post_walk)
+ walk->post_walk(walk, &level);
+ }
+
+ /* Ok, done with this level */
+ walk->first = level.end;
+ break;
+ }
+ }
+ return 0;
+}
diff --git a/test_tables.c b/test_tables.c
new file mode 100644
index 000000000000..b50ce33807ff
--- /dev/null
+++ b/test_tables.c
@@ -0,0 +1,101 @@
+#include <stdio.h>
+#include "include/linux/walk_tables.h"
+
+/* Fake x86-64-like definitions */
+#define PRESENT (1ul << 0)
+#define HUGEPAGE (1ul << 7)
+enum walk_tree_lookup pgd_lookup(struct tree_entry *root, unsigned int index, struct tree_level *level)
+{
+ unsigned long entry = ((unsigned long *)root)[index];
+
+ level->entry = (void *)(entry & ~0xffful);
+ if (!(entry * PRESENT))
+ return WALK_TREE_HOLE;
+ return (entry & HUGEPAGE) ? WALK_TREE_SUPERENTRY: WALK_TREE_DESCEND;
+}
+
+/* For x86-64, the different levels are the same, so we can reuse the pgd walker */
+#define pud_lookup pgd_lookup
+#define pmd_lookup pgd_lookup
+
+int pte_walker(const struct tree_walker *walk, const struct tree_level *level)
+{
+ unsigned long start = level->start;
+ unsigned long end = level->end;
+ unsigned int idx = (start >> 12) & 511;
+ unsigned long *ptep = idx + (unsigned long *)level->entry;
+
+ while (start < end) {
+ unsigned long pte = *ptep;
+ if (pte & PRESENT)
+ walk->walk(walk, (struct tree_entry *)ptep, start, level->entry_coverage);
+ ptep++;
+ start += level->entry_coverage;
+ }
+}
+
+struct tree_walker_definition x86_64_def = {
+ .total_bits = 48,
+ .start = 0,
+ .end = 0x7fffffffffff,
+ .levels = {
+ { .level_bits = 9, .lookup = pgd_lookup },
+ { .level_bits = 9, .lookup = pud_lookup },
+ { .level_bits = 9, .lookup = pmd_lookup },
+ { .level_bits = 9, .walker = pte_walker }
+ }
+};
+
+/*
+ * And this is a fake walker.
+ *
+ * NOTE! The definitions and the walker are separate entities, but the walker
+ * obviously knows what it is walking, so it can look at the data
+ */
+static int show_walk(const struct tree_walker *walk, struct tree_entry *pte, unsigned long address, unsigned int size)
+{
+ unsigned long entry = *(unsigned long *)pte;
+ printf("%08lx: %08lx (%d)\n", address, entry, size);
+ return 0;
+}
+
+static int show_hole(const struct tree_walker *walk, const struct tree_level *level)
+{
+ printf("hole at %08lx (%d)\n", level->start, level->end - level->start);
+}
+
+static int show_pre_walk(const struct tree_walker *walk, const struct tree_level *level)
+{
+ printf("pre_walk %p at %08lx (%d)\n", level->start, level->entry, level->end - level->start);
+}
+
+static int show_post_walk(const struct tree_walker *walk, const struct tree_level *level)
+{
+ printf("post_walk %p at %08lx (%d)\n", level->start, level->entry, level->end - level->start);
+}
+
+
+/*
+ * initial 1:1 mapping in fake test page tables for the first 8 pages,
+ * with page index 5 missing.
+ *
+ * And mostly empty page tables.
+ */
+unsigned long pte[512] __attribute__ ((aligned (4096))) = { 0x0001, 0x1001, 0x2001, 0x3001, 0x4001, 0, 0x6001, 0x7001 };
+unsigned long pmd[512] __attribute__ ((aligned (4096))) = { 1 + (unsigned long) pte, 0, 1 + (unsigned long) pte, };
+unsigned long pud[512] __attribute__ ((aligned (4096))) = { 1 + (unsigned long) pmd, 0, 1 + (unsigned long) pmd, };
+unsigned long pgd[512] __attribute__ ((aligned (4096))) = { 1 + (unsigned long) pud, 0, 1 + (unsigned long) pud, };
+
+int main(int argc, char **argv)
+{
+ struct tree_walker walk = {
+ .first = 4096,
+ .last = 4096*512*3,
+ .walk = show_walk,
+ .hole = show_hole,
+ .pre_walk = show_pre_walk,
+ .post_walk = show_post_walk,
+ };
+
+ walk_tree((struct tree_entry *)pgd, &x86_64_def, &walk);
+}