Re: [RFC PATCH 4/7] riscv: Implement sv48 support

From: Alex Ghiti
Date: Tue Apr 07 2020 - 01:14:18 EST



On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@xxxxxxxx wrote:
By adding a new 4th level of page table, allow a 64-bit kernel to address
2^48 bytes of virtual address space: in practice, that offers roughly 160TB
of virtual address space to userspace and allows up to 64TB of physical
memory.

By default, the kernel will try to boot with a 4-level page table. If the
underlying hardware does not support it, we automatically fall back to a
standard 3-level page table by folding the new PUD level into the PGDIR
level.

Early page table preparation happens too early in the boot process to use any
device-tree entry, so in order to detect HW capabilities at runtime, we rely
on SATP ignoring writes with an unsupported mode. The mode actually used by
the kernel is then made available through cpuinfo.

Ya, I think that's the right way to go about this. There's no reason to
rely on duplicate DT mechanisms for things the ISA defines for us.
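
(To make the probing trick concrete: sv48 simply adds one more 9-bit
translation level, splitting the 48-bit VA as 9+9+9+9+12 instead of sv39's
9+9+9+12. A minimal C sketch of the detection, using the csr_read/csr_write
helpers from asm/csr.h, is below; the patch actually does the equivalent in
assembly in head.S with the MMU off, so take this purely as an illustration:

/*
 * The privileged spec guarantees that a satp write with an unimplemented
 * MODE leaves satp unchanged, so write Sv48 and read it back to see
 * whether it stuck.
 */
static bool probe_sv48(unsigned long root_ppn)
{
	unsigned long satp = SATP_MODE_48 | root_ppn;

	csr_write(CSR_SATP, satp);
	return csr_read(CSR_SATP) == satp;
}

In head.S the same property shows up as a fall-through: if the write is
ignored, the MMU stays off and execution continues past the csrw instead of
trapping to the virtual stvec.)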


Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
---
 arch/riscv/Kconfig                  |   6 +-
 arch/riscv/include/asm/csr.h        |   3 +-
 arch/riscv/include/asm/fixmap.h     |   1 +
 arch/riscv/include/asm/page.h       |  15 +++-
 arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
 arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
 arch/riscv/include/asm/pgtable.h    |   5 +-
 arch/riscv/kernel/head.S            |  37 ++++++--
 arch/riscv/mm/context.c             |   4 +-
 arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
 10 files changed, 302 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a475c78e66bc..79560e94cc7c 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -66,6 +66,7 @@ config RISCV
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select HAVE_COPY_THREAD_TLS
 	select HAVE_ARCH_KASAN if MMU && 64BIT
+	select RELOCATABLE if 64BIT

 config ARCH_MMAP_RND_BITS_MIN
 	default 18 if 64BIT
@@ -104,7 +105,7 @@ config PAGE_OFFSET
 	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
 	default 0x80000000 if 64BIT && !MMU
 	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
-	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
+	default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB

 config ARCH_FLATMEM_ENABLE
 	def_bool y
@@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
 config FIX_EARLYCON_MEM
 	def_bool MMU

+# On a 64BIT relocatable kernel, the 4-level page table is folded onto a
+# 3-level page table at runtime when sv48 is not supported.
 config PGTABLE_LEVELS
 	int
+	default 4 if 64BIT && RELOCATABLE
 	default 3 if 64BIT
 	default 2

I assume this means you're relying on relocation to move the kernel around
independently of PAGE_OFFSET in order to fold in the missing page table level?

Yes, relocation is needed to fall back to 3-level and move PAGE_OFFSET accordingly.

That seems reasonable, but it does impose a performance penalty, as relocatable
kernels require slower generated code. Additionally, there will likely be
a performance penalty due to the extra memory access on TLB misses, which is
unnecessary for workloads that don't need the longer VA width on machines
that support it.

Sorry, I had no time to answer your previous mail regarding performance: I have no numbers. But the only penalty this patchset imposes on a 3-level page table is the check in the page table management functions to know whether 4-level is activated or not. And, as you said, there is the extra cost of the relocatable kernel, which I had ignored since it is necessary anyway.
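
(To make that concrete, here is roughly what a generic page table walk looks like with this patchset applied; a minimal sketch that skips the *_none()/*_bad() checks a real walker performs, and walk_sketch is a made-up name. When pgtable_l4_enabled is false, p4d_offset() and pud_offset() degenerate to casts of their argument, so the whole check amounts to one well-predicted branch per walk:

static pte_t *walk_sketch(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_offset(pgd, addr);	/* folded: (p4d_t *)pgd */
	pud_t *pud = pud_offset(p4d, addr);	/* folded: (pud_t *)p4d */
	pmd_t *pmd = pmd_offset(pud, addr);

	return pte_offset_kernel(pmd, addr);
})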


I think the best bet here would be to have a Kconfig option for the number of
page table levels (which could be MAXPHYSMEM or a second partially free
parameter) and then another boolean argument along the lines of "also support
machines with smaller VA widths". It seems best to turn on the largest VA
width and support for folding by default, as I assume that's what distros would
do.

I'm not a big fan of a new Kconfig option to allow people to have a 3-level page table, because that implies maintaining another kernel configuration: even for us, having to compile two kernels each time we change something in the mm code will be painful.

I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to find the reserved regions so as not to overwrite one of them when copying the kernel to its new destination. And after that, he loops back to setup_vm to re-create the mapping to the new kernel.
If that's the way we take for KASLR, we can follow the same path here: boot with 4-level by default, go check what is wanted in the device tree, and if it is 3-level, loop back to setup_vm.
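
(Sketching what that could look like, mirroring setup_vm_fold_pud from this patch; note that dt_wants_sv39() and the way the choice would be encoded in the device tree are invented for illustration, nothing in this series defines them:

asmlinkage __init void setup_vm_check_dt(uintptr_t dtb_pa)
{
	/* dt_wants_sv39() is hypothetical: parse the dtb for the request */
	if (pgtable_l4_enabled && dt_wants_sv39(dtb_pa)) {
		/* Same fold as the sv48-unsupported path... */
		pgtable_l4_enabled = false;
		kernel_virt_addr = PAGE_OFFSET_L3;
		satp_mode = SATP_MODE_39;
		memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
		memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
		/* ...then loop back to rebuild the early mappings */
		setup_vm(dtb_pa);
	}
})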


I didn't really look closely at the rest of this, but it generally smells OK.
The diff will need to be somewhat different for the next version, anyway :)

Thanks for doing this!

diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 435b65532e29..3828d55af85e 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,11 +40,10 @@
 #ifndef CONFIG_64BIT
 #define SATP_PPN	_AC(0x003FFFFF, UL)
 #define SATP_MODE_32	_AC(0x80000000, UL)
-#define SATP_MODE	SATP_MODE_32
 #else
 #define SATP_PPN	_AC(0x00000FFFFFFFFFFF, UL)
 #define SATP_MODE_39	_AC(0x8000000000000000, UL)
-#define SATP_MODE	SATP_MODE_39
+#define SATP_MODE_48	_AC(0x9000000000000000, UL)
 #endif

 /* Exception cause high bit - is an interrupt if set */
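
(Sanity check on the magic numbers above, for reviewers: satp.MODE occupies bits 63:60 on RV64, with 8 selecting Sv39 and 9 selecting Sv48, so the constants are just the mode number shifted into place:

_Static_assert(SATP_MODE_39 == (8UL << 60), "Sv39 is satp MODE 8");
_Static_assert(SATP_MODE_48 == (9UL << 60), "Sv48 is satp MODE 9");
)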
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 42d2c42f3cc9..26e7799c5675 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -27,6 +27,7 @@ enum fixed_addresses {
 	FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
 	FIX_PTE,
 	FIX_PMD,
+	FIX_PUD,
 	FIX_EARLYCON_MEM_BASE,
 	__end_of_fixed_addresses
 };
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 691f2f9ded2f..f1a26a0690ef 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -32,11 +32,19 @@
  * physical memory (aligned on a page boundary).
  */
 #ifdef CONFIG_RELOCATABLE
-extern unsigned long kernel_virt_addr;
 #define PAGE_OFFSET		kernel_virt_addr
+
+#ifdef CONFIG_64BIT
+/*
+ * By default, the CONFIG_PAGE_OFFSET value corresponds to the SV48 address
+ * space, so define the PAGE_OFFSET value for SV39 here.
+ */
+#define PAGE_OFFSET_L3		0xffffffe000000000
+#define PAGE_OFFSET_L4		_AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_64BIT */
 #else
 #define PAGE_OFFSET		_AC(CONFIG_PAGE_OFFSET, UL)
-#endif
+#endif /* CONFIG_RELOCATABLE */

 #define KERN_VIRT_SIZE		-PAGE_OFFSET

@@ -104,6 +112,9 @@ extern unsigned long pfn_base;

 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long kernel_virt_addr;
+#endif

 #define __pa_to_va_nodebug(x)	((void *)((unsigned long) (x) + va_pa_offset))
 #define __va_to_pa_nodebug(x)	((unsigned long)(x) - va_pa_offset)
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 3f601ee8233f..540eaa5a8658 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)

 	set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
 }
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+	if (pgtable_l4_enabled) {
+		unsigned long pfn = virt_to_pfn(pud);
+
+		set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+	}
+}
+
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
+				     pud_t *pud)
+{
+	if (pgtable_l4_enabled) {
+		unsigned long pfn = virt_to_pfn(pud);
+
+		set_p4d_safe(p4d,
+			     __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+	}
+}
+
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+	if (pgtable_l4_enabled)
+		return (pud_t *)__get_free_page(
+				GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return NULL;
+}
+
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+	if (pgtable_l4_enabled)
+		free_page((unsigned long)pud);
+}
+
+#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
 #endif /* __PAGETABLE_PMD_FOLDED */

 #define pmd_pgtable(pmd)	pmd_page(pmd)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index b15f70a1fdfa..cc4ffbe778f3 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -8,16 +8,32 @@

 #include <linux/const.h>

-#define PGDIR_SHIFT	30
+extern bool pgtable_l4_enabled;
+
+#define PGDIR_SHIFT	(pgtable_l4_enabled ? 39 : 30)
 /* Size of region mapped by a page global directory */
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))

+/* pud is folded into pgd in case of 3-level page table */
+#define PUD_SHIFT	30
+#define PUD_SIZE	(_AC(1, UL) << PUD_SHIFT)
+#define PUD_MASK	(~(PUD_SIZE - 1))
+
 #define PMD_SHIFT	21
 /* Size of region mapped by a page middle directory */
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK	(~(PMD_SIZE - 1))

+/* Page Upper Directory entry */
+typedef struct {
+	unsigned long pud;
+} pud_t;
+
+#define pud_val(x)	((x).pud)
+#define __pud(x)	((pud_t) { (x) })
+#define PTRS_PER_PUD	(PAGE_SIZE / sizeof(pud_t))
+
 /* Page Middle Directory entry */
 typedef struct {
 	unsigned long pmd;
@@ -25,7 +41,6 @@ typedef struct {

 #define pmd_val(x)	((x).pmd)
 #define __pmd(x)	((pmd_t) { (x) })
-
 #define PTRS_PER_PMD	(PAGE_SIZE / sizeof(pmd_t))

 static inline int pud_present(pud_t pud)
@@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
 	set_pud(pudp, __pud(0));
 }

+static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
+{
+	return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+}
+
+static inline unsigned long _pud_pfn(pud_t pud)
+{
+	return pud_val(pud) >> _PAGE_PFN_SHIFT;
+}
+
 static inline unsigned long pud_page_vaddr(pud_t pud)
 {
 	return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
@@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
 	return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
 }

+#define mm_pud_folded	mm_pud_folded
+static inline bool mm_pud_folded(struct mm_struct *mm)
+{
+	if (pgtable_l4_enabled)
+		return false;
+
+	return true;
+}
+
 #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))

 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
@@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
 #define pmd_ERROR(e) \
 	pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))

+#define pud_ERROR(e)	\
+	pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		*p4dp = p4d;
+	else
+		set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
+}
+
+static inline int p4d_none(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (p4d_val(p4d) == 0);
+
+	return 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (p4d_val(p4d) & _PAGE_PRESENT);
+
+	return 1;
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return !p4d_present(p4d);
+
+	return 0;
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+	if (pgtable_l4_enabled)
+		set_p4d(p4d, __p4d(0));
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (unsigned long)pfn_to_virt(
+				p4d_val(p4d) >> _PAGE_PFN_SHIFT);
+
+	return pud_page_vaddr((pud_t) { p4d_val(p4d) });
+}
+
+#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+	if (pgtable_l4_enabled)
+		return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+
+	return (pud_t *)p4d;
+}
+
 #endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index dce401eed1d3..06361db3f486 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -13,8 +13,7 @@

 #ifndef __ASSEMBLY__

-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable-nop4d.h>
 #include <asm/page.h>
 #include <asm/tlbflush.h>
 #include <linux/mm_types.h>
@@ -27,7 +26,7 @@

 #ifdef CONFIG_MMU
 #ifdef CONFIG_64BIT
-#define VA_BITS		39
+#define VA_BITS		(pgtable_l4_enabled ? 48 : 39)
 #define PA_BITS		56
 #else
 #define VA_BITS		32
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 1c2fbefb8786..22617bd7477f 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -113,6 +113,8 @@ clear_bss_done:
 	call setup_vm
 #ifdef CONFIG_MMU
 	la a0, early_pg_dir
+	la a1, satp_mode
+	REG_L a1, (a1)
 	call relocate
 #endif /* CONFIG_MMU */

@@ -131,24 +133,28 @@ clear_bss_done:
 #ifdef CONFIG_MMU
 relocate:
 #ifdef CONFIG_RELOCATABLE
-	/* Relocate return address */
-	la a1, kernel_virt_addr
-	REG_L a1, 0(a1)
+	/*
+	 * Relocate return address but save it in case the 4-level page
+	 * table is not supported.
+	 */
+	mv s1, ra
+	la a3, kernel_virt_addr
+	REG_L a3, 0(a3)
 #else
-	li a1, PAGE_OFFSET
+	li a3, PAGE_OFFSET
 #endif
 	la a2, _start
-	sub a1, a1, a2
-	add ra, ra, a1
+	sub a3, a3, a2
+	add ra, ra, a3

 	/* Point stvec to virtual address of instruction after satp write */
 	la a2, 1f
-	add a2, a2, a1
+	add a2, a2, a3
 	csrw CSR_TVEC, a2

+	/* First try with a 4-level page table */
 	/* Compute satp for kernel page tables, but don't load it yet */
 	srl a2, a0, PAGE_SHIFT
-	li a1, SATP_MODE
 	or a2, a2, a1

 	/*
@@ -162,6 +168,19 @@ relocate:
 	or a0, a0, a1
 	sfence.vma
 	csrw CSR_SATP, a0
+#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
+	/*
+	 * If we fall through here, the HW does not support SV48: we need a
+	 * 3-level page table, so simply fold the pud into the pgd level
+	 * and jump back to relocate with 3-level parameters.
+	 */
+	call setup_vm_fold_pud
+
+	la a0, early_pg_dir
+	li a1, SATP_MODE_39
+	mv ra, s1
+	tail relocate
+#endif
 .align 2
 1:
 	/* Set trap vector to spin forever to help debug */
@@ -213,6 +232,8 @@ relocate:
 #ifdef CONFIG_MMU
 	/* Enable virtual memory and relocate to virtual address */
 	la a0, swapper_pg_dir
+	la a1, satp_mode
+	REG_L a1, (a1)
 	call relocate
 #endif
Â#endif

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 613ec81a8979..152b423c02ea 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -9,6 +9,8 @@
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>

+extern uint64_t satp_mode;
+
 /*
  * When necessary, performs a deferred icache flush for the given MM context,
  * on the local CPU. RISC-V has no direct mechanism for instruction cache
@@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	cpumask_set_cpu(cpu, mm_cpumask(next));

 #ifdef CONFIG_MMU
-	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
+	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
 	local_flush_tlb_all();
 #endif

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 18bbb426848e..ad96667d2ab6 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -24,6 +24,17 @@

 #include "../kernel/head.h"

+#ifdef CONFIG_64BIT
+uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
+				SATP_MODE_39 : SATP_MODE_48;
+bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
+#else
+uint64_t satp_mode = SATP_MODE_32;
+bool pgtable_l4_enabled = false;
+#endif
+EXPORT_SYMBOL(pgtable_l4_enabled);
+EXPORT_SYMBOL(satp_mode);
+
 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
							__page_aligned_bss;
 EXPORT_SYMBOL(empty_zero_page);
@@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,

 #ifndef __PAGETABLE_PMD_FOLDED

+pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
+pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);

 static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 {
@@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
 	if (mmu_enabled)
 		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);

-	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
+	/* Only one PMD is available for early mapping */
+	BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);

 	return (uintptr_t)early_pmd;
 }
@@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
 	create_pte_mapping(ptep, va, pa, sz, prot);
 }

-#define pgd_next_t		pmd_t
-#define alloc_pgd_next(__va)	alloc_pmd(__va)
-#define get_pgd_next_virt(__pa)	get_pmd_virt(__pa)
+static pud_t *__init get_pud_virt(phys_addr_t pa)
+{
+	if (mmu_enabled) {
+		clear_fixmap(FIX_PUD);
+		return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
+	} else {
+		return (pud_t *)((uintptr_t)pa);
+	}
+}
+
+static phys_addr_t __init alloc_pud(uintptr_t va)
+{
+	if (mmu_enabled)
+		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+
+	/* Only one PUD is available for early mapping */
+	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
+
+	return (uintptr_t)early_pud;
+}
+
+static void __init create_pud_mapping(pud_t *pudp,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot)
+{
+	pmd_t *nextp;
+	phys_addr_t next_phys;
+	uintptr_t pud_index = pud_index(va);
+
+	if (sz == PUD_SIZE) {
+		if (pud_val(pudp[pud_index]) == 0)
+			pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
+		return;
+	}
+
+	if (pud_val(pudp[pud_index]) == 0) {
+		next_phys = alloc_pmd(va);
+		pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
+		nextp = get_pmd_virt(next_phys);
+		memset(nextp, 0, PAGE_SIZE);
+	} else {
+		next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
+		nextp = get_pmd_virt(next_phys);
+	}
+
+	create_pmd_mapping(nextp, va, pa, sz, prot);
+}
+
+#define pgd_next_t		pud_t
+#define alloc_pgd_next(__va)	alloc_pud(__va)
+#define get_pgd_next_virt(__pa)	get_pud_virt(__pa)
 #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
-	create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next		fixmap_pmd
+	create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
+#define fixmap_pgd_next		(pgtable_l4_enabled ?			\
+			(uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
+#define trampoline_pgd_next	(pgtable_l4_enabled ?			\
+			(uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
 #else
 #define pgd_next_t		pte_t
 #define alloc_pgd_next(__va)	alloc_pte(__va)
 #define get_pgd_next_virt(__pa)	get_pte_virt(__pa)
 #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
 	create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next		fixmap_pte
+#define fixmap_pgd_next		((uintptr_t)fixmap_pte)
 #endif

 static void __init create_pgd_mapping(pgd_t *pgdp,
@@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
 	phys_addr_t next_phys;
 	uintptr_t pgd_index = pgd_index(va);

+#ifndef __PAGETABLE_PMD_FOLDED
+	if (!pgtable_l4_enabled) {
+		create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
+		return;
+	}
+#endif
+
 	if (sz == PGDIR_SIZE) {
 		if (pgd_val(pgdp[pgd_index]) == 0)
 			pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
@@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)

 	/* Setup early PGD for fixmap */
 	create_pgd_mapping(early_pg_dir, FIXADDR_START,
-			   (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+			   fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);

 #ifndef __PAGETABLE_PMD_FOLDED
-	/* Setup fixmap PMD */
+	/* Setup fixmap PUD and PMD */
+	if (pgtable_l4_enabled)
+		create_pud_mapping(fixmap_pud, FIXADDR_START,
+			   (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
 	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
 			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
+
 	/* Setup trampoline PGD and PMD */
 	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
-			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
+			   trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+	if (pgtable_l4_enabled)
+		create_pud_mapping(trampoline_pud, PAGE_OFFSET,
+			   (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
 	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
 			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
 #else
@@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	dtb_early_pa = dtb_pa;
 }

+#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
+/*
+ * This function is called only if the current kernel is 64bit and the HW
+ * does not support sv48.
+ */
+asmlinkage __init void setup_vm_fold_pud(void)
+{
+	pgtable_l4_enabled = false;
+	kernel_virt_addr = PAGE_OFFSET_L3;
+	satp_mode = SATP_MODE_39;
+
+	/*
+	 * The PTE/PMD levels do not need to be cleared as they are common
+	 * between 3- and 4-level page tables: the 30 least significant bits
+	 * (2 * 9 + 12) are translated the same way.
+	 */
+	memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
+	memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
+
+	setup_vm(dtb_early_pa);
+}
+#endif
+
 static void __init setup_vm_final(void)
 {
 	uintptr_t va, map_size;
@@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
 		}
 	}

-	/* Clear fixmap PTE and PMD mappings */
+	/* Clear fixmap page table mappings */
 	clear_fixmap(FIX_PTE);
 	clear_fixmap(FIX_PMD);
+	clear_fixmap(FIX_PUD);

 	/* Move to swapper page table */
-	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
+	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
 	local_flush_tlb_all();
 }
 #else

Alex