Re: [PATCH v2] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

From: Peter Zijlstra
Date: Wed Aug 21 2019 - 06:10:29 EST


On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support: pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
>
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.

So the patch is fine, ACK on that, but that still leaves us with the
puzzle of why this didn't explode mightily and the story needs a little
more work.

> The following explains how we debugged this issue:
>
> We use huge page for hot text and thus reduces iTLB misses. As we
> benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more
> iTLB misses.
>
> To figure out the issue, I use a debug patch that dumps page table for
> a pid. The following are information from the workload pid.
>
> For the 4.16 based kernel:
>
> host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
> 0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
>
> For the 5.2 based kernel before this patch:
>
> host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
>
> The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the
> irq entry table; while 4.16 kernel doesn't have it.
>
> For the 5.2 based kernel after this patch:
>
> host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
> 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
>
> So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
> in 4.16 kernel.

This basically gives rise to more questions than it provides answers.
You seem to have 'forgotten' to provide the equivalent mappings on the
two older kernels. The fact that they're not PMD is evident, but it
would be very good to know what is mapped, and what -- if anything --
lives in the holes we've (accidentally) created.

Can you please provide more complete mappings? Basically provide the
whole cpu_entry_area mapping.

> This further reduces iTLB miss rate

What you're saying is that by using PMDs, we reduce 4K iTLB usage. But
we increase 2M iTLB usage, but for your workload this works out
favourably (a quick look at the PMU event tables for SKL didn't show me
separate 4K/2M iTLB counters :/).

> Cc: stable@xxxxxxxxxxxxxxx # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Reviewed-by: Rik van Riel <riel@xxxxxxxxxxx>
> Signed-off-by: Song Liu <songliubraving@xxxxxx>
> Cc: Joerg Roedel <jroedel@xxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> ---
> arch/x86/mm/pti.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..1337494e22ef 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr = round_up(addr + 1, PUD_SIZE);
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> - addr += PMD_SIZE;
> + addr = round_up(addr + 1, PMD_SIZE);
> continue;
> }
>
> --
> 2.17.1
>