Re: [PATCH v4 09/19] ARM: LPAE: Page table maintenance for the 3-level format

From: Catalin Marinas
Date: Thu Feb 03 2011 - 17:00:19 EST


On 3 February 2011 17:56, Russell King - ARM Linux
<linux@xxxxxxxxxxxxxxxx> wrote:
> On Mon, Jan 24, 2011 at 05:55:51PM +0000, Catalin Marinas wrote:
>> The patch also introduces the L_PGD_SWAPPER flag to mark pgd entries
>> pointing to pmd tables pre-allocated in the swapper_pg_dir and avoid
>> trying to free them at run-time. This flag is 0 with the classic page
>> table format.
>
> This shouldn't be necessary.

I tried hard to find a simple way around this but couldn't, so any
suggestion is welcome. Basically we have two situations where
pgd_alloc/pgd_free are called: (1) new user mm and (2) identity
mapping. As long as we allocate a PMD for the modules/pkmap mappings,
we need to make sure it is freed (more on why this allocation is needed
below).

For (1), we can (safely?) assume that we always have a vma in the same
1GB range as MODULES_VADDR. I suspect the stack always ends up at
the top of TASK_SIZE.

For (2), there is no guarantee that this PMD is freed, so we need to
free it explicitly in pgd_free().

But we can't simply try to free the previously allocated PMD
corresponding to MODULES_VADDR. There is a situation when the user
page tables had been cleared and we get an abort for modules/pkmap. We
then copy (safely, that's only temporarily used) the corresponding
pgd_k entry (1GB) into the soon to be freed pgd. At this point
pgd_free() would try to free the PMD from swapper_pg_dir and that's
not possible.

The L_PGD_SWAPPER also comes in handy when setting up identity
mappings. Since the top PGD entries (starting with PAGE_OFFSET >>
PGDIR_SHIFT) are copied by pgd_alloc from swapper_pg_dir, we don't
want the init pgd being corrupted when PHYS_OFFSET > PAGE_OFFSET.
Hence we check L_PGD_SWAPPER and allocate another PMD if necessary.
But at some point we need to free such a PMD and can't blindly try to
free the swapper_pg_dir pages.
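
To illustrate the rule, here is a minimal user-space sketch of a free
path that skips entries marked L_PGD_SWAPPER. The toy_pgd structure,
toy_pgd_free() and the bit position chosen for the flag are assumptions
made for illustration only, not the kernel's definitions:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD   4
/* Assumed bit position, for illustration; the patch defines the real one. */
#define L_PGD_SWAPPER  (UINT64_C(1) << 55)

/* Toy pgd entry: a pointer to a PMD table plus software flag bits. */
struct toy_pgd {
	void *pmd;		/* NULL models an empty (pgd_none) entry */
	uint64_t flags;
};

/*
 * Free every PMD table owned by this pgd and return how many were freed.
 * Entries flagged L_PGD_SWAPPER point into tables owned by the init page
 * tables, so they are left untouched.
 */
static int toy_pgd_free(struct toy_pgd *pgd_base)
{
	int freed = 0;

	for (struct toy_pgd *pgd = pgd_base; pgd < pgd_base + PTRS_PER_PGD; pgd++) {
		if (!pgd->pmd)
			continue;
		if (pgd->flags & L_PGD_SWAPPER)
			continue;
		free(pgd->pmd);
		pgd->pmd = NULL;
		pgd->flags = 0;
		freed++;
	}
	return freed;
}
```

An entry copied from the init tables keeps its flag and survives the
loop, while a privately allocated PMD (modules/pkmap or identity
mapping) is released.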

>> diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
>> index 709244c..003587d 100644
>> --- a/arch/arm/mm/pgd.c
>> +++ b/arch/arm/mm/pgd.c
>> @@ -10,6 +10,7 @@
>>  #include <linux/mm.h>
>>  #include <linux/gfp.h>
>>  #include <linux/highmem.h>
>> +#include <linux/slab.h>
>>
>>  #include <asm/pgalloc.h>
>>  #include <asm/page.h>
>> @@ -17,6 +18,14 @@
>>
>>  #include "mm.h"
>>
>> +#ifdef CONFIG_ARM_LPAE
>> +#define __pgd_alloc()        kmalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL)
>> +#define __pgd_free(pgd)      kfree(pgd)
>> +#else
>> +#define __pgd_alloc()        (pgd_t *)__get_free_pages(GFP_KERNEL, 2)
>> +#define __pgd_free(pgd)      free_pages((unsigned long)pgd, 2)
>> +#endif
>> +
>>  /*
>>   * need to get a 16k page for level 1
>>   */
>> @@ -26,7 +35,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>          pmd_t *new_pmd, *init_pmd;
>>          pte_t *new_pte, *init_pte;
>>
>> -        new_pgd = (pgd_t *)__get_free_pages(GFP_KERNEL, 2);
>> +        new_pgd = __pgd_alloc();
>>          if (!new_pgd)
>>                  goto no_pgd;
>>
>> @@ -41,12 +50,21 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>
>>          clean_dcache_area(new_pgd, PTRS_PER_PGD * sizeof(pgd_t));
>>
>> +#ifdef CONFIG_ARM_LPAE
>> +        /*
>> +         * Allocate PMD table for modules and pkmap mappings.
>> +         */
>> +        new_pmd = pmd_alloc(mm, new_pgd + pgd_index(MODULES_VADDR), 0);
>> +        if (!new_pmd)
>> +                goto no_pmd;
>
> This should be a copy of the same page tables found in swapper_pg_dir -
> that's what the memcpy() above is doing.

The memcpy() above only copied between 1 and 3 entries in the pgd_k
(corresponding to 1 to 3GB kernel space). It doesn't copy the entry
corresponding to 1GB below PAGE_OFFSET that would be used by modules.
We need to allocate a new PMD for that.
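
As a quick check of the index arithmetic, here is a user-space sketch.
It assumes a PGDIR_SHIFT of 30 (1GB per entry), PAGE_OFFSET at 3GB and
MODULES_VADDR 16MB below it; these values are assumptions for the usual
3G/1G split rather than something taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT    30                     /* assumed: 1GB per pgd entry */
#define PTRS_PER_PGD   4
#define PAGE_OFFSET    UINT32_C(0xC0000000)   /* assumed: 3G/1G split */
#define MODULES_VADDR  (PAGE_OFFSET - UINT32_C(0x01000000)) /* assumed: 16MB below */

/* Index of the top-level entry that covers a 32-bit virtual address. */
static unsigned int pgd_index(uint32_t addr)
{
	unsigned int idx = addr >> PGDIR_SHIFT;

	assert(idx < PTRS_PER_PGD);
	return idx;
}

/* Number of top-level entries that cover only kernel space. */
static unsigned int kernel_pgd_entries(void)
{
	return PTRS_PER_PGD - pgd_index(PAGE_OFFSET);
}
```

With these values pgd_index(MODULES_VADDR) is 2 while
pgd_index(PAGE_OFFSET) is 3, so the module area falls into an entry
that a copy starting at PAGE_OFFSET never touches.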

The problem with the current memory map is that one PGD entry covers
1GB and the one corresponding to MODULES_VADDR is shared between user
and kernel. An alternative would be to move the kernel a bit higher
(and allow MODULES_VADDR at a 1GB boundary). The PAGE_OFFSET would be
something like 3GB + 16M, though I'm not sure what other implications
this would have.
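
A small sketch of why the alternative layout would help, under the
assumptions that the module area is 16MB directly below PAGE_OFFSET and
that user space ends where the module area starts. The helper name
modules_share_user_pgd() is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define PGDIR_SHIFT  30
#define PGDIR_SIZE   (UINT32_C(1) << PGDIR_SHIFT)   /* 1GB per pgd entry */
#define SZ_16M       UINT32_C(0x01000000)           /* assumed module area size */

/*
 * True when the module area starts in the middle of a 1GB pgd entry,
 * which means the same entry also has to map user addresses.
 */
static bool modules_share_user_pgd(uint32_t page_offset)
{
	uint32_t modules_vaddr = page_offset - SZ_16M;

	return (modules_vaddr & (PGDIR_SIZE - 1)) != 0;
}
```

For PAGE_OFFSET at 3GB the helper reports sharing; for 3GB + 16M the
module area starts exactly on a 1GB boundary and the entry is
kernel-only.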

Yet another alternative which I don't like at all is to pretend that
we only have 2 levels of page tables and always allocate 4 PMD pages +
1 PGD.

>> +#endif
>> +
>>          if (!vectors_high()) {
>>                  /*
>>                   * On ARM, first page must always be allocated since it
>>                   * contains the machine vectors.
>>                   */
>> -                new_pmd = pmd_alloc(mm, new_pgd, 0);
>> +                new_pmd = pmd_alloc(mm, new_pgd + pgd_index(0), 0);
>
> However, the first pmd table, and the first pte table only need to be
> present for the reason stated in the comment, and these need to be
> allocated.

The above change is harmless, I just added it for correctness.

>>                  if (!new_pmd)
>>                          goto no_pmd;
>>
>> @@ -66,7 +84,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>  no_pte:
>>          pmd_free(mm, new_pmd);
>>  no_pmd:
>> -        free_pages((unsigned long)new_pgd, 2);
>> +        __pgd_free(new_pgd);
>>  no_pgd:
>>          return NULL;
>>  }
>> @@ -80,20 +98,36 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd_base)
>>          if (!pgd_base)
>>                  return;
>>
>> -        pgd = pgd_base + pgd_index(0);
>> -        if (pgd_none_or_clear_bad(pgd))
>> -                goto no_pgd;
>> +        if (!vectors_high()) {
>
> No, that's wrong.  As FIRST_USER_ADDRESS is nonzero, the first pmd and
> pte table will remain allocated in spite of free_pgtables(), so this
> results in a memory leak.

I agree (and I replied to my own post earlier today), we found the
leak in testing. It is safe to remove this hunk (I had a thought that
it may trigger a bad pmd because of the identity mapping but that's
cleared already via identity_mapping_del()).

>> +                pgd = pgd_base + pgd_index(0);
>> +                if (pgd_none_or_clear_bad(pgd))
>> +                        goto no_pgd;
>>
>> -        pmd = pmd_offset(pgd, 0);
>> -        if (pmd_none_or_clear_bad(pmd))
>> -                goto no_pmd;
>> +                pmd = pmd_offset(pgd, 0);
>> +                if (pmd_none_or_clear_bad(pmd))
>> +                        goto no_pmd;
>>
>> -        pte = pmd_pgtable(*pmd);
>> -        pmd_clear(pmd);
>> -        pte_free(mm, pte);
>> +                pte = pmd_pgtable(*pmd);
>> +                pmd_clear(pmd);
>> +                pte_free(mm, pte);
>>  no_pmd:
>> -        pgd_clear(pgd);
>> -        pmd_free(mm, pmd);
>> +                pgd_clear(pgd);
>> +                pmd_free(mm, pmd);
>> +        }
>>  no_pgd:
>> -        free_pages((unsigned long) pgd_base, 2);
>> +#ifdef CONFIG_ARM_LPAE
>> +        /*
>> +         * Free modules/pkmap or identity pmd tables.
>> +         */
>> +        for (pgd = pgd_base; pgd < pgd_base + PTRS_PER_PGD; pgd++) {
>> +                if (pgd_none_or_clear_bad(pgd))
>> +                        continue;
>> +                if (pgd_val(*pgd) & L_PGD_SWAPPER)
>> +                        continue;
>> +                pmd = pmd_offset(pgd, 0);
>> +                pgd_clear(pgd);
>> +                pmd_free(mm, pmd);
>> +        }
>> +#endif
>
> And as kernel mappings in the pgd above TASK_SIZE are supposed to be
> identical across all page tables, this shouldn't be necessary.

For tasks yes, but what about the identity mapping allocations? We
could change the name of pgd_alloc() and add another parameter to
distinguish between these two scenarios.

--
Catalin