Re: [PATCH] x86/mm/hotplug: fix BUG_ON() after hotremove by not freeing pud v2

From: Jerome Glisse
Date: Sat Jun 24 2017 - 14:06:07 EST


On Sat, Jun 24, 2017 at 08:45:59AM +0200, Ingo Molnar wrote:
>
> * Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>
> > On Wed, Jun 07, 2017 at 09:17:06PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 07, 2017 at 01:38:00PM -0400, Jerome Glisse wrote:
> > > > > On Wed, Jun 07, 2017 at 08:03:25PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Wed, Jun 07, 2017 at 10:46:20AM -0400, jglisse@xxxxxxxxxx wrote:
> > > > > > > From: Jérôme Glisse <jglisse@xxxxxxxxxx>
> > > > > > >
> > > > > > > With commit af2cf278ef4f we no longer free pud so that we do not
> > > > > > > have synchronize all pgd on hotremove/vfree. But the new 5 level
> > > > > > > page table patchset reverted that for 4 level page table.
> > > > > > >
> > > > > > > This patch restore af2cf278ef4f and disable free_pud() if we are
> > > > > > > in the 4 level page table case thus avoiding BUG_ON() after hot-
> > > > > > > remove.
> > > > > > >
> > > > > > > af2cf278ef4f x86/mm/hotplug: Don't remove PGD entries in
> > > > > > > remove_pagetable()
> > > > > > >
> > > > > > > Changed since v1:
> > > > > > > - make free_pud() conditional on the number of page table
> > > > > > > level
> > > > > > > - improved commit message
> > > > > > >
> > > > > > > Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> > > > > > > Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> > > > > > > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > > > > > > Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> > > > > > > Cc: Logan Gunthorpe <logang@xxxxxxxxxxxx>
> > > > > > > > thus we now trigger a BUG_ON() l128 in sync_global_pgds()
> > > > > > > >
> > > > > > > > This patch remove free_pud() like in af2cf278ef4f
> > > > > > > ---
> > > > > > > arch/x86/mm/init_64.c | 11 +++++++++++
> > > > > > > 1 file changed, 11 insertions(+)
> > > > > > >
> > > > > > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > > > > > index 95651dc..61028bc 100644
> > > > > > > --- a/arch/x86/mm/init_64.c
> > > > > > > +++ b/arch/x86/mm/init_64.c
> > > > > > > @@ -771,6 +771,16 @@ static void __meminit free_pmd_table(pmd_t
> > > > > > > *pmd_start, pud_t *pud)
> > > > > > > spin_unlock(&init_mm.page_table_lock);
> > > > > > > }
> > > > > > >
> > > > > > > +/*
> > > > > > > + * For 4 levels page table we do not want to free puds but for 5 levels
> > > > > > > + * we should free them. This code also need to change to adapt for boot
> > > > > > > + * time switching between 4 and 5 level.
> > > > > > > + */
> > > > > > > +#if CONFIG_PGTABLE_LEVELS == 4
> > > > > > > +static inline void free_pud_table(pud_t *pud_start, p4d_t *p4d)
> > > > > > > +{
> > > > > > > +}
> > > > > >
> > > > > > Just "if (CONFIG_PGTABLE_LEVELS > 4)" before calling free_pud_table(), but
> > > > > > okay -- I'll rework it anyway for boot-time switching.
> > > > >
> > > > > Err. "if (CONFIG_PGTABLE_LEVELS == 4)" obviously.
> > > >
> > > > You want me to respawn a v3 or is that good enough until you finish
> > > > boot time 5 level page table ?
> > >
> > > It doesn't matter for me. Upto Ingo.
> >
> > Andrew any news on this ? This fix a regression in 4.12 so it would be nice to
> > have this fix or similar in. I can repost a v3 without inline ie directly ifdefing
> > the callsite.
> >
> > Note that Kyrill will rework that but i think this is 4.13 material.
>
> Please don't #ifdef the call site or tweak the inlines - isn't what Kirill
> suggested:
>
> if (CONFIG_PGTABLE_LEVELS == 4)
>
> at the call site enough to fix the bug?

The right solution is if (CONFIG_PGTABLE_LEVELS == 5) at the call site. I will
post a v3 with that instead of the inline #if/#else.
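
A rough user-space sketch of that call-site check follows. It is not the
actual init_64.c hunk: the types are stand-ins, free_pud_table() is a stub
that only counts calls, and remove_p4d_entry_sketch() is a hypothetical name
for the caller.

```c
/* Toy user-space model; stand-in types, not the kernel definitions. */
#ifndef CONFIG_PGTABLE_LEVELS
#define CONFIG_PGTABLE_LEVELS 4	/* assumption: a 4-level build */
#endif

typedef struct { unsigned long val; } pud_t;
typedef struct { unsigned long val; } p4d_t;

static int pud_tables_freed;	/* counts calls to the stub below */

/* Stub: the real function frees the pud page and clears the upper entry. */
static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
{
	(void)pud_start;
	p4d->val = 0;
	pud_tables_freed++;
}

/* Hypothetical caller: only free the pud table with 5-level paging. */
static void remove_p4d_entry_sketch(pud_t *pud_base, p4d_t *p4d)
{
	if (CONFIG_PGTABLE_LEVELS == 5)
		free_pud_table(pud_base, p4d);
}
```

Because the condition is a compile-time constant, the compiler drops the call
on a 4-level build, yet the code still gets type-checked in both
configurations, which is the advantage over an #ifdef at the call site.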

>
> BTW., how can this be a regression, if in v4.12 CONFIG_PGTABLE_LEVELS is always 4?

So in af2cf278ef4f we stopped freeing the pud, and stopped synchronizing
the pgds on hotremove, because synchronizing is pointless if the pud is
never freed. With Kirill's 5 level page table code we need to free the pud
when running with 5 level page tables, but not the p4d. The thing is,
Kirill didn't make free_pud conditional on 5 level page tables. So with
4 level page tables and his patches, which are now in 4.12, the kernel
frees the pud, i.e. clears the pgd entry, and because we no longer
synchronize the pgds this can trigger the BUG_ON() after hotremove, as a
few people have reported so far.

So yes, this is a regression, and yes, people do see it in the not so
common case of a hotremove freeing a pud and a later hotplug trying to
add a new pud for the same kernel virtual address range.
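
The sequence can be shown with a small user-space toy model. Everything in
it is an assumption made for illustration: the toy_* names, the two-pgd
setup and the fixed table size are hypothetical, and toy_sync_global_pgds()
only mimics the consistency check, returning false where the kernel would
hit the BUG_ON().

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PGD_ENTRIES 4	/* toy size, not the real 512 */

struct toy_pgd { unsigned long entry[NR_PGD_ENTRIES]; };

static struct toy_pgd init_pgd;	/* stands in for the reference (init_mm) pgd */
static struct toy_pgd task_pgd;	/* stands in for one process pgd */

/* Hotplug: install a pud page in the reference pgd only. */
static void toy_hotplug(size_t idx, unsigned long pud_page)
{
	init_pgd.entry[idx] = pud_page;
}

/*
 * Hotremove that frees the pud: the reference entry is cleared, but
 * nothing clears the copy in task_pgd, so that copy goes stale.
 */
static void toy_hotremove_freeing_pud(size_t idx)
{
	init_pgd.entry[idx] = 0;
}

/*
 * Propagate a reference entry into task_pgd if it is missing there.
 * Returns false when both entries are populated but differ, which is
 * the situation the kernel treats as a bug.
 */
static bool toy_sync_global_pgds(size_t idx)
{
	if (init_pgd.entry[idx] == 0)
		return true;	/* nothing to propagate */
	if (task_pgd.entry[idx] == 0)
		task_pgd.entry[idx] = init_pgd.entry[idx];
	return task_pgd.entry[idx] == init_pgd.entry[idx];
}
```

Calling toy_hotplug(), toy_sync_global_pgds(), toy_hotremove_freeing_pud(),
then toy_hotplug() again with a different page on the same index makes the
final toy_sync_global_pgds() return false: task_pgd still holds the old pud.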


> For CONFIG_PGTABLE_LEVELS == 5 it won't work - but we don't have
> CONFIG_PGTABLE_LEVELS == 5 upstream yet.

In 4.12 there is already 5 level page table code, and that is what caused
this regression: the pgd entries get out of sync and trigger the BUG_ON().

See Kirill's commit f2a6a7050109e0a5c7a84c70aa6010f682b2f1ee for the guilty patch.

Cheers,
Jérôme