Re: [PATCH 0/5] MADV_FREE refactoring and fix KSM page

From: Minchan Kim
Date: Wed Oct 21 2015 - 01:11:51 EST


On Wed, Oct 21, 2015 at 01:43:53AM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 20, 2015 at 02:36:51PM -0700, Andrew Morton wrote:
> > On Tue, 20 Oct 2015 16:21:09 +0900 Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > >
> > > I reviewed THP refcount redesign patch and It seems below patch fixes
> > > MADV_FREE problem. It works well for hours.
> > >
> > > >From 104a0940b4c0f97e61de9fee0fd602926ff28312 Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <minchan@xxxxxxxxxx>
> > > Date: Tue, 20 Oct 2015 16:00:52 +0900
> > > Subject: [PATCH] mm: mark head page dirty in split_huge_page
> > >
> > > In thp split in old THP refcount, we mappped all of pages
> > > (ie, head + tails) to pte_mkdirty and mark PG_flags to every
> > > tail pages.
> > >
> > > But with THP refcount redesign, we can lose dirty bit in page table
> > > and PG_dirty for head page if we want to free the THP page using
> > > migration_entry.
> > >
> > > It ends up discarding head page by madvise_free suddenly.
> > > This patch fixes it by mark the head page PG_dirty when VM splits
> > > the THP page.
> > >
> > > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > > ---
> > > mm/huge_memory.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index adccfb48ce57..7fbbd42554a1 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3258,6 +3258,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
> > > atomic_sub(tail_mapcount, &head->_count);
> > >
> > > ClearPageCompound(head);
> > > + SetPageDirty(head);
> > > spin_unlock_irq(&zone->lru_lock);
> > >
> > > unfreeze_page(page_anon_vma(head), head);
>
> Sorry, I've missed the email at first.
>
> > This appears to be a bugfix against Kirill's "thp: reintroduce
> > split_huge_page()"?
> >
> > Yes, __split_huge_page() is marking the tail pages dirty but forgot
> > about the head page
> >
> > You say "we can lose dirty bit in page table" but I don't see how the
> > above patch fixes that?
>
> I think the problem is in unfreeze_page_vma(), where I missed dirtying
> pte.
>
> > Why does __split_huge_page() unconditionally mark the pages dirty, btw?
> > Is it because the THP page was known to be dirty?
>
> THP doesn't have backing storage and cannot be swapped out without
> splitting, therefore always dirty. (huge zero page is exception, I guess).

It's right until now but I think we need more(e.g. is_dirty_migration_entry,
make_migration_entry(struct page *page, int write, int dirty) in terms of
MADV_FREE to keep dirty bit of pte rather than making pages dirty
unconditionally.

For example, we could call madvise_free to THP page so madvise_free clears
dirty bit of pmd without split THP pages(ie, lazy split, maybe you suggest
it, thanks!) instantly. Then, when VM tries to reclaim the THP page and
splits it, every page will be marked PG_dirty or pte_mkdirty even if
there is no write ever since then so madvise_free can never discard it
although we could.

Anyway it shouldn't be party-pooper. It could be enhanced and I will check
it.


>
> > If so, the head page already had PG_dirty, so this patch doesn't do
> > anything.
>
> PG_dirty appears on struct page as result of transferring from dirty bit
> in page tables. There's no guarantee that it's happened.
>
> > freeze_page(), unfreeze_page() and their callees desperately need some
> > description of what they're doing. Kirill, could you cook somethnig up
> > please?
>
> Minchan, could you test patch below instead?

I think it will definitely work and more right fix than mine because
it covers split_huge_page_to_list's error path(ie,

unfreeze_page(anon_vma, head);
ret = -EBUSY;
}


I will queue it to test machine.

..
Zzzz
..

After 2 hours, I don't see any problemso far but I have a question below.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86924cc34bac..ea1f3805afa3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3115,7 +3115,7 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
>
> entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
> if (is_write_migration_entry(swp_entry))
> - entry = maybe_mkwrite(entry, vma);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);

Why should we do pte_mkdiry only if is_write_migration_entry is true?
Doesn't it lose a dirty bit again if someone changes protection
from RW to R?

>
> flush_dcache_page(page);
> set_pte_at(vma->vm_mm, address, pte + i, entry);
> --
> Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/