Re: [PATCH] drop_caches: Allow unmapping pages

From: Matthew Wilcox
Date: Mon Jan 07 2019 - 14:37:24 EST


On Mon, Jan 07, 2019 at 11:25:16AM -0800, Dave Hansen wrote:
> On 1/7/19 6:15 AM, Matthew Wilcox wrote:
> > You're going to get data corruption doing this. try_to_unmap_one()
> > does:
> >
> > /* Move the dirty bit to the page. Now the pte is gone. */
> > if (pte_dirty(pteval))
> > set_page_dirty(page);
> >
> > so PageDirty() can be false above, but made true by calling
> > try_to_unmap().
>
> I don't think that PageDirty() check is _required_ for correctness. You
> can always safely try_to_unmap() no matter the state of the PTE. We
> can't lock out the hardware from setting the Dirty bit, ever.
>
> It's also just fine to unmap PageDirty() pages, as long as when the PTE
> is created, we move the dirty bit from the PTE into the 'struct page'
> (which try_to_unmap() does, as you noticed).

Right, but the very next thing the patch does is call
invalidate_complete_page(), which calls __remove_mapping() which ... oh,
re-checks PageDirty() and refuses to drop the page. So this isn't the
data corruptor that I had thought it was.

> > I also think the way you've done this is expedient at the cost of
> > efficiency and layering violations. I think you should first tear
> > down the mappings of userspace processes (which will reclaim a lot
> > of pages allocated to page tables), then you won't need to touch the
> > invalidate_inode_pages paths at all.
>
> By "tear down the mappings", do you mean something analogous to munmap()
> where the VMA goes away? Or madvise(MADV_DONTNEED) where the PTE is
> destroyed but the VMA remains?
>
> Last time I checked, we only did free_pgtables() when tearing down VMAs,
> but not for pure unmappings like reclaim or MADV_DONTNEED. I've thought
> it might be fun to make a shrinker that scanned page tables looking for
> zero'd pages, but I've never run into a system where empty page table
> pages were actually causing a noticeable problem.

A few hours ago when I thought this patch had the ordering of checking
PageDirty() the wrong way round, I had the madvise analogy in mind so
that the PTEs would get destroyed and the dirty information transferred
to the struct page first before trying to drop pages.