Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag

From: Hugh Dickins
Date: Wed Apr 19 2023 - 00:37:53 EST


On Tue, 4 Apr 2023, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@xxxxxxxxxxxx>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> >
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that would the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > an fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <stevensd@xxxxxxxxxxxx>
...
>
> Personally I don't see anything wrong with this change to resolve the dead
> lock. E.g. fast gup race right before unmapping the pgtables seems fine,
> since we'll just bail out with >3 refcounts (or fast-gup bails out by
> checking pte changes). Either way looks fine here.
>
> So far it looks good to me, but that may not mean much per the history on
> what I can overlook. It'll be always good to hear from Hugh and others.

I'm uneasy about it, and haven't let it sink in for long enough: but
haven't spotted anything wrong with it, nor experienced any trouble.

I would have much preferred David to stick with the current scheme, and
fix up seek_hole_data, and be less concerned with the mincore transients:
this patch makes a significant change that is difficult to be sure of.

I was dubious about the unfrozen "page_count(page) != 3" check (where
another task can grab a reference an instant later), but perhaps it
does serve a purpose, since we hold the page lock there: excludes
concurrent shmem reads which grab but drop page lock before copying
(though it's not clear that those do actually need excluding).

I had thought shmem was peculiar in relying on page lock while writing,
but turn out to be quite wrong about that: most filesystems rely on
page lock while writing, though I'm not sure whether that's true of
all (and it doesn't matter while collapse of non-shmem file is only
permitted on read-only).

We shall see.

Hugh