Re: [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

From: Darrick J. Wong
Date: Thu Nov 03 2022 - 22:45:12 EST


On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote:
> On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote:
> > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > > > Converted function to use folios throughout. This is in preparation for
> > > > the removal of find_get_pages_range_tag().
> > > >
> > > > Signed-off-by: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx>
> > > > ---
> > > > mm/page-writeback.c | 44 +++++++++++++++++++++++---------------------
> > > > 1 file changed, 23 insertions(+), 21 deletions(-)
> > > >
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index 032a7bf8d259..087165357a5a 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping,
> > > > int ret = 0;
> > > > int done = 0;
> > > > int error;
> > > > - struct pagevec pvec;
> > > > - int nr_pages;
> > > > + struct folio_batch fbatch;
> > > > + int nr_folios;
> > > > pgoff_t index;
> > > > pgoff_t end; /* Inclusive */
> > > > pgoff_t done_index;
> > > > int range_whole = 0;
> > > > xa_mark_t tag;
> > > >
> > > > - pagevec_init(&pvec);
> > > > + folio_batch_init(&fbatch);
> > > > if (wbc->range_cyclic) {
> > > > index = mapping->writeback_index; /* prev offset */
> > > > end = -1;
> > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping,
> > > > while (!done && (index <= end)) {
> > > > int i;
> > > >
> > > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> > > > - tag);
> > > > - if (nr_pages == 0)
> > > > + nr_folios = filemap_get_folios_tag(mapping, &index, end,
> > > > + tag, &fbatch);
> > >
> > > This can find and return dirty multi-page folios if the filesystem
> > > enables them in the mapping at instantiation time, right?
> >
> > Yup, it will.
> >
> > > > +
> > > > + if (nr_folios == 0)
> > > > break;
> > > >
> > > > - for (i = 0; i < nr_pages; i++) {
> > > > - struct page *page = pvec.pages[i];
> > > > + for (i = 0; i < nr_folios; i++) {
> > > > + struct folio *folio = fbatch.folios[i];
> > > >
> > > > - done_index = page->index;
> > > > + done_index = folio->index;
> > > >
> > > > - lock_page(page);
> > > > + folio_lock(folio);
> > > >
> > > > /*
> > > > * Page truncated or invalidated. We can freely skip it
> > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping,
> > > > * even if there is now a new, dirty page at the same
> > > > * pagecache address.
> > > > */
> > > > - if (unlikely(page->mapping != mapping)) {
> > > > + if (unlikely(folio->mapping != mapping)) {
> > > > continue_unlock:
> > > > - unlock_page(page);
> > > > + folio_unlock(folio);
> > > > continue;
> > > > }
> > > >
> > > > - if (!PageDirty(page)) {
> > > > + if (!folio_test_dirty(folio)) {
> > > > /* someone wrote it for us */
> > > > goto continue_unlock;
> > > > }
> > > >
> > > > - if (PageWriteback(page)) {
> > > > + if (folio_test_writeback(folio)) {
> > > > if (wbc->sync_mode != WB_SYNC_NONE)
> > > > - wait_on_page_writeback(page);
> > > > + folio_wait_writeback(folio);
> > > > else
> > > > goto continue_unlock;
> > > > }
> > > >
> > > > - BUG_ON(PageWriteback(page));
> > > > - if (!clear_page_dirty_for_io(page))
> > > > + BUG_ON(folio_test_writeback(folio));
> > > > + if (!folio_clear_dirty_for_io(folio))
> > > > goto continue_unlock;
> > > >
> > > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > > > - error = (*writepage)(page, wbc, data);
> > > > + error = writepage(&folio->page, wbc, data);
> > >
> > > Yet, IIUC, this treats all folios as if they are single page folios.
> > > i.e. it passes the head page of a multi-page folio to a callback
> > > that will treat it as a single PAGE_SIZE page, because that's all
> > > the writepage callbacks are currently expected to be passed...
> > >
> > > So won't this break writeback of dirty multipage folios?
> >
> > Yes, it appears it would. But it wouldn't because its already 'broken'.
>
> It is? Then why isn't XFS broken on existing kernels? Oh, we don't
> know because it hasn't been tested?
>
> Seriously - if this really is broken, and this patchset further
> propagating the brokeness, then somebody needs to explain to me why
> this is not corrupting data in XFS.

It looks like iomap_do_writepage finds the folio size correctly

end_pos = folio_pos(folio) + folio_size(folio);

and iomap_writpage_map will map out the correct number of blocks

unsigned nblocks = i_blocks_per_folio(inode, folio);

for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {

right? The interface is dangerous because anyone who enables multipage
folios has to be aware that ->writepage can be handed a multipage folio.

(That said, the lack of mention of xfs in the testing plan doesn't give
me much confidence anyone has checked this...)

> I get it that page/folios are in transition, but passing a
> multi-page folio page to an interface that expects a PAGE_SIZE
> struct page is a pretty nasty landmine, regardless of how broken the
> higher level iteration code already might be.
>
> At minimum, it needs to be documented, though I'd much prefer that
> we explicitly duplicate write_cache_pages() as write_cache_folios()
> with a callback that takes a folio and change the code to be fully
> multi-page folio safe. Then filesystems that support folios (and
> large folios) natively can be passed folios without going through
> this crappy "folio->page, page->folio" dance because the writepage
> APIs are unaware of multi-page folio constructs.

Agree. Build the new one, move callers over, and kill the old one.

> Then you can convert the individual filesystems using
> write_cache_pages() to call write_cache_folios() one at a time,
> updating the filesystem callback to do the conversion from folio to
> struct page and checking that it an order-0 page that it has been
> handed....
>
> > The current find_get_pages_range_tag() actually has the exact same
> > issue. The current code to fill up the pages array is:
> >
> > pages[ret] = &folio->page;
> > if (++ret == nr_pages) {
> > *index = folio->index + folio_nr_pages(folio);
> > goto out;
>
> "It's already broken so we can make it more broken" isn't an
> acceptible answer....
>
> > Its not great to leave it 'broken' but its something that isn't - or at
> > least shouldn't be - creating any problems at present. And I believe Matthew
> > has plans to address them at some point before they actually become problems?
>
> You are modifying the interfaces and doing folio conversions that
> expose and propagate the brokenness. The brokeness needs to be
> either avoided or fixed and not propagated further. Doing the above
> write_cache_folios() conversion avoids the propagating the
> brokenness, adds runtime detection of brokenness, and provides the
> right interface for writeback iteration of folios.
>
> Fixing the generic writeback iterator properly is not much extra
> work, and it sets the model for filesytsems that have copy-pasted
> write_cache_pages() and then hacked it around for their own purposes
> (e.g. ext4, btrfs) to follow.
>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx