Re: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

From: Dave Chinner
Date: Fri Dec 15 2023 - 20:52:50 EST


On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> On Sat, 09 Dec 2023, Chuck Lever wrote:
> > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > Calling fput() directly or through filp_close() from a kernel thread like
> > > nfsd causes the final __fput() (if necessary) to be called from a
> > > workqueue. This means that nfsd is not forced to wait for any work to
> > > complete. If the ->release or ->destroy_inode function is slow for any
> > > reason, this can result in nfsd closing files more quickly than the
> > > workqueue can complete the close and the queue of pending closes can
> > > grow without bounds (30 million has been seen at one customer site,
> > > though this was in part due to a slowness in xfs which has since been
> > > fixed).
> > >
> > > nfsd does not need this.
> >
> > That is technically true, but IIUC, there is only one case where a
> > synchronous close matters for the backlog problem, and that's when
> > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > call sites (except rename) are error paths, so there aren't negative
> > consequences for the lack of synchronous wait there...
>
> What you say is technically true but it isn't the way I see it.
>
> Firstly I should clarify that __fput_sync() is *not* a flushing close as
> you describe it below.
> All it does, apart from some trivial book-keeping, is to call ->release
> and possibly ->destroy_inode immediately rather than shunting them off
> to another thread.
> Apparently ->release sometimes does something that can deadlock with
> some kernel threads or if some awkward locks are held, so the whole
> final __fput is delayed by default. But this does not apply to nfsd.
> Standard fput() is really the wrong interface for nfsd to use.
> It should use __fput_sync() (which shouldn't have such a scary name).
>
> The comment above flush_delayed_fput() seems to suggest that unmounting
> is a core issue. Maybe the fact that __fput() can call
> dissolve_on_fput() is a reason why it is sometimes safer to leave the
> work to later. But I don't see that applying to nfsd.
>
> Of course a ->release function *could* do synchronous writes just like
> the XFS ->destroy_inode function used to do synchronous reads.

What do you mean "could"? The correct word is "does".

> I don't think we should ever try to hide that by putting it in
> a workqueue. It's probably a bug and it is best if bugs are visible.

Most definitely *not* a bug.

XFS, ext4 and btrfs all call filemap_flush() from their respective
->release methods. This is required to protect user data against
loss caused by poorly written applications that overwrite user data
in an unsafe manner (i.e. the open-truncate-write-close overwrite
anti-pattern).
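
For anyone unfamiliar with the pattern, here is a minimal userspace
sketch of the unsafe overwrite next to a commonly recommended safer
alternative (write a temporary file, fsync it, rename it over the
original). The function names and the ".tmp" suffix are hypothetical,
chosen only for illustration, and short writes are simply treated as
errors:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Unsafe: O_TRUNC destroys the only copy of the old data before the new
 * data is on stable storage. A crash before writeback can leave an empty
 * or partial file. This is the case the flush in ->release tries to
 * narrow the window for.
 */
static int overwrite_unsafe(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return close(fd);	/* no fsync: data may only be in the page cache */
}

/*
 * Safer: the old file stays intact until the new contents have been
 * fsync'd, then rename() swaps them. A fully durable version would also
 * fsync the parent directory, omitted here for brevity.
 */
static int overwrite_safe(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp))
		return -1;
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) != 0 || rename(tmp, path) != 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

Applications that use the second form do not depend on the filesystem
flushing at close time; the first form is what the flush is there to
protect.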

The btrfs flush trigger is very similar to XFS:

	/*
	 * Set by setattr when we are about to truncate a file from a non-zero
	 * size to a zero size. This tries to flush down new bytes that may
	 * have been written if the application were using truncate to replace
	 * a file in place.
	 */
	if (test_and_clear_bit(BTRFS_INODE_FLUSH_ON_CLOSE,
			       &BTRFS_I(inode)->runtime_flags))
		filemap_flush(inode->i_mapping);

XFS started doing this in 2006, ext4 in 2008, and git will tell you
when btrfs picked this up, too. IOWs, we've been doing writeback
from ->release for a *very long time*.

> Note that the XFS ->release function does call filemap_flush() in some
> cases, but that is an async flush, so __fput_sync doesn't wait for the
> flush to complete.

"async flush" does not mean it will not block for long periods of
time, it just means it won't wait for *all* the IO to complete.
i.e. if the async flush saturates the device, bio submission will
wait for previously submitted IO, including the flush's own IO, to
complete before it can continue flushing the data.
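
To make the blocking concrete, here is a toy userspace model, not
kernel code. It assumes a device queue with a fixed depth and a
submitter that must wait for one completion whenever the queue is
full. The function name, the single-queue model and the "tick"
accounting are all assumptions made for illustration:

```c
/*
 * Toy model of an "async" flush: submission never waits for *all* IO,
 * but it does wait for one completion each time the queue is full.
 *
 * nr_bios:     how many bios the flush wants to submit
 * queue_depth: how many bios the device accepts in flight (min 1)
 * inflight:    bios already queued by someone else when the flush starts
 *
 * Returns how many times the submitter had to block on a completion.
 */
static unsigned long async_flush_waits(unsigned long nr_bios,
				       unsigned long queue_depth,
				       unsigned long inflight)
{
	unsigned long waited = 0;

	if (queue_depth == 0)
		queue_depth = 1;	/* a device accepts at least one IO */

	while (nr_bios > 0) {
		if (inflight >= queue_depth) {
			/* Queue full: block until one earlier IO completes. */
			inflight--;
			waited++;
		}
		inflight++;	/* submit one bio */
		nr_bios--;
	}
	return waited;
}
```

In this model a small flush into an idle queue never blocks, while a
flush of 100 bios into a queue of depth 8 blocks 92 times, even though
nothing ever waited for the whole flush to finish.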

But wait, it gets better!

XFS, btrfs and ext4 all implement delayed allocation, which means
writeback often needs to run extent allocation transactions. In
these cases, transaction reservation can block on metadata writeback
to free up journal space. In the case of XFS, this could be tens of
thousands of metadata IOs needing to be submitted and completed!

Then consider that extent allocation needs to search for free space
which may need to read in metadata. i.e. extent allocation will end
up submitting and waiting on synchronous read IO. Also, reading that
metadata requires memory allocation for the buffers that will store
it - memory allocation can also block on IO and other subsystems to
free up memory.

Even less obvious are the stack usage issues that calling ->release
from arbitrary code entails. The filesystem writeback stack is -deep-.

Remember all the problems we used to have with ->writepage() being
called from direct memory reclaim and so putting the writeback path
at arbitrary depths in the stack and then running out of stack
space? We really don't want to go back to the bad old days where
filesystem write paths can be entered from code that has already
consumed most of the stack space....

Hence, IMO, __fput_sync() is something that needs to be very
carefully controlled and should have big scary warnings on it. We
really don't want it to be called from just anywhere...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx