Re: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

From: Chuck Lever
Date: Fri Dec 08 2023 - 10:02:15 EST


On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> Calling fput() directly or through filp_close() from a kernel thread like
> nfsd causes the final __fput() (if necessary) to be called from a
> workqueue. This means that nfsd is not forced to wait for any work to
> complete. If the ->release or ->destroy_inode function is slow for any
> reason, this can result in nfsd closing files more quickly than the
> workqueue can complete the close, and the queue of pending closes can
> grow without bound (30 million has been seen at one customer site,
> though this was in part due to a slowness in xfs which has since been
> fixed).
>
> nfsd does not need this.

That is technically true, but IIUC, there is only one case where a
synchronous close matters for the backlog problem, and that's when
nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
call sites (except rename) are error paths, so there are no negative
consequences from the lack of a synchronous wait there...

Consider at least breaking this patch into pieces so that there is
one call site changed per patch:

- If a performance regression occurs, a bisect can point out the
specific call site replacement that triggered it

- There is an opportunity to provide rationale for each call site,
because it seems to me there are two or three distinct
motivations across this patch, not all of which apply to each
replacement. More specific (perhaps naive) questions below.

- It's more surgical to drop one or two of these smaller patches
if we find a problem or determine that a particular change is
unneeded.

Also, it would be convenient if this patch (series) introduced an
nfsd_close() helper that then called __fput_sync() (rough sketch
below):

- The kdoc for that function could explain why __fput_sync() is
preferred and safe for nfsd

- When troubleshooting, function boundary tracing could use this
utility function to easily filter (sometimes costly) calls to
__fput_sync() from nfsd and ignore other calls to __fput_sync()
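
Something like this rough sketch is what I have in mind (nfsd_close()
is only a suggested name, and the kdoc wording is mine, not something
in the patch):

    /**
     * nfsd_close - close a struct file opened by nfsd
     * @file: file to close
     *
     * nfsd runs in kernel threads, so a plain fput() would defer the
     * final __fput() to the delayed-fput workqueue, where a backlog
     * of pending closes can build up. Nothing waits on nfsd to
     * complete a close, so calling __fput_sync() here is safe and
     * keeps the close work in nfsd's own context.
     */
    static void nfsd_close(struct file *file)
    {
            __fput_sync(file);
    }

For the tracing use above it would probably need to be noinline so it
survives as its own symbol.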

But if it turns out only one or two fput() call sites need to be
replaced, an nfsd_close() utility doesn't make sense.


I'm trying to think of a benchmark workload that I can use to
exercise this patch series. We've used fstests generic/531 in the
past for similar issues. Any others to consider?


> It is quite appropriate and safe for nfsd to do its own close work.
> There is no reason that close should ever wait for nfsd, so no
> deadlock can occur.
>
> So change all fput() calls to __fput_sync(), and convert filp_close() to
> the sequence get_file();filp_close();__fput_sync().
>
> This ensures that no fput work is queued to the workqueue.
>
> Note that this removes the only in-module use of flush_delayed_fput().
>
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
> ---
> fs/nfsd/filecache.c | 3 ++-
> fs/nfsd/lockd.c | 2 +-
> fs/nfsd/nfs4proc.c | 4 ++--
> fs/nfsd/nfs4recover.c | 2 +-
> fs/nfsd/vfs.c | 12 ++++++------
> 5 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index ef063f93fde9..e9734c7451b5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -283,7 +283,9 @@ nfsd_file_free(struct nfsd_file *nf)
> nfsd_file_mark_put(nf->nf_mark);
> if (nf->nf_file) {
> nfsd_file_check_write_error(nf);
> + get_file(nf->nf_file);
> filp_close(nf->nf_file, NULL);
> + __fput_sync(nf->nf_file);
> }

This effectively reverts 4c475eee0237 ("nfsd: don't fsync nfsd_files
on last close")... Won't this cause a regression? Jeff?

And this is really the only place where a flushing close can
produce back pressure on clients, isn't it?
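
Just to confirm my reading of the new sequence in nfsd_file_free()
(the annotations are my own understanding of the refcounting, not
part of the patch):

    /* Take an extra reference so filp_close()'s internal fput()
     * cannot be the one that drops the last reference. */
    get_file(nf->nf_file);

    /* ->flush(), drop posix locks, then fput() (not the final one). */
    filp_close(nf->nf_file, NULL);

    /* Drop our reference; if it is the last, __fput() runs here in
     * nfsd's context instead of on the delayed-fput workqueue. */
    __fput_sync(nf->nf_file);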

>
> /*
> @@ -631,7 +633,6 @@ nfsd_file_close_inode_sync(struct inode *inode)
> list_del_init(&nf->nf_lru);
> nfsd_file_free(nf);
> }
> - flush_delayed_fput();
> }
>
> /**
> diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
> index 46a7f9b813e5..f9d1059096a4 100644
> --- a/fs/nfsd/lockd.c
> +++ b/fs/nfsd/lockd.c
> @@ -60,7 +60,7 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp,
> static void
> nlm_fclose(struct file *filp)
> {
> - fput(filp);
> + __fput_sync(filp);
> }

Will lock file descriptors have dirty data? This function is called
from nlm_traverse_files(), which is looping over all files in a
table, and it's called while a global mutex is held. Any data
showing this won't have a scalability impact?


> static const struct nlmsvc_binding nfsd_nlm_ops = {
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 6f2d4aa4970d..20d60823d530 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -629,7 +629,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> nn->somebody_reclaimed = true;
> out:
> if (open->op_filp) {
> - fput(open->op_filp);
> + __fput_sync(open->op_filp);
> open->op_filp = NULL;
> }

Isn't this just discarding a file descriptor that wasn't needed?
What's the need for a flushing close here?


> if (resfh && resfh != &cstate->current_fh) {
> @@ -1546,7 +1546,7 @@ nfsd4_cleanup_inter_ssc(struct nfsd4_ssc_umount_item *nsui, struct file *filp,
> long timeout = msecs_to_jiffies(nfsd4_ssc_umount_timeout);
>
> nfs42_ssc_close(filp);
> - fput(filp);
> + __fput_sync(filp);
>
> spin_lock(&nn->nfsd_ssc_lock);
> list_del(&nsui->nsui_list);
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 3509e73abe1f..f8f0112fd9f5 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -561,7 +561,7 @@ nfsd4_shutdown_recdir(struct net *net)
>
> if (!nn->rec_file)
> return;
> - fput(nn->rec_file);
> + __fput_sync(nn->rec_file);
> nn->rec_file = NULL;
> }

What's the justification for a flushing close in this path?


>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index fbbea7498f02..15a811229211 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -879,7 +879,7 @@ __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
>
> host_err = ima_file_check(file, may_flags);
> if (host_err) {
> - fput(file);
> + __fput_sync(file);
> goto out;
> }

AFAICT __nfsd_open is used only for creating a file descriptor for
either an NLM request or for handling a readdir. In fact, I'm not
even sure why there is an ima_file_check() call site here. IMO
there doesn't need to be a flushing close in this error flow.


> @@ -1884,10 +1884,10 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
> fh_drop_write(ffhp);
>
> /*
> - * If the target dentry has cached open files, then we need to try to
> - * close them prior to doing the rename. Flushing delayed fput
> - * shouldn't be done with locks held however, so we delay it until this
> - * point and then reattempt the whole shebang.
> + * If the target dentry has cached open files, then we need to
> + * try to close them prior to doing the rename. Final fput
> + * shouldn't be done with locks held however, so we delay it
> + * until this point and then reattempt the whole shebang.
> */
> if (close_cached) {
> close_cached = false;
> @@ -2141,7 +2141,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
> if (err == nfserr_eof || err == nfserr_toosmall)
> err = nfs_ok; /* can still be found in ->err */
> out_close:
> - fput(file);
> + __fput_sync(file);

Do we expect a directory file descriptor to need a flushing close?


> out:
> return err;
> }
> --
> 2.43.0
>
>

--
Chuck Lever