Re: [PATCH/RFC] NFS: add nostatflush mount option.

From: Trond Myklebust
Date: Thu Dec 21 2017 - 22:17:51 EST


On Fri, 2017-12-22 at 09:35 +1100, NeilBrown wrote:
> On Thu, Dec 21 2017, Trond Myklebust wrote:
>
> > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> > > On Thu, Dec 21 2017, Trond Myklebust wrote:
> > >
> > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > > > Hi Neil-
> > > > >
> > > > >
> > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@xxxxxxxx>
> > > > > > wrote:
> > > > > >
> > > > > >
> > > > > > When an i_op->getattr() call is made on an NFS file
> > > > > > (typically from a 'stat' family system call), NFS
> > > > > > will first flush any dirty data to the server.
> > > > > >
> > > > > > This ensures that the mtime reported is correct and stable,
> > > > > > but has a performance penalty. 'stat' is normally thought
> > > > > > to be a quick operation, and imposing this cost can be
> > > > > > surprising.
> > > > >
> > > > > To be clear, this behavior is a POSIX requirement.
> > > > >
> > > > >
> > > > > > I have seen problems when one process is writing a large
> > > > > > file and another process performs "ls -l" on the containing
> > > > > > directory and is blocked for as long as it take to flush
> > > > > > all the dirty data to the server, which can be minutes.
> > > > >
> > > > > Yes, a well-known annoyance that cannot be addressed
> > > > > even with a write delegation.
> > > > >
> > > > >
> > > > > > I have also seen a legacy application which frequently
> > > > > > calls
> > > > > > "fstat" on a file that it is writing to. On a local
> > > > > > filesystem (and in the Solaris implementation of NFS) this
> > > > > > fstat call is cheap. On Linux/NFS, the causes a noticeable
> > > > > > decrease in throughput.
> > > > >
> > > > > If the preceding write is small, Linux could be using
> > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > > >
> > > > >
> > > > > > The only circumstances where an application calling
> > > > > > 'stat()'
> > > > > > might get an mtime which is not stable are times when some
> > > > > > other process is writing to the file and the two processes
> > > > > > are not using locking to ensure consistency, or when the
> > > > > > one
> > > > > > process is both writing and stating. In neither of these
> > > > > > cases is it reasonable to expect the mtime to be stable.
> > > > >
> > > > > I'm not convinced this is a strong enough rationale
> > > > > for claiming it is safe to disable the existing
> > > > > behavior.
> > > > >
> > > > > You've explained cases where the new behavior is
> > > > > reasonable, but do you have any examples where the
> > > > > new behavior would be a problem? There must be a
> > > > > reason why POSIX explicitly requires an up-to-date
> > > > > mtime.
> > > > >
> > > > > What guidance would nfs(5) give on when it is safe
> > > > > to specify the new mount option?
> > > > >
> > > > >
> > > > > > In the most common cases where mtime is important
> > > > > > (e.g. make), no other process has the file open, so there
> > > > > > will be no dirty data and the mtime will be stable.
> > > > >
> > > > > Isn't it also the case that make is a multi-process
> > > > > workload where one process modifies a file, then
> > > > > closes it (which triggers a flush), and then another
> > > > > process stats the file? The new mount option does
> > > > > not change the behavior of close(2), does it?
> > > > >
> > > > >
> > > > > > Rather than unilaterally changing this behavior of 'stat',
> > > > > > this patch adds a "nosyncflush" mount option to allow
> > > > > > sysadmins to have applications which are hurt by the
> > > > > > current
> > > > > > behavior to disable it.
> > > > >
> > > > > IMO a mount option is at the wrong granularity. A
> > > > > mount point will be shared between applications that
> > > > > can tolerate the non-POSIX behavior and those that
> > > > > cannot, for instance.
> > > >
> > > > Agreed.
> > > >
> > > > The other thing to note here is that we now have an embryonic
> > > > statx()
> > > > system call, which allows the application itself to decide
> > > > whether
> > > > or
> > > > not it needs up to date values for the atime/ctime/mtime. While
> > > > we
> > > > haven't yet plumbed in the NFS side, the intention was always
> > > > to
> > > > use
> > > > that information to turn off the writeback flushing when
> > > > possible.
> > >
> > > Yes, if statx() were actually working, we could change the
> > > application
> > > to avoid the flush. But then if changing the application were an
> > > option, I suspect that - for my current customer issue - we could
> > > just
> > > remove the fstat() calls. I doubt they are really necessary.
> > > I think programmers often think of stat() (and particularly
> > > fstat())
> > > as
> > > fairly cheap and so they use it whenever convenient. Only NFS
> > > violates
> > > this expectation.
> > >
> > > Also statx() is only a real solution if/when it gets widely
> > > used. Will
> > > "ls -l" default to AT_STATX_DONT_SYNC ??
> > >
> > > Apart from the Posix requirement (which only requires that the
> > > timestamps be updated, not that the data be flushed), do you know
> > > of
> > > any
> > > value gained from flushing data before stat()?
> > >
> >
> > POSIX requires that timestamps change as part of the read() or
> > write()
> > system call.
>
> Does it require that they don't change at other times?

Yes.

> I see the (arguable) deviation from POSIX that I propose to be
> completely inline with the close-to-open consistency semantics that
> NFS
> already uses, which are weaker than what POSIX might suggest.
>
> As you didn't exactly answer the question, would it be fair to say
> that
> you don't know of any reason to flush-before-stat except that POSIX
> seems to require it?

The reason is to emulate POSIX semantics. That is the only reason, and
that is why when we added a statx() function, I asked that we allow the
application to specify that it doesn't need these POSIX semantics.


--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx

Attachment: signature.asc
Description: This is a digitally signed message part