Re: Odd NFS related SIGBUS (& possible fix)

From: Trond Myklebust
Date: Fri Oct 01 2010 - 14:13:05 EST

Next message: Johannes Hirte: "Re: 2.6.36-rc5 i915 regression"
Previous message: Andy Whitcroft: "overlayfs: BUG in ovl_whiteout"
In reply to: J. Bruce Fields: "Re: Odd NFS related SIGBUS (& possible fix)"
Next in thread: Trond Myklebust: "Re: Odd NFS related SIGBUS (& possible fix)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 2010-09-29 at 14:33 +1000, Benjamin Herrenschmidt wrote:
> Hi Nick, Trond !
>
> I've been tracking a problem on a heavily SMP machine here where running
> LTP "mmapstress01" spawning 64 CPUs with /tmp over NFS causes some of
> the tests to sigbus.
>
> The test itself is a relatively boring mmap+fork hammering test.
>
> What I've tracked down so far is that it seems to SIGBUS due to the
> statement in nfs_vm_page_mkwrite()
>
> mapping = page->mapping;
> if (mapping != dentry->d_inode->i_mapping)
> goto out_unlock;
>
> Which will then hit
>
> return VM_FAULT_SIGBUS;
>
> Now, while I understand the validity of that test if the mapping indeed
> -changed-, in the case I'm hitting it's been merely invalidated.
>
> IE. page->mapping is NULL, as a result of something in NFS deciding to
> go through one of the gazillion code path that invalidate mappings (in
> this case, an mtime change on the server.
>
> Now, I think -this- root cause is bogus and will need some separate
> debugging, but regardless, I don't see why at this stage, page_mkwrite()
> should cause a SIGBUS if the file has changed on the server, since we
> have pushed our our dirty mappings afaik, and so all that tells is is
> that we raced with the cache invalidation while the struct page wasn't
> locked.
>
> So I'm wondering if the right solution shouldn't be to replay the fault
> in that case instead.
>
> Now, I initially thought about returning 0; and hitting the following
> code path in __do_fault() but...
>
> if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
> lock_page(page);
> if (!page->mapping) {
> ret = 0; /* retry the fault */
> unlock_page(page);
> goto unwritable_page;
> }
>
> ... I'm not too happy about it and I'll need Nick insight here. The thing
> is that to hit there, I need to unlock the page first. That means page->mapping
> can change, and thus no longer be NULL by the time we get there, in which case
> it doesn't sound right at all to move on and make the page writable, which
> the code would do. Or am I missing something ?
>
> So my preferred fix, if I'm indeed right and this is a real bug, would be
> to do something in nfs_vm_page_mkwrite() along the lines of:
>
> lock_page(page);
> mapping = page->mapping;
> - if (mapping != dentry->d_inode->i_mapping)
> + if (mapping != dentry->d_inode->i_mapping) {
> + if (!mapping)
> + ret = 0;
> goto out_unlock;
> + }
>
> Or am I missing something ?
>
> Now regarding the other bug, unless Trond has an idea already, I think I'll start
> a separate email thread once I've collected more data. I -think- it invalidates it
> because it sees a the server mtime that is more recent than the inode, but the
> server shouldn't be touching at files, so I suspect we get confused somewhere in
> the kernel and I don't know why yet (the code path inside NFS aren't obvious to
> me at this stage).

Hi Ben,

We do want the code to be robust w.r.t. cache invalidations. In NFS,
those can happen all the time due to the unordered nature of RPC calls
and the piss-poor mtime/ctime resolution on the exported filesystems on
most Linux servers in particular.

e.g. against most Linux servers, if I'm sending out 2 WRITE requests in
parallel that both append data to the file, I can have the 2 replies
come back with the same ctime/mtime, but with different file lengths. If
those replies are processed in a different order on the client than on
the server (not necessarily due to any client error - the server may
simply have sent the replies in the wrong order), then we may end up
with an incorrect estimate of the file length, and so may be forced to
assume that something weird has happened to the file on the server.

However, it looks to me as if the right thing to do when the
page->mapping has changed would be to do the same thing as
block_page_mkwrite(), and just return VM_FAULT_NOPAGE so that the VM can
retry the fault.
IMO: We should only SIGBUS if the calls to nfs_flush_incompatible()
and/or nfs_updatepage() fail.

Cheers
Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Johannes Hirte: "Re: 2.6.36-rc5 i915 regression"
Previous message: Andy Whitcroft: "overlayfs: BUG in ovl_whiteout"
In reply to: J. Bruce Fields: "Re: Odd NFS related SIGBUS (& possible fix)"
Next in thread: Trond Myklebust: "Re: Odd NFS related SIGBUS (& possible fix)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]