Re: strange sparc64 -> i586 intermittent but reproducible NFS write errors to one and only one fs

From: Denis Vlasenko (vda@port.imtp.ilyichevsk.odessa.ua)
Date: Fri Jan 10 2003 - 01:57:52 EST


On 10 January 2003 00:56, Nix wrote:
> When I rebooted my systems into 2.4.20 (from 2.4.19), I started
> seeing EIO write() errors to files in my ext3 home directory
> (NFS-mounted, exported async).
>
> So I knocked up a test program (included below) to try to track the
> failing writes down, and got more confused.
>
> The properties of the failing writes that I've been able to determine
> thus far are as follows; look out, they're weird as hell:
>
> - the failures are definitely from write(), not open().
>
> - writes from sparc64 to one filesystem, and only one filesystem, on
> i586, both running 2.4.20, UDP NFSv3; rquotad and quotas are on,
> but I am well within my quota. (quota 3.06, nfs-utils 1.0). Writes to
> other filesystems on the same machine, even if they too are using
> ext3, even if they too have user quotas for the same user.
>
> What differs between filesystems that work and the one that fails
> I can't tell; other FSen *on the same block device* work... (the
> block device is an un-RAIDed SCSI disk.)
>
> - local writes to the same filesystem, with the same test program,
> never fail.
>
> - writes from another IA32 box (all these boxes are near-clones of
> each other as far as software is concerned) to the NFS server box
> never fail.
>
> - It happens if I mount the fs with -o soft (my default for all NFS
> mounts for robustness-in-the-presence-of-machine-failure reasons),
> but also if I mount with -o hard :(( besides, the timeouts happen
> far too fast for it to be major timeout expiry that casues the EIOs.
>
> - The failure always occurs for writes that cross the 2^21 byte
> boundary, but not all such writes fail. You seem to need to have
> done a lot of write()s before, perhaps even starting with O_TRUNC
> and write()ing like mad from there on up (the WRITES_PER_OPEN
> #define is a way to test that; I've never had a failure for a file
> opened with O_APPEND, even if it crossed the 2^21 byte boundary).
>
> - It happens whether _LARGEFILE_SOURCE / _FILE_OFFSET_BITS are
> defined or not (I'd be amazed if this affected it, actually, but
> it never hurts to check).
>
> - Despite the EIO, the write actually *succeeds* most of the time
> (perhaps not all the time; again, I'm not sure yet). In fact...
>
> - It is quite thoroughly inconsistent. If you #define REPRODUCE to 1
> in the test program and fill out sizes_to_reproduce[] with a set
> of write() sizes that have caused the error in the past, the error
> happens again, but not always:

This beast is most probably Sparc64 or 64-bit arch specific.

Try to pin down the first 2.4.20-preN where it appears.
Then inform NFS and Sparc64 folks.

If you won't get any useful hint by then, you can continue playing
patch-o-rama by reading changelog and slowly mutating last working
2.4.20-preN into first non-working pre.

--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Jan 15 2003 - 22:00:31 EST