Re: NFS and kernel 2.6.x

From: Jamie Lokier
Date: Fri Apr 16 2004 - 13:49:39 EST


Trond Myklebust wrote:
> > Perhaps because 2.6 changes the UDP retransmit model for NFS, to
> > estimate the round-trip time and thus retransmit faster than 2.4
> > would. Sometimes _much_ faster: I observed retransmits within a few
> > hundred microseconds.
>
> Retransmits within a few 100 microsecond should no longer be occurring.
> Have you redone those measurements with a more recent kernel?

No, not since I sent you the packet trace from a 2.5 kernel that
wasn't working with "soft". I took your advice and stopped using
"soft". It causes the obvious problem when I (rarely) turn off the
server, otherwise it's been fine and I'm using 2.6.5 now, still fine
(with "soft" not being used).

> 2.6.x and 2.4.x should have pretty much the same code for RTO
> estimation.
>
> In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one
> difference is that 2.6.x uses zero copy writes.
>
> > There was also a problem with late 2.5 clients and "soft" NFS mounts.
> > Requests would timeout after a fixed number of retransmits, which on a
> > LAN could be after a few milliseconds due to round-trip estimation and
> > fast server response. Then when an I/O on the server took longer,
> > e.g. due to a disk seek or contention, the client would timeout and
> > abort requests. 2.4 doesn't have this problem with "soft" due to the
> > longer, fixed retransmit timeout. I don't know if it is fixed in
> > current 2.6 kernels - but you can avoid it by not using "soft" anyway.
>
> Or changing the default value of "retrans" to something more sane. As
> usual, Linux has a default that is lower than on any other platform.

If few-100-microsecond retransmits no longer occur, perhaps it's no
longer relevant.

The problem I saw with "soft" was that the retransmit time was quite a
good estimate of the server response time. That part was fine, nice
even. But then the server response latency would increase by a factor
of 10000 (ten thousand) due to normal disk I/O activity (compare cache
response with disk response on a busy disk), and of course 3
retransmits doubling each time is not adequate to cover that. 2.4 was
fine because the default rtt and retrans together could never get
shorter than a few seconds.

That's why I felt that iff rtt was adapting to the server response
time, then a fixed number of retransmits was no longer appropriate: a
lower bound on the time before timing out is appropriate, e.g. 3
seconds or 10 seconds or whatever.

In other words, with adaptive rtt the concept of "retrans" being a
fixed number is fundamentally flawed -- unless it's also accompanied
by a minimum timeout time. You'd need a retrans value of 20 or so for
the above perfectly normal LAN situation, but then that's far too
large on other occasions with other networks or servers.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/