Re: 2.0.13 Sockets Stuck on close

Eric Schenk (schenk@cs.toronto.edu)
Wed, 21 Aug 1996 16:11:02 -0400


Christoph Lameter <clameter@fuller.edu> writes:
>schenk>>Does anyone know how to resolve these problems?
>schenk>
>schenk>Not yet, I still haven't got enough information to figure it out,
>schenk>and I can't reproduce it yet. If you can come up with a formula
>schenk>for me to reproduce this, then maybe I can track it down a little faster.
>schenk>Also, if I can get to the point where I can make a guess at what is happening
>schenk>I might be able to give you some code to instrument the kernel and
>schenk>try to help track down the problem from your end.

>The problem is that these things come up sporadically. It's been a couple
>of months now. The issue seems to be timing dependent.
>
>But a socket should never be actually STUCK in CLOSE. There should be a
>timeout right?

Yes, there should be a timeout. The fact that it is losing the timeout
means there is either a bug in the interrupt masking, or we've overlooked
something on the state transition path and forgot to set the timer in
some circumstance or another. I've been looking at this one for a while
now, and I just can't pin it down. It may be related to some other
timing-dependent bugs that I suspect are lurking but don't actually
cause noticeable problems for anyone (other than error messages).
I may have some new ideas on how to track this one down, but I'm not sure yet.
BTW, what does "netstat -not" show for the stuck sockets? Does the output
change from one invocation to the next? Does the stuck socket eventually
disappear, or does it stick around until a reboot?
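
One way to compare successive views of the socket table without reading
netstat output by eye is sketched below. This is only an illustration,
not something taken from the kernel or from netstat: it assumes the
/proc/net/tcp column layout of current Linux kernels (state in the fourth
column, timer in the sixth), and the function names parse_proc_net_tcp and
stuck_candidates are made up for this example.

```python
# Sketch only: assumes the /proc/net/tcp column layout of current Linux kernels.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def parse_proc_net_tcp(text):
    """Return one dict per socket line; the header line is skipped."""
    sockets = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or truncated line
        tx_queue, _, _rx_queue = fields[4].partition(":")
        timer_active, _, timer_when = fields[5].partition(":")
        sockets.append({
            "local": fields[1],                       # hex address:port
            "remote": fields[2],                      # hex address:port
            "state": TCP_STATES.get(fields[3], fields[3]),
            "send_q": int(tx_queue, 16),              # bytes queued to send
            "timer_active": int(timer_active, 16) != 0,
            "timer_when": int(timer_when, 16),        # time left, in clock ticks
        })
    return sockets


def stuck_candidates(before, after, state="CLOSE"):
    """Sockets in `state` with no timer running in both snapshots."""
    def key(sock):
        return (sock["local"], sock["remote"])

    earlier = {key(s): s for s in before if s["state"] == state}
    return [
        s for s in after
        if s["state"] == state
        and not s["timer_active"]
        and key(s) in earlier
        and not earlier[key(s)]["timer_active"]
    ]
```

Reading /proc/net/tcp twice, a minute or so apart, and passing both parsed
snapshots to stuck_candidates() lists the sockets that stayed in CLOSE with
no timer armed in either snapshot.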

>As I have reported earlier: Telnet sessions get slower and slower until
>they come to a standstill. SendQ is showing a couple of kilobytes to
>be transferred. A ping usually gets the session going again. Also
>starting up another telnet session to the machine showing the stalling
>runs at full speed. This is across a PPP Link with a 28.8K Modem between
>two (or three) machines running 2.0.12/13 with the Debian 1.1
>Distribution.

Hmm. Can you take a tcpdump of a telnet session and mark the slowdown
times for me? I can't even begin to guess what is going on here.
It _sounds_ like an MTU mismatch problem, but that should not be
possible with a PPP link. Hmm, while I'm at it, what kind of modem is it?
I've seen at least one USR clone that caused me a similar problem.
In any case, you might want to check how often your modem is retransmitting.
(Your modem should have some kind of self-diagnosis output for the last
session. Check the modem manual for the exact command; it changes from
one brand to the next.)

>It also leads to flaky network behaviour because pppd's sometimes just
>start looping. They are not part of the kernel network code, true.

Can you be more specific about what you mean here? What observable
behavior does pppd exhibit?

>I tried to strace the pppd's but I did not get any output. How can I
>further observe what is going on?

How did you attempt to run the strace? pppd forks just after
starting, so it's a bit difficult to run strace on it directly.
Try attaching strace to the running pppd process after it starts.
(Check the strace manual page for how to do that.)
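
As a rough sketch of the attach approach: the helper names below
(find_pids, strace_attach_command) are hypothetical, and the PID lookup
assumes a Linux /proc that provides a per-process "comm" file. The strace
options used are -p (attach to a running PID), -f (follow children) and
-o (write the trace to a file).

```python
import os


def find_pids(name, proc_root="/proc"):
    """Return PIDs whose /proc/<pid>/comm equals `name` (Linux-specific)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited, or entry not readable
    return sorted(pids)


def strace_attach_command(pid, log_path):
    """Build an strace command line that attaches to a running process."""
    # -p attaches to an existing PID, -f follows children, -o logs to a file
    return ["strace", "-f", "-p", str(pid), "-o", log_path]
```

For example, subprocess.run(strace_attach_command(pid, "pppd.trace")) for
each pid returned by find_pids("pppd") would attach to the daemon after it
has forked. Attaching to another user's process normally requires root.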

>Could you tell me how to gain more information about the situation if it
>happens again? What can I look at except at the "netstat -t"?

Running tcpdump on the ppp interface will give you the traffic
that is actually going over the link. This will be much more useful
for telling what is going on.
It is also possible to run telnet with the "-d" flag, which turns on
socket-level debugging and with it some kernel debugging messages.
These may or may not be useful.
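
To help mark where a session stalls in a saved tcpdump text log, something
like the following could be used. It is a minimal sketch that assumes
tcpdump's default timestamp format (hh:mm:ss.fraction at the start of each
line). The function names and the 2-second default threshold are arbitrary
choices for illustration.

```python
import re

# Assumes tcpdump's default text timestamps (hh:mm:ss.frac at line start).
_TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d+)\s")


def packet_times(lines):
    """Yield seconds since midnight for each line that starts with a timestamp."""
    for line in lines:
        match = _TIMESTAMP.match(line)
        if match:
            hours, minutes, seconds, frac = match.groups()
            yield (int(hours) * 3600 + int(minutes) * 60
                   + int(seconds) + float("0." + frac))


def find_stalls(lines, min_gap=2.0):
    """Return (start, end, gap) tuples where packets are >= min_gap seconds apart."""
    stalls = []
    previous = None
    for current in packet_times(lines):
        if previous is not None:
            gap = current - previous
            if gap < 0:
                gap += 86400  # capture crossed midnight
            if gap >= min_gap:
                stalls.append((previous, current, gap))
        previous = current
    return stalls
```

Feeding it the lines of a capture taken on the ppp interface gives the
quiet periods, which can then be lined up against the times the telnet
session appeared to hang.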

-- eric

---------------------------------------------------------------------------
Eric Schenk                               www: http://www.cs.toronto.edu/~schenk
Department of Computer Science          email: schenk@cs.toronto.edu
University of Toronto