2.6.24-rc6-mm1 - git-lblnet.patch and networking horkage

From: Valdis . Kletnieks
Date: Wed Dec 26 2007 - 00:44:28 EST


On Sat, 22 Dec 2007 23:30:56 PST, Andrew Morton said:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/

I've bisected it down this far:

kvm-ist-kaput.patch GOOD
git-lblnet.patch
git-lblnet-fixup.patch
git-leds.patch
git-libata-all.patch
git-libata-all-fix-pata_winbond-borkage.patch
git-libata-all-wtf.patch BAD

and somehow, I doubt the leds or libata trees horked up networking. ;)

Symptoms - semi-sporadic failures in making network connections. The test
case that tripped it up was the 'make test' from Tcl 8.5 - several of the
test cases create a listening socket, and then try to connect to it.
Under 2.6.24-rc5-mm1 it works just fine, but I'm seeing hangs under -rc6-mm1.
Doing a 'netstat -n -a -A inet -p' while it's hung shows me this:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:34118         0.0.0.0:*               LISTEN      2236/tcltest
tcp        0      1 127.0.0.1:59460         127.0.0.1:34118         SYN_SENT    2236/tcltest

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:47842         0.0.0.0:*               LISTEN      2352/tcltest
tcp        0      1 127.0.0.1:46510         127.0.0.1:47842         SYN_SENT    2352/tcltest

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:47842         0.0.0.0:*               LISTEN      2352/tcltest
tcp        0      1 127.0.0.1:46510         127.0.0.1:47842         SYN_SENT    2352/tcltest

Pretty consistent failure mode - a socket is in 'LISTEN', and the connection
to it gets hung in 'SYN_SENT'. There are 3 outputs listed - the first is from
one run of the test case, and the other 2 were taken some 20 seconds apart
during a single separate run (note the different PID).
It's pretty obvious that if you can't complete a 3-packet handshake to loopback
in 20 seconds, something is hosed. However, it's apparently some sort of
race/timing issue, as many *other* test cases in the Tcl test tree do in fact
work OK.
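
For anybody who wants to poke at this without building Tcl, the failing
pattern can be approximated with a few lines of Python. This is only a rough
sketch of what the tests do (listen on loopback, then connect to that port),
not the Tcl test itself; the function names, the timeout and the attempt
count are all made up for illustration, and I can't promise it hits the same
race:

```python
import socket


def loopback_handshake(timeout=5.0):
    """Return True if a TCP connect to a fresh loopback listener completes.

    Hypothetical helper: the name and the timeout value are illustrative.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the kernel pick a free ephemeral port.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]

        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            # The kernel completes the handshake on behalf of the listener,
            # so no accept() call is needed for connect() to return.
            client.connect(("127.0.0.1", port))
            return True
        except socket.timeout:
            # connect() did not finish in time, i.e. the client socket was
            # presumably still sitting in SYN_SENT.
            return False
        finally:
            client.close()
    finally:
        listener.close()


def count_failures(attempts=100, timeout=5.0):
    """Repeat the check, since the hang is only semi-sporadic."""
    return sum(1 for _ in range(attempts) if not loopback_handshake(timeout))
```

On a kernel without the problem, count_failures() should come back as 0
almost immediately; anything non-zero would point at the same sort of stall.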

I already checked - it's not a slam-dunk to just 'patch -R' git-lblnet.patch,
as there are 3 or 4 conflicts where later patches need massaging/reverting
as well.

The problem shows up with both 'classic RCU' and 'preempt RCU' (RCU was my
*first* guess as to the cause).

Any clues/hints/advice/patches?
