Two network problems

David desJardins (desj@google.com)
Mon, 29 Nov 1999 16:59:13 -0800


Google has had two more kernel-related problems recently. Both of these
have been crashing our webservers to the point that we have to reboot
them. This can be a big problem for us---we don't have very many
webservers, and our architecture relies on them staying up.

I can send out more Google t-shirts to anyone who can help. (I'm still
working the last responses, regarding our kernel oops messages---most
seemed to agree those are likely due to hardware memory errors, but
confirming this is difficult.)

Problem #1:

Nov 28 17:15:13 f36 kernel: eth0: Transmit timed out: status 0050 0090
at 270892504/270892519 command 000c0000.
Nov 28 17:15:13 f36 kernel: eth0: Trying to restart the transmitter...
Nov 28 17:15:18 f36 kernel: eth0: Transmit timed out: status 0050 0090
at 270892504/270892519 command 000c0000.
Nov 28 17:15:18 f36 kernel: eth0: Trying to restart the transmitter...
Nov 28 17:15:23 f36 kernel: eth0: Transmit timed out: status 0050 0090
at 270892504/270892519 command 000c0000.
Nov 28 17:15:23 f36 kernel: eth0: Trying to restart the transmitter...

We get an infinite stream of these messages, about 5 seconds apart.
When this happens no ethernet connections can be made (no ping response
either). Since we have no access to most of these machines except
through the ethernet interface, when the ethernet isn't working we have
no way to fix it except by rebooting the machine.

The machines have Intel EEpro 100 ethernet adapters. The problem seems
to be triggered by heavy cpu load on the individual machine (there's no
correlation between different machines).

Problem #2:

Nov 29 10:58:53 f36 kernel: possible SYN flooding on port 80. Sending cookies.
Nov 29 11:00:13 f36 kernel: possible SYN flooding on port 80. Sending cookies.
Nov 29 11:00:36 f36 kernel: dst cache overflow
Nov 29 11:00:36 f36 last message repeated 9 times
Nov 29 11:00:41 f36 kernel: NET: 1043 messages suppressed.
Nov 29 11:00:41 f36 kernel: dst cache overflow
Nov 29 11:00:46 f36 kernel: NET: 1131 messages suppressed.
Nov 29 11:00:46 f36 kernel: dst cache overflow
Nov 29 11:00:51 f36 kernel: NET: 798 messages suppressed.

And so on. It's still possible to connect to these machines when this
is going on, but certainly not well. It's possible that this is a real
SYN flood, but I'm not convinced.

The kernel is 2.2.13 with Alan Cox's syncookie patch. The problem seems
to be triggered by external network traffic of some sort---it happens on
all of our web servers at once, perhaps because when some of them go
down as a result, the others get overloaded. (The webservers share our
traffic, via an Alteon load-balancing switch.)

Thanks in advance for your comments.

-- David desJardins

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/