Intel Gigabit NIC (2.6.5 -> 2.6.10) Bug(?) Found

From: Justin Piszcz
Date: Sun Feb 20 2005 - 09:04:14 EST


What is this e-mail about?

Something in the kernel changed regarding the Intel e1000 driver between 2.6.5 and 2.6.10. The change resulted in thousands of errors when the NIC is receiving data. I have been thinking about this for the past two weeks and have tried everything I could think of; it has really been pestering me. I never really looked at ifconfig for eth0, eth1, etc. because I checked it a long time ago, with earlier kernels, and it was fine. I guess I should check my NIC statistics more often. I have tried the following to figure out why I get so many dropped packets and errors on an interface:

1] New Intel [same model] NIC.
2] Different ports in the switch.
3] New cable.
4] Switched PCI slots for the Intel Gigabit Card.
5] Changed the BIOS settings/parameters to exactly match those of the other, identical machine.

None of these fixed the problem. There are two identical machines on the network here, both with the same Intel Gigabit NIC (82541GI/PI). On one there are very few errors on the NIC (1-3), if any. The test that reproduces the problem the quickest is dd if=/dev/zero of=/nfsv3/udp/file.img, where the dd runs on another box and writes to the box that gets the RX errors on the NIC. Generally, there would be about 100 errors every 10 seconds. Since one machine is running 2.6.5 and the other 2.6.10, I figured it had to be something in the kernel that was causing this. So I installed ethtool and did a basic query of the network setting parameters. I immediately noticed a difference, which is shown below:

* Box with no problems.
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: on

* Box with NIC that generates errors, dropped packets and overrun errors.
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: off
TX: off
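
To compare the two boxes at a glance, the pause settings can be boiled down to one line each. This is only a rough sketch: pause_summary is a made-up helper name (not part of ethtool), and it assumes the "Autonegotiate:", "RX:" and "TX:" labels appear exactly as in the output above.

```shell
#!/bin/sh
# pause_summary: hypothetical helper, not an ethtool feature.
# Reads `ethtool -a` output on stdin and prints a one-line summary,
# e.g. "autoneg=on rx=off tx=off".
pause_summary() {
    awk '
        $1 == "Autonegotiate:" { a = $2 }   # autonegotiation of pause
        $1 == "RX:"            { r = $2 }   # RX pause on/off
        $1 == "TX:"            { t = $2 }   # TX pause on/off
        END { printf "autoneg=%s rx=%s tx=%s\n", a, r, t }
    '
}

# On a live box (needs ethtool installed, usually run as root):
#   ethtool -a eth0 | pause_summary
```

Running that on both machines and comparing the two lines should make a mismatch like the one above obvious.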

According to the manpage:

-A change the pause parameters of the specified ethernet device.

rx on|off
Specify if RX pause is enabled.

tx on|off
Specify if TX pause is enabled.


# ethtool -A eth0 rx on
# ethtool -A eth0 tx on

My machine now:

# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: on

Then I re-ran the dd command mentioned earlier and let it run for about ten minutes. Lo and behold, not a single dropped packet, overrun or frame error reported!

RX packets:6157606 errors:0 dropped:0 overruns:0 frame:0

Previously, this is what I would get after only a minute of running that dd command (I also get the errors when copying files etc.; the dd command just speeds things up):

RX packets:6374096 errors:1419 dropped:1419 overruns:1419 frame:0
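
For anyone who wants to watch these counters without eyeballing ifconfig, something like the following could pull a single value out of the RX line. It is a sketch only: rx_counter is a hypothetical helper name, and it assumes the older ifconfig "name:value" layout shown above (newer net-tools versions print a different format).

```shell
#!/bin/sh
# rx_counter: hypothetical helper.
# Reads ifconfig output on stdin and prints the value of one counter
# (errors, dropped, overruns or frame) from the "RX packets:" line.
rx_counter() {
    awk -v name="$1" '
        /RX packets:/ {
            for (i = 1; i <= NF; i++) {
                split($i, kv, ":")          # "errors:1419" -> kv[1], kv[2]
                if (kv[1] == name) print kv[2]
            }
        }
    '
}

# On a live box, sample twice while the dd is running and compare:
#   ifconfig eth0 | rx_counter errors
```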

Afterwards, I no longer have any errors.

To the Intel/Kernel guys:

Question: these are identical machines for the most part, and even the same NICs are used in each box, so why are the settings in 2.6.5 different from those in 2.6.10? I do not believe that it is a distribution-specific error, as I did not even have ethtool installed before I checked this, nor do I see it in any boot scripts. For now, I will just set the proper settings with -A tx on and -A rx on, but is there another way to do this, or did it change in the kernel at some point?

Further investigation reveals that on my main machine (2.6GHz w/HT), which has an onboard Intel PRO/1000 NIC on the CSA bus (A-Bit IC7-G), the pause feature is also off; HOWEVER, this machine does not exhibit any errors!

RX packets:2471666 errors:0 dropped:0 overruns:0 frame:0
TX packets:56413066 errors:0 dropped:0 overruns:0 carrier:0

Is it a bug that it defaults to off in the newer kernel versions, given that it causes MASSIVE errors on the RX side of the fence? Or should people who run gigabit interfaces on slower machines just add the ethtool commands to their startup scripts to avoid the errors?
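
If the startup-script route is the answer, a minimal sketch might look like the following. The function name enable_pause is made up for illustration, and where to call it from (rc.local or a distribution-specific network script) is an assumption that depends on the distro; only the ethtool -a and ethtool -A invocations come from the commands used above.

```shell
#!/bin/sh
# enable_pause: hypothetical helper for a local boot script.
# Turns RX and TX pause on for an interface, but only if ethtool
# currently reports at least one of them as off.
enable_pause() {
    iface=$1
    if ethtool -a "$iface" | grep -Eq '^(RX|TX):[[:space:]]*off'; then
        # Same effect as running -A rx on and -A tx on separately.
        ethtool -A "$iface" rx on tx on
    fi
}

# Example call from a boot script (needs root):
#   enable_pause eth0
```

Checking first avoids touching the NIC on boxes where pause is already on.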

There may be some relationship between CPU speed and whether the machine can keep up with the pause option on or off. If anyone has any idea what the pause option is about and why it changed from 2.6.5 to 2.6.10, I'd like to know!

Thanks!

