Re: mysterious 2.0.33 crashes

Doug Ledford (dledford@dialnet.net)
Tue, 17 Feb 1998 07:37:49 -0600


Alfredo Sanjuan wrote:
>
> >(Geez, I guess I had a few issues).
>
> Sure, same here.
>
> >My hardware (currently) is a Micron P-133 with Micronics MB and 64MB EDO
> >Ram and a Quantum 3GB IDE, PCI NE2000, DEC Tulip and WD8013 NICs (cable
> >modem, 100 and 10Mbps LAN).
>
> My hardware is a P200, 64Mb, 2 x Seagate ST525205 + 1 Quantum Fireball SE, PCI
> NE2000 Clone.
>
> Perhaps the NE2000 is the problem? Some days ago, my ethernet went nuts with
> rare messages (first time in about two years of working ok). There is some
> changes in the ne.c code? Perhaps the Pentium bug workaround code?

OK...from what I've seen so far, I think people may be artificially narrowing
the search. This has happened on 486, on Pentium, on ISA, on PCI, on IDE,
on SCSI. In general, there hasn't been a single uniform similarity. At
this point I would go one of two directions. First, there is the hardware
direction. There may be more than one driver causing problems. Between
2.0.29 and 2.0.33 there were updates to large numbers of hardware drivers,
including the 3Com drivers, the NE2000 driver, the SCSI drivers, and the IDE
driver. There were massive changes to the core networking code. There were
massive changes to the aic7xxx driver. All
in all, it could be one, or some combination of, these changes that is
causing the problem. We've already identified a few things. For one, I
suspect that the hard lockups, no ping, no nothing else, are related to the
posting Dan Hollis made concerning his kernel looping, which was taking
place in tcp_input. Without verification, I'm guessing that the other
lockups people see are similar in that they are an infinite loop somewhere
in the net code. Alan has been trying to track the odd error that causes
the tcp_recvmsg() oops, and I personally suspect this might be related. I
also suspect that it might be a driver thing, since certain hardware
combinations never seem to produce this. Some driver may be doing the wrong
thing at the wrong place, and it may be more than just one driver. As you
all recall, during Dave Miller's change over to the new socket hashing algo,
Dave also made lots of changes to individual device drivers to make sure
they returned the right values at the right times from the dev_hard_start_xmit()
call (I think that was the call, in any case I recall seeing the patch float
through). Updating a large number of drivers like that could be very
difficult, and possibly something might have gotten missed in the process.
In any case, there are also many more possibilities, but I think you all get
my point. Second, the problem could be related to one (or a few) specific config
options. For instance, I never enable multicast routing (although I do
enable IP multicast support), I never use PCI bridge optimizations, nor
Triton IDE support, blah, blah and so on. These could make a difference.
If everyone that has been reporting problems could please send me a copy of
their complete .config file, then I'll try and look for any similarities
(please don't send them to the list, it doesn't need that much traffic :)
When I get them in I'll sit down and do a summary of identical items from
each config.
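
For anyone who wants to do the same comparison on their own pile of configs,
here is a rough sketch of what "summary of identical items" could look like.
It is only an illustration, not the procedure actually used: the function
names parse_config() and common_settings() are made up for this example, and
it assumes the usual .config layout of "CONFIG_FOO=value" lines plus
"# CONFIG_FOO is not set" comments for disabled options.

```python
import re

# Disabled options appear in a .config as comments of this form.
_NOT_SET = re.compile(r"^#\s*(CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value} for the contents of one kernel .config file.

    Options marked "is not set" are recorded with the value "n" so that
    "disabled everywhere" also shows up as a similarity.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _NOT_SET.match(line)
        if match:
            options[match.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def common_settings(configs):
    """Return the {option: value} pairs that every parsed config shares.

    An option that is missing from any one config is left out, since it
    cannot be called identical across all of them.
    """
    if not configs:
        return {}
    shared = set(configs[0].items())
    for cfg in configs[1:]:
        shared &= set(cfg.items())
    return dict(sorted(shared))
```

Feeding it the text of each reported .config (one parse_config() call per
file, then common_settings() on the list) leaves only the options that are
set the same way on every affected machine, which is the short list worth
staring at first.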

Now, for those people that experience this problem on a regular basis,
talking about the lockups and oopses in tcp_recvmsg(), not other oopses and
problems, here are a few things I would do to start off with:

Use new IDE driver with RZ1000 and CMD640 bugfix support *only*
Disable tagged queueing and SCB paging on the aic7xxx driver
Disable unneeded net options, disable large window net support, enable path
MTU discovery.
If you have access to it, try building the system with nothing but tulip
cards (not everyone can do this, but I would be *very* interested to hear if
this helps someone as that would point a finger at the old net card
driver/hardware). The reason I say tulip is because I have about 8 machines
that work as routers and news servers and other things all running nothing
but tulip hardware and since 2.0.33 was released there has not been a
*single* tcp related oops on these machines (some of which pass 400 to 500
packets per second 24 hours a day and another that has had as many as
300,000+ non-cached web hits in a single day, over 7,000,000 hits that month
on a single virtual host).

I would recommend more, but I need to take a look at the .config options in
use first; these are just generic things that might help.

Now, finally, to Ken Jordan concerning his post. Man, I'm sorry, but that
looks *very* much like random data corruption to me, not a kernel bug. I
could understand the lockups and related oopses being kernel related, and
yours still may be, but my first area to look on your machine would be
hardware. My second area to look on your machine would be for some driver
that is using a wild pointer and writing to it. However, the IDE config I
posted above may help you if what we are seeing here is memory corruption
during an IDE swap transaction or something similar.

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
