Re: 2.0.27 major problems #1 -- 3c59x driver.

Cameron MacKinnon (mackin@interlog.com)
Thu, 13 Feb 1997 01:19:16 -0500


> From: Chris Evans <chris@ferret.lmh.ox.ac.uk>
> On Wed, 12 Feb 1997, Philip Blundell wrote:
> > A transmitter access conflict is not a disaster. There is no need to
> > reinitialise the controller - all it means is that the driver's
> > transmit routine was reentered, and the second transmit was deferred to
> > avoid contention.
>
> I am forced to disagree -- when your card hangs it certainly _is_ a
> disaster. Additionally, the code implies that if execution reaches
> this stage it is a disaster anyway; quote "if this ever happens then the
> queue layer is doing something evil"

NOT being an expert in the Linux networking code, a few disinterested
observations:
- Maybe the evil IS in the queue layer, and others haven't noticed as
their ethernet performance isn't as stellar as yours. Do the errors
occur randomly, or only under high load?
- Is there any way of a) dumping the stack and freezing when the error
occurs, so as to analyze the kernel state that led to it (easy, just
write it 8-), b) returning a special code when this occurs so that the
higher layers of network code can dump all appropriate state (see the
answer to a) above), or c) disabling all except disk interrupts and
writing a kernel or entire machine core image to swap space when this
occurs?
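
A rough user-space sketch of points a) and b), not taken from the 3c59x
driver: the transmit routine records a snapshot of its own state when it
finds itself reentered, and returns a distinct code so that a caller
could dump its own state as well. Every name here (fake_dev,
fake_start_xmit, TX_CONFLICT and so on) is hypothetical, and a real
driver would log with printk() rather than fprintf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */

enum tx_result { TX_OK = 0, TX_CONFLICT = 1 };

/* State captured at the moment a conflict is detected. */
struct tx_snapshot {
    unsigned long now;           /* time at which the conflict was seen */
    unsigned long last_tx_start; /* when the in-progress transmit began */
    int queue_len;               /* packets queued at that moment */
    int valid;                   /* nonzero once a snapshot was taken */
};

struct fake_dev {
    int tbusy;                   /* nonzero while a transmit is in progress */
    unsigned long last_tx_start;
    int queue_len;
    unsigned long conflicts;     /* how many conflicts were seen so far */
    struct tx_snapshot snap;     /* state at the most recent conflict */
};

/* Record and log the device state that led to the conflict. */
static void record_conflict(struct fake_dev *dev, unsigned long now)
{
    assert(dev != NULL);
    dev->snap.now = now;
    dev->snap.last_tx_start = dev->last_tx_start;
    dev->snap.queue_len = dev->queue_len;
    dev->snap.valid = 1;
    dev->conflicts++;
    fprintf(stderr, "tx conflict #%lu: now=%lu tx_started=%lu queue_len=%d\n",
            dev->conflicts, now, dev->last_tx_start, dev->queue_len);
}

/* Returns TX_CONFLICT instead of silently deferring, so that the
 * caller has a chance to dump its own state too. */
static enum tx_result fake_start_xmit(struct fake_dev *dev, unsigned long now)
{
    if (dev->tbusy) {
        record_conflict(dev, now);
        return TX_CONFLICT;
    }
    dev->tbusy = 1;
    dev->last_tx_start = now;
    return TX_OK;
}

/* Called when the (simulated) hardware finishes a transmit. */
static void fake_tx_done(struct fake_dev *dev)
{
    dev->tbusy = 0;
}
```

Calling fake_start_xmit() twice without an intervening fake_tx_done()
makes the second call return TX_CONFLICT and leaves the timing of both
calls in dev.snap for later inspection.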

Maybe I've missed some information on this thread, but what I've seen
so far ("Somewhere at or after reaching <vaguely defined state x> my
machine hangs") doesn't give a potential debugger much to go on.
How many printk()s have been added to the code so far in an attempt to
understand what's going on? Calls to a function to dump state? Rather
than wasting time arguing whether it's a problem or not, the affected
user should endeavour to provide as much information as possible. This
may involve kernel modifications, hired help, packet sniffers,
in-circuit emulators, experimentation with different hardware, voodoo,
dead poultry and inconvenience to users. On the other hand, if he finds
it more cost-effective to replace the offending hardware and/or OS, so
be it.

My sincerest apologies if I've missed something relevant.