Re: 3com 3c905c-txm

From: Andrey Savochkin (saw@saw.sw.com.sg)
Date: Sun May 14 2000 - 21:56:23 EST


Hello,

On Sat, May 13, 2000 at 12:17:29AM -0400, Donald Becker wrote:
[snip]
> I know all about not being able to reproduce problems. Some versions of the
> eepro100 chip have a bug where they switch into "broken mode". The hardware
> and driver will work fine for weeks, then something will go wrong
> (presumably with the internal firmware). Despite resetting everything, the
> chip will stop again after sending just a few packets.
>
> The problem for me is when someone encounters this, makes a driver change,
> and their modified driver works for a week without a problem. They proclaim
> that their new driver is much more reliable, and that they have fixed The
> Bug.
>
> Usually they haven't fixed anything, or even introduced bugs, but to them
> all evidence points to a successful fix. When I say "that's not a fix",
> they bypass me and submit a patch to Linus. Linus, not knowing the whole
> story, puts the patch in. After all, here is someone Doing Something about
> The Problem, as opposed to Donald, who is trying to keep everything a
> secret over on the mailing lists. (I'm trying to minimize what he has to
> deal with, and trying to minimize change points in the mainline kernel.)
>
> The bottom line is that for well-established code you should establish
> what the actual bug is. That means being able to reproduce it at will, and
> having a good explanation of how it is occurring. Ideally you should
> measure or directly demonstrate what is happening.
>
> There are things that mask bugs, but don't fix them. Putting in locks, or
> randomly reordering the code frequently has this effect. Locks, especially,
> slow the code down and can reduce the symptom frequency without removing the
> true problem.

Well, I want to clarify the references to eepro100.
You seem to have chosen the wrong example.

Yes, I've made a lot of changes to your original driver.
But I accompany my driver with the changelog file
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/v2.3/eepro100.changelog
and keep a record of all changes. I can reproduce and explain every fixed bug
and justify the fixes.

I suspect I know what you mean by "broken mode".
The chip really goes crazy when it follows a NULL link in the RX ring.
Unfortunately, many of your original drivers, including eepro100, were
completely unable to survive out-of-memory conditions.
In eepro100.c you
 - dereference a potentially NULL pointer in show_state()
 - restart the RX process on seeing the "no resource" flag from the chip
 - restart the chip in tx_timeout (which comes after the chip has followed
   a NULL link in the RX ring and hung).
Well, I may have forgotten something else...
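
To illustrate the NULL-link problem, here is a minimal user-space sketch of
an RX ring refill that tolerates allocation failure. It is not code from
eepro100.c: struct rx_desc, struct rx_ring, rx_refill(), rx_walk() and
budget_alloc() are all hypothetical names, and the descriptor layout is an
assumption made for the example. The idea it shows is that the refill stops
at the first failed allocation and leaves an end-of-list mark on the last
good descriptor, so the "chip" never walks into a slot without a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RX_RING_SIZE 8

/* Hypothetical, simplified descriptor; NOT the real eepro100 layout. */
struct rx_desc {
    void *buf;          /* packet buffer, NULL if none is attached */
    int end_of_list;    /* the chip must stop after this descriptor */
};

struct rx_ring {
    struct rx_desc desc[RX_RING_SIZE];
    unsigned int head;      /* first descriptor the chip will use */
    unsigned int filled;    /* consecutive descriptors with buffers */
};

typedef void *(*alloc_fn)(size_t);

/* Test helper: succeeds only while alloc_budget is positive. */
int alloc_budget;

void *budget_alloc(size_t size)
{
    if (alloc_budget <= 0)
        return NULL;
    alloc_budget--;
    return malloc(size);
}

/*
 * Attach buffers to empty slots. Stops at the first allocation failure
 * and returns the number of slots filled by this call. The last filled
 * slot always carries the end-of-list mark.
 */
int rx_refill(struct rx_ring *r, alloc_fn alloc, size_t buf_size)
{
    int added = 0;

    assert(r != NULL && alloc != NULL);
    while (r->filled < RX_RING_SIZE) {
        unsigned int slot = (r->head + r->filled) % RX_RING_SIZE;
        void *buf = alloc(buf_size);

        if (buf == NULL)
            break;                      /* out of memory: retry later */
        r->desc[slot].buf = buf;
        r->desc[slot].end_of_list = 1;  /* this slot is the new tail */
        if (r->filled > 0) {
            unsigned int prev = (slot + RX_RING_SIZE - 1) % RX_RING_SIZE;
            r->desc[prev].end_of_list = 0;  /* old tail links onward */
        }
        r->filled++;
        added++;
    }
    return added;
}

/*
 * Simulated chip: returns how many descriptors it can use before the
 * end-of-list mark, 0 if the ring is empty (receiver must stay idle),
 * or -1 if it would run into a descriptor without a buffer.
 */
int rx_walk(const struct rx_ring *r)
{
    unsigned int n;

    if (r->filled == 0)
        return 0;
    for (n = 0; n < RX_RING_SIZE; n++) {
        const struct rx_desc *d = &r->desc[(r->head + n) % RX_RING_SIZE];

        if (d->buf == NULL)
            return -1;
        if (d->end_of_list)
            return (int)(n + 1);
    }
    return -1;  /* no end-of-list mark found */
}
```

With a zero-initialised ring and alloc_budget set to 3, rx_refill() fills
three slots and rx_walk() reports three usable descriptors; a later refill
with memory available extends the ring without ever exposing an empty slot.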

skb allocation failures are rare, and it may take months to hit such a
condition again. So it may make you think that the driver works without
problems. _I_ do not buy such evidence. I stress-tested the driver on a
busy proxy server with /proc/sys/vm/freepages artificially reduced to very
low numbers. During the tests I hit skb allocation failures every couple of
minutes and really verified that my code works.
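
The same idea, forcing allocation failures to happen often instead of
waiting months for one, can be sketched in user space with a simple fault
injector. This is only an illustration, not how the driver was tested
(that was done by lowering /proc/sys/vm/freepages on a live server):
struct fault_injector, fi_alloc(), receive_one() and stress() are
hypothetical names, and receive_one() is a stand-in for a real receive path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fault injector: every fail_every-th allocation fails. */
struct fault_injector {
    unsigned long calls;        /* allocations attempted so far */
    unsigned long failures;     /* allocations forced to fail */
    unsigned long fail_every;   /* 0 disables injection */
};

void *fi_alloc(struct fault_injector *fi, size_t size)
{
    assert(fi != NULL);
    fi->calls++;
    if (fi->fail_every != 0 && fi->calls % fi->fail_every == 0) {
        fi->failures++;
        return NULL;            /* injected out-of-memory condition */
    }
    return malloc(size);
}

/*
 * Stand-in for the code under test. Returns 1 if the "packet" was
 * handled and 0 if it was dropped because no buffer was available.
 */
int receive_one(struct fault_injector *fi, size_t len)
{
    unsigned char *pkt;

    assert(len > 0);
    pkt = fi_alloc(fi, len);
    if (pkt == NULL)
        return 0;               /* drop the packet and keep running */
    pkt[0] = 0;                 /* touch the buffer as real code would */
    free(pkt);
    return 1;
}

/* Run n receive attempts and return how many were handled. */
unsigned long stress(struct fault_injector *fi, unsigned long n)
{
    unsigned long i, ok = 0;

    for (i = 0; i < n; i++)
        ok += (unsigned long)receive_one(fi, 64);
    return ok;
}
```

Running stress() with fail_every set to a small number exercises the
failure branch on a fixed fraction of calls, so a missing NULL check shows
up within seconds rather than after weeks of normal operation.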

Regards
                                        Andrey V. Savochkin
