RE: 2.0.33 locks + aic7xxx warning message

Burkhard Bunk (bunk@birke.physik.hu-berlin.de)
Mon, 12 Jan 1998 17:56:49 +0100 (MET)


Hi Doug,
^^
[that's typo safe]

I set manual termination to ENABLED on the AHA-2940 Ultra, and it works,
many thanks! It has been up and running Linux 2.0.33-SMP for >3 days,
that's good news. Btw, if I DISABLED termination and attached a (physical)
terminator instead, would that also be ok?

> On 06-Jan-98 Burkhard Bunk wrote:

> >with kernel 2.0.33 (both with and without SMP), I found the following
> >strange warning among the boot messages:
> >
> >....
> >kernel: aic7xxx: <Adaptec AHA-294X Ultra SCSI host adapter> at PCI 13
> >kernel: aic7xxx: Warning - detected auto-termination. Please verify driver
> >kernel: detected settings and use manual termination if necessary.
>
> This indicates that the card is automatically configuring the termination
> for you at each boot (in the Adaptec SCSI BIOS the termination setting is
> set to Auto as opposed to actually telling the card what termination to
> use). If it works fine for you, then great, but if there are problems, then
> setting the termination to the proper type in the Adaptec BIOS has been
> known to solve a few problems. That's what this message is all about. So,
  ^^^^^
Where is this knowledge going? Some time ago, it used to be in the
various HOWTO documents, but they are pretty out of date by now
(Hardware, Kernel, SCSI) or don't exist at all (SMP). For
troubleshooting, we need checklists of known problems!
                         ============================
(Otherwise, the mailing lists will be flooded by FAQs.)

> >-----------------------------------------------
> >I am plagued with lockups of the system anyway, they occur every few days
> >without SMP and within a few hours with SMP. The system hangs with no
> >messages around, as far as I can see.
> >I tend to blame the SCSI adapter for that, because I have 3 more machines
> >with (almost) the same hardware, but they say
> >
> >Adaptec AHA-2940 AU BIOS v 1.30
> >
> >and are stable (even under SMP) for several months.
> >
> >I tried kernel 2.1.72 + SMP on the unstable machines, with a strange
> >result:
> >it locked cpu0 and continued to run on cpu1. This also survived a reboot,
> >I had to press the reset button to get cpu0 back to work...
>
> This, actually, sounds like what I wanted to hear. The results under SMP on
> 2.1.x are what I would expect from a particular error condition that isn't
> handled in the current version of the driver, but is handled in the new code
> I'm working on. Essentially, if you get a PCI bus error during operation on
> this machine, there is an interrupt generated to signal that error. The
> current version of the driver doesn't know how to deal with that error and
> therefore never clears the interrupt source. As a result, the machine
> doesn't really lock up, it's just being flooded with PCI error interrupts so
> fast that CPU0 can do nothing else but answer the interrupt. I expect my
> new version of the code before too much longer (I *just* got through with
> the new interrupt handler last night, and I'm now on through to the various
> pieces of probe code and whatnot, but I haven't even made my first compile
> test yet and I purposefully renamed a bunch of defines to force me to go
> back and correct all of them when I start compiling, so there is still a
> decent amount of work left). That code will handle this error condition and
> also notify the user of the PCI bus errors. FWIW, the card itself shouldn't
> be any problem for you, I use that exact same card on my home machine and it
> works flawlessly for me, so I would tend to say that the PCI error condition
> is a distinct possibility on your machine.
>
> >
> >Any ideas about further diagnostics and possible fixes are welcome
> >(and badly needed)!
>
> It would take a little time, but I could probably back port that PCI Error
> condition code into the current driver for you to test and see if that makes
> a difference. However, I haven't even so much as compile tested it, so I
> make no guarantees at this point in time. Let me know if you'd like an
> alpha/beta patch for this problem :)
  ^^^^^^^^^^^^^^^^
I am going to release one of the `cured' machines for SMP production,
but can afford to reserve one for occasional tests etc. So, if you
have a patch, I will certainly test it.

Greetings,
Burkhard.