RE: 2.0.33 locks + aic7xxx warning message

Doug Ledford (dledford@dialnet.net)
Tue, 06 Jan 1998 14:13:54 -0600 (CST)


On 06-Jan-98 Burkhard Bunk wrote:
>Dead Doug,
^^^^
I hope that's a typo and not a psychic prediction %^)

>with kernel 2.0.33 (both with and without SMP), I found the following
>strange warning among the boot messages:
>
>....
>kernel: aic7xxx: <Adaptec AHA-294X Ultra SCSI host adapter> at PCI 13
>kernel: aic7xxx: Warning - detected auto-termination. Please verify driver
>kernel: detected settings and use manual termination if necessary.

This indicates that the card is automatically configuring the termination
for you at each boot (in the Adaptec SCSI BIOS the termination setting is
set to Auto as opposed to actually telling the card what termination to
use). If it works fine for you, then great, but if there are problems, then
setting the termination to the proper type in the Adaptec BIOS has been
known to solve a few problems. That's what this message is all about. So,
for instance, if you went into the Adaptec SCSI BIOS and set the termination
to Low On/High On (which should be correct given the bootup info from below)
the message would be gone and we would know that termination is configured
properly.

>kernel: aic7xxx: BIOS enabled, IO Port 0x8000, IO Mem 0xe4000000, IRQ 11,
>Revision B
>kernel: aic7xxx: Single Channel, SCSI ID 7, 16/255 SCBs, QFull 16, QMask
>0x1f
>kernel: scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI)
>4.1.1/3.2.1
>kernel: scsi : 1 host.
>kernel: scsi0: Scanning channel A for devices.
>kernel: Vendor: QUANTUM Model: FIREBALL_TM2110S Rev: 300X
>kernel: Type: Direct-Access ANSI SCSI revision: 02
>kernel: Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
>kernel: scsi : detected 1 SCSI disk total.
>kernel: SCSI device sda: hdwr sector= 512 bytes. Sectors= 4124736 [2014 MB]
>[2.0 GB]
>....
>
>--------------------------------
>My configuration:
>Tyan Titan pro Dual Pentium Pro board with 2 x PPro/200MHz
>128MB RAM (mem=127M in lilo.conf)
>SMC Ultra ethernet card
>
>The SCSI adapter identifies as
>Adaptec AHA-2940 Ultra/Ultra W BIOS v 1.23
>
>This is /proc/scsi/aic7xxx/0:
>
>Adaptec AIC7xxx driver version: 4.1.1/3.2.1
>Compile Options:
> AIC7XXX_RESET_DELAY : 15
> AIC7XXX_CMDS_PER_LUN : 8
> AIC7XXX_TAGGED_QUEUEING: Enabled
> AIC7XXX_PAGE_ENABLE : Enabled
> AIC7XXX_PROC_STATS : Disabled
>
>Adapter Configuration:
> SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
> (AIC-788x chipset)
> Host Bus: Single
> Base IO: 0x8000
> Base IO Memory: 0xe4000000
> IRQ: 11
> SCBs: Used 8, HW 16, Page 255
> Interrupts: 27717
> Serial EEPROM: True
> Extended Translation: Enabled
> SCSI Bus Reset: Enabled
> Ultra SCSI: Disabled
>Disconnect Enable Flags: 0xff
>
>-----------------------------------------------
>I am plagued with lockups of the system anyway, they occur every few days
>without SMP and within a few hours with SMP. The system hangs with no
>messages around, as far as I can see.
>I tend to blame the SCSI adapter for that, because I have 3 more machines
>with (almost) the same hardware, but they say
>
>Adaptec AHA-2940 AU BIOS v 1.30
>
>and are stable (even under SMP) for several months.
>
>I tried kernel 2.1.72 + SMP on the unstable machines, with a strange
>result:
>it locked cpu0 and continued to run on cpu1. This also survived a reboot,
>I had to press the reset button to get cpu0 back to work...

This, actually, sounds like what I wanted to hear. The results under SMP on
2.1.x are what I would expect from a particular error condition that isn't
handled in the current version of the driver, but is handled in the new code
I'm working on. Essentially, if you get a PCI bus error during operation on
this machine, there is an interrupt generated to signal that error. The
current version of the driver doesn't know how to deal with that error and
therefore never clears the interrupt source. As a result, the machine
doesn't really lock up, it's just being flooded with PCI error interrupts so
fast that CPU0 can do nothing else but answer the interrupt. I expect to
have my new version of the code ready before too much longer. I *just* got
through with the new interrupt handler last night, and I'm now moving on to
the various pieces of probe code and whatnot. I haven't even made my first
compile test yet, and I purposefully renamed a bunch of defines to force me
to go back and correct all of them when I start compiling, so there is still
a decent amount of work left. That code will handle this error condition and
also notify the user of the PCI bus errors. FWIW, the card itself shouldn't
be any problem for you, I use that exact same card on my home machine and it
works flawlessly for me, so I would tend to say that the PCI error condition
is a distinct possibility on your machine.
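
To illustrate the interrupt flood described above, here is a minimal
user-space sketch in C. It is not the aic7xxx driver code: the status bits
(INT_CMD_DONE, INT_PCI_ERROR), struct fake_dev, and the functions isr_old,
isr_new and deliver are all hypothetical names made up for this example. It
assumes a level-triggered interrupt line that stays asserted for as long as
any status bit remains set.

```c
#include <stdint.h>

/* Hypothetical status bits; NOT the real aic7xxx register layout. */
#define INT_CMD_DONE  0x01u
#define INT_PCI_ERROR 0x02u

/* Simulated adapter: the interrupt line is asserted while intstat != 0. */
struct fake_dev {
    uint8_t intstat;
};

typedef void (*isr_fn)(struct fake_dev *);

unsigned pci_errors_reported = 0;

/* Handler that only understands command completion. */
void isr_old(struct fake_dev *dev)
{
    if (dev->intstat & INT_CMD_DONE)
        dev->intstat &= (uint8_t)~INT_CMD_DONE;
    /* INT_PCI_ERROR is never acknowledged, so the line stays asserted. */
}

/* Handler that also reports and clears the PCI error source. */
void isr_new(struct fake_dev *dev)
{
    if (dev->intstat & INT_CMD_DONE)
        dev->intstat &= (uint8_t)~INT_CMD_DONE;
    if (dev->intstat & INT_PCI_ERROR) {
        pci_errors_reported++;                    /* notify the user */
        dev->intstat &= (uint8_t)~INT_PCI_ERROR;  /* clear the source */
    }
}

/*
 * Keep invoking the handler while the line is asserted, giving up after
 * 'limit' calls. Returns the number of times the handler ran.
 */
unsigned deliver(struct fake_dev *dev, isr_fn isr, unsigned limit)
{
    unsigned calls = 0;
    while (dev->intstat != 0 && calls < limit) {
        isr(dev);
        calls++;
    }
    return calls;
}
```

With both bits set, deliver() using isr_old runs until it hits the limit,
because the error bit is never cleared; that is the simulated equivalent of
a CPU that can do nothing but answer the interrupt. With isr_new the line
drops after a single call and the error is counted once.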

>
>Any ideas about further diagnostics and possible fixes are welcome
>(and badly needed)!

It would take a little time, but I could probably back port that PCI Error
condition code into the current driver for you to test and see if that makes
a difference. However, I haven't even so much as compile tested it, so I
make no guarantees at this point in time. Let me know if you'd like an
alpha/beta patch for this problem :)

----------------------------------
E-Mail: Doug Ledford <dledford@dialnet.net>
Date: 06-Jan-98
Time: 14:13:56
----------------------------------