RE: PATCH: Further aacraid work

From: Salyzyn, Mark
Date: Tue Jun 29 2004 - 15:56:59 EST


Oooo, good!

The Adapter has declared itself `dead' (maybe for reasons other than a
controlled blinkLED situation). Effectively a crash or complete loss of
communications with the Adapter. Last time I've seen this happened it
was a bad power supply.

Contact technical support to have them narrow down the actual cause of
adapter failure ... I don't know where the F/W updates are held on the
Dell Site.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Byron Stanoszek [mailto:gandalf@xxxxxxxxx]
Sent: Tuesday, June 29, 2004 4:21 PM
To: Salyzyn, Mark
Cc: Mark Haverkamp; Alan Cox; linux-kernel; linux-scsi
Subject: RE: PATCH: Further aacraid work

On Tue, 29 Jun 2004, Salyzyn, Mark wrote:

> Although I am gratified that 1.1.5-2345 works, and works well (*many*
> performance improvements have occurred in this driver, but some of
those
> have reached Alan's patch), I am at a loss because I was sure that the
> lockup was associated with commands taking longer than 60 seconds to
> complete (a 'can happen' with a complicated RAID card with multiple
> targets, that unfortunately gives the SCSI layer and especially ext3
> some headaches). The workaround in the reset handler is to let the
> commands quiesce to give the Firmware some breathing space to respond
to
> the test unit ready command that is issued to the target immediately
> following return from the reset handler.
>
> This not solving the problem has the implication, as Alan had noted,
> that this could be an issue with the adapter itself locking up or
going
> unresponsive under load. Both reports, this one and "PERC 3/Di broken
> 2.6.6-mm5 -> 2.6.7-rc1-mm" now smell of some subtle problems creeping
> in.

Turns out I spoke too soon. I repeated the rsync and logged this
information
from the serial console:

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter appears dead -3
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 8079007
Buffer I/O error on device sda1, logical block 1009868
lost page write due to I/O error on sda1
scsi1 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda1, logical block 1009869
lost page write due to I/O error on sda1
...etc
Buffer I/O error on device sda1, logical block 1009877
lost page write due to I/O error on sda1
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 8079159
scsi1 (0:0): rejecting I/O to offline device
...etc


FYI, I have another [production] server sitting next to this one running
aacraid and so far it hasn't had any problems. It's a PowerEdge 2450,
unsure as
to which PercRaid device it has. So now I'm comparing two systems:

Production: PowerEdge 2450, aacraid driver (1.1.2-lk1), PERCRAID Mirror,
Unknown HW
AAC0: kernel 2.1.4 build 2951
AAC0: monitor 2.1.4 build 2951
AAC0: bios 2.1.0 build 2951
AAC0: serial 73840ad0fafaf001

and

Development: PowerEdge 2400, aacraid driver (1.1-5[2345]), PERC RAID5,
Perc 2/Si
AAC0: kernel 2.1-3 build 2939
AAC0: monitor 2.1-3 build 2939
AAC0: bios 2.1-3 build 2939
AAC0: serial 58801d0

Thing I noticed is that the firmware build on my Development server (the
one
that's crashing) is older. Where can I find a firmware upgrade? I've
been
scouring Dell's site the last 20 minutes and all I can find are Linux
drivers.

> Can we get some additional tests on this?
>
> 1) Does this problem occur on a single CPU (Hyperthreading disabled
> too)? I am assuming some locking issues appeared, mainly because of
some
> restructuring in that code.

I'm going to run the above test right now and let you know how it turns
out.

Just for some history, I've been running 2.4 kernels on this machine for
2
years and haven't had a problem with the aacraid driver. Also, the
default
Fedora Core 2 install I had on it yesterday seems to also work
flawlessly too,
and it's a 2.6.5-smp system.

-Byron

--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: byron@xxxxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/