Re: HELP! Bug in aic7xxx + scsi?

Steve Sparks (sparks@socketware.com)
Wed, 17 Nov 1999 10:30:50 -0500



Ok. I got the machine back together and I've got it using the Quantum drive -
not as a primary, but to get things off. Here's what I found:

* copying >10MB files caused it to crash.
* I got the source to 'cp' and added a call to sync() after every read/write
pair in copy.c. I was able to copy huge files without crashing - slowly, what
with all the syncs, but stably. Perhaps there's something to the "buffer-list
destroyed" messages I was seeing.
* gpart may well be the coolest utility on the planet.
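
For anyone who wants to try the same workaround, here is a minimal sketch of
the read/write/sync loop described above. It is not the actual patch to
copy.c: the function name copy_with_sync, the 8 KB buffer size and the error
handling are all assumptions made for illustration. Only read(), write() and
sync() come from the description.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real fileutils code): copy src_path to
 * dst_path, calling sync() after every read/write pair so that dirty
 * buffers are flushed before the next chunk is queued.
 * Returns 0 on success, -1 on any error.
 */
static int copy_with_sync(const char *src_path, const char *dst_path)
{
    char buf[8192];            /* assumed chunk size */
    ssize_t nread;
    int result = -1;

    assert(src_path != NULL && dst_path != NULL);

    int in = open(src_path, O_RDONLY);
    if (in < 0)
        return -1;

    int out = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while ((nread = read(in, buf, sizeof buf)) > 0) {
        char *p = buf;
        ssize_t left = nread;

        /* write() may be partial, so loop until the chunk is out. */
        while (left > 0) {
            ssize_t nwritten = write(out, p, (size_t)left);
            if (nwritten < 0)
                goto done;
            p += nwritten;
            left -= nwritten;
        }

        /* The workaround: flush everything after each pair. */
        sync();
    }

    if (nread == 0)            /* clean end of file */
        result = 0;

done:
    close(in);
    if (close(out) < 0)
        result = -1;
    return result;
}
```

Calling copy_with_sync("bigfile", "bigfile.copy") copies one file this way.
It is slow by design, since sync() flushes every dirty buffer on the system,
not just the ones belonging to this copy.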

I'm not a kernel hacker, but perhaps something has changed about how file
buffers are queued up before a sync? The last thing it would print before
locking up would always be "in swapper task- not syncing".

This is no longer a critical problem for me, since I don't rely on these drives
anymore, but it's still a problem for others. I got mail from a guy re: my post
yesterday saying he had the same problem, but he identified his errors as
coming after an upgrade from 2.2.10 to 2.2.13. These are all Quantum SCSI
drives - has scsi.c changed?

The Quantum drive's .config and /var/log/messages can be found at
http://www.cs.alfred.edu/~sparkssc/quantum - thanks for any and all help :)

-S

Steve Sparks wrote:

> I've got a problem I'm having a hard time debugging, but here's what I have
> so far (these are two bugs, one should be easy to find&fix)
>
> 1) I have an AIC-7895. When I set the kernel up to compile in the aic7xxx
> driver, it loads it in the kernel but then it also loads the module -
> redetecting the two channels, for a total of four hosts, causing an
> interrupt conflict, and panicking. No biggie, I just load aic7xxx by module
> only, but you might want to fix it at leisure.
>
> 2) *THIS IS THE KILLER PROBLEM*
>
> I have a Quantum Viking II 8.7gb drive. I was running RH5.2 (lk 2.0.36) on
> it for ~8 months with no problems. Yesterday I tried to upgrade to RH6.0 (lk
> 2.2.5-15) and immediately began having errors.
>
> The condition was any bulk read or write, i.e. copying a 150MB file around or
> tarring up a big tree or running a tape backup. The message was something
> like
>
> Kernel panic: scsi_free: attempting to free unused memory
>
> I don't recall exactly. There were other errors - some bus timeouts - but
> most were the above. The machine would not boot after that - when it detects
> the filesystems, it would e2fsck since they weren't cleanly unmounted; the
> act of e2fsck would cause the error again. I ran out and bought a WD IDE
> drive; I got RH60 installed, but the partition table on the quantum drive is
> so hosed that fdisk prints an empty table (but, surprisingly, properly
> detects the CHS of the drive.) If anyone has a utility that will allow me to
> at least detect the ext2 partitions I can get my .config and
> /var/log/messages to help debugging.
>
> I used to have Quantum Atlas SCSI hot-swappables in a Dell Poweredge 2300
> running 2.2.5-15, and it crashed all the time too; our RAID card vendor said
> something about Quantum's termination being non-standard, but I don't know
> if that was tech support BS or for real. After having the viking blow up
> with 2.2.5, I think it was for real.
>
> We replaced the hot-swappables with IBM drives and the condition went away.
>
> In any event, it worked with 2.0.36 scsi; it fails with 2.2.5 scsi. Across
> different hardware and different motherboards, the drive manufacturer is the
> only obvious similarity, except that both machines are SMP: one is a dual
> P2/400, the PowerEdge is a dual P2/450. In both cases removing the Quantum
> drive stopped the problem.
>
> If anyone knows how to get my logs/configs off the Quantum drive, I'd be
> super-willing to do that - I need all the freaking files back off it too!
> (my tape backup gave me ~75% of the machine back)
>
> Any, all, and as much help as I could get would be a source of eternal
> gratitude.
>
> -S
> --
> Steven Sparks Socketware, Inc. (http://www.accucast.com)
> Guru 1776 Peachtree St. NW, Suite 500 South
> sparks@socketware.com Atlanta, GA 30308
> (404)815-1998 x15 1-877-4-ACCUCAST (422-2822)
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.rutgers.edu
> Please read the FAQ at http://www.tux.org/lkml/

--
Steven Sparks                     Socketware, Inc. (http://www.accucast.com)
Guru                              1776 Peachtree St. NW, Suite 500 South
sparks@socketware.com             Atlanta, GA 30308
(404)815-1998 x15                 1-877-4-ACCUCAST (422-2822)
