Re: Final pre-2.0.31.. Expect this to be same as real 2.0.31

Tomasz Motylewski (motyl@crds.chemie.unibas.ch)
Wed, 10 Sep 1997 08:14:52 +0200 (MET DST)


Before getting into details: the patch (buffer.c) that Dr. Werner Fink sent
yesterday to some people worked for me.

On Mon, 8 Sep 1997, Doug Ledford wrote:

> > scsi : aborting command due to timeout : pid 155159, scsi0, channel 0, id 1,lun 0 Read (6) 17 e4 d6 18 00
> > and many other relating messages
> > (http://crds.chemie.unibas.ch/machine/2.0.pre31-final/sp-17-8M-2xbonnie+badblocks)
[...]
>
> The other thing that is unusual is the Sending SDTR!! messages. Do you
> happen to have Synchronous negotiation disabled on that device in your
> Adaptec setup? If so, you should enable it, as our negotiation routines are
> much more reliable when we start the negotiation instead of the device
> starting the negotiation (this is a limitation of the amount of code space
> we have in the sequencer).

I have in SCSI BIOS configuration:

HOST adapter BIOS <enabled>
Support removable disks as fixed disks <boot only>
Extended BIOS translations for DOS drives <disabled>
Multiple LUN support <disabled>
BIOS support for bootable CDROM <enabled>
Support for Ultra SCSI speed <enabled>

Per device options (for all devices the same - default):

Initiate sync negotiation <yes>
Maximum sync transfer rate 40
enable disconnection <yes>
initiate wide negotiation <yes>
send start unit command <yes>

On both disks (WDE 4360) there is a jumper "disable target initiated
synchronous/wide negotiation" which I left open (default). That means the
disks may initiate synchronous negotiations and probably do it before the
controller.

>
> And finally, one thing a person can do that might reduce the number of SCSI
> abort messages on the aic7xxx chipsets under very heavy load, is go to
> roughly line 6723 where you should find something like:
>
> if (p->device_status[TARGET_INDEX(cmd)].commands_sent < 200)
>
> The last number in that if statement can be adjusted up or down, with
> smaller numbers providing a reduced chance that a command will timeout while
> on the SCSI bus (unless there is a bus hang, in which case this setting
> doesn't really matter). To be fairly certain nothing will ever timeout, you
> can reduce that number to around 50.
>

I first applied the patch by Leonard Zubkoff, and after that I never got the
forwarded interrupt timeout panic. All runs with
bonnie -s 100 (sda1);
badblocks -vw /dev/sda2 122880
bonnie -s 200 (sdb1)
were successful (several (2-10) SCSI timeouts and resets near the end, in the
SEEK phase, with fast recovery). Decreasing that constant (200->50) did not
change anything (once I even got the first timeout much earlier).

BUT then I wanted to add some network activity. Pinging the other host with:

ping -f otherhost or ping -f -s 1024 otherhost

during the IO run did not break anything. I was running everything with 8 MB
RAM + 130 MB swap in single-user mode, under screen (4 shells: bonnie,
bonnie, badblocks + monitoring), on an SMP kernel.

BUT pinging my host from the other host (10 Mbit Ethernet) ALWAYS caused a
crash (using "v-19", with the old value of 200 in aic7xxx.c):

first no problems for a long time
then the already known SCSI timeouts
then SCSI bus reset
followed immediately by lots of "can not allocate skb of size 64" (or 1066)
after 2-3 seconds Aiee: scheduling in interrupt 0012b118 (or 0012b8d5)
scrolling very fast (end of story, hard reset)

Using "v-20" (value of 50) and ping -f -s 1024 from otherhost, I again got a
SCSI reset, then "could not allocate skbuf of size 1066" scrolling, then
kernel panic: aicxxxx : unable to proceed with device negotiation.

and then after 20 seconds:
Aiee: scheduling in interrupt 0012b118
(hard reset)

This "unable to proceed with device negotiation" panic is raised in aic7xxx.c
in two places, in both cases because an atomic memory allocation fails.

"v-19" and "v-20" System.map:
0012aea4 T __iget
0012b0a8 t __wait_on_inode
0012b160 T get_device_list
0012b1fc t get_fops

0012b7d8 T reset_dquot_ptrs
0012b838 T __wait_on_buffer
0012b920 t sync_buffers
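
For reference, the two addresses from the "scheduling in interrupt" messages
can be matched against the System.map lines above by taking the last symbol
that starts at or below each address. A minimal Python sketch of that lookup
follows. The helper names (parse_system_map, resolve) are made up for
illustration, and the table contains only the symbols quoted in this message,
so a symbol missing from the excerpt could in principle be the real match.

```python
import bisect

# Only the System.map lines quoted in this message (address, type, name).
SYSTEM_MAP_EXCERPT = """\
0012aea4 T __iget
0012b0a8 t __wait_on_inode
0012b160 T get_device_list
0012b1fc t get_fops
0012b7d8 T reset_dquot_ptrs
0012b838 T __wait_on_buffer
0012b920 t sync_buffers
"""


def parse_system_map(text):
    """Return a list of (address, name) tuples sorted by address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        addr, _symbol_type, name = fields
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the last symbol at or below address, else None."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return name, address - start


symbols = parse_system_map(SYSTEM_MAP_EXCERPT)
for addr in (0x0012B118, 0x0012B8D5):
    name, offset = resolve(symbols, addr)
    print(f"{addr:08x} -> {name}+0x{offset:x}")
```

With the excerpt above this prints 0012b118 -> __wait_on_inode+0x70 and
0012b8d5 -> __wait_on_buffer+0x9d, i.e. both messages point into the
wait-on-inode / wait-on-buffer routines.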

And now the best part of the story. I applied the patch I got from
Werner on top of "v-20" and got "v-21". I ran two standard tests
without external ping (once without any warnings, once with only one SCSI
reset).

I also ran three tests with external ping -f -s 1024 myhost. All of these
tests completed without any problems (no SCSI timeouts). The bonnie output
showed that the output transfer rate was even slightly faster. Block
input is slightly slower, but "getc" input is faster.

--
Tomasz Motylewski