Re: [PATCH 00/71] More fixes, cleanup and modernization for NCR5380 drivers

From: Ondrej Zary
Date: Fri Nov 20 2015 - 02:35:31 EST


On Friday 20 November 2015, Finn Thain wrote:
>
> On Thu, 19 Nov 2015, Ondrej Zary wrote:
>
> > On Thursday 19 November 2015 03:24:56 Finn Thain wrote:
> >
> > > On Wed, 18 Nov 2015, Ondrej Zary wrote:
> > >
> > > >
> > > > I have some NCR5380 ISA cards and can test them.
> > >
> > > Thanks Ondrej. I've no idea which ISA drivers are presently working in
> > > mainline. Finding regressions may be more difficult than usual ;-)
> >
> > You're right... looks very broken:
> >
> > [ 62.577194] scsi host2: Generic NCR5380/NCR53C400 SCSI, io_port 0x240, n_io_port 16, base 0x0, irq 0, can_queue 16, cmd_per_lun 2,
> > sg_tablesize 128, this_id 7, flags { DTC3181E NO_PSEUDO_DMA }, USLEEP_POLL 3, USLEEP_WAITLONG 1250, options { AUTOPROBE_IRQ PSEUDO_DMA
> > NCR53C400 }
> > [ 62.796635] scsi 2:0:0:0: Direct-Access IBM 0663 e PQ: 0 ANSI: 2
> > [ 63.878494] sd 2:0:0:0: Attached scsi generic sg1 type 0
> > [ 95.848260] sd 2:0:0:0: aborting command
> >
> > And the system hangs completely.
> >
>
> Yes. That was the usual failure mode. The old EH abort routine is fatal.
> Up until I disabled PDMA by default for mac_scsi (in v3.19), that driver
> would do the same thing.
>
> > It's much better with your patches, but still not great :)
> >
>
> Pleased to hear it :)
>
> > [ 93.963264] pnp 01:01.00: [io 0x0240-0x025f]
> > [ 93.963493] pnp 01:01.00: [irq 5]
> > [ 93.965768] pnp 01:01.00: activated
> > [ 93.977147] scsi host2: Generic NCR5380/NCR53C400 SCSI, io_port 0x240, n_io_port 16, base 0x0, irq 0, can_queue 16, cmd_per_lun 2,
> > sg_tablesize 128, this_id 7, flags { DTC3181E NO_PSEUDO_DMA }, options { AUTOPROBE_IRQ PSEUDO_DMA }
> > [ 93.987527] scsi host2: rejecting message
> > [ 93.987647] Synchronous Data Transfer Request period = 100 ns offset = 12
> > [ 94.001219] scsi 2:0:0:0: Direct-Access IBM 0663 e PQ: 0 ANSI: 2
> > [ 113.000794] sd 2:0:0:0: Attached scsi generic sg1 type 0
>
> I'd be interested to know what commands were in play in that 19 second
> interval. Might need to use scsi_logging_level to figure that out.
>
> My tests involved 3 different scsi targets (two disks and a CD-ROM) but
> none of these send a SDTR. Your log says the driver correctly rejected the
> SDTR message but that doesn't mean the target actually went to MSG IN
> phase and got the message. Do you have any older targets you can test?

Yes, I have some older disks too and also CD-ROMs. This one was just handy in
an external enclosure (the card has only an external DB25 connector). It can
be opened easily so I'll test the other devices too.

> > [ 144.852432] sd 2:0:0:0: [sdb] Unit Not Ready
> > [ 144.852574] sd 2:0:0:0: [sdb] Sense Key : Aborted Command [current]
> > [ 144.852713] sd 2:0:0:0: [sdb] Add. Sense: Select or reselect failure
>
> AFAIK, the target should not have to abort any commands. Moreover, the
> target should never experience a select/reselect failure, because you have
> irq == 0 (see above) and that implies that the target is never permitted
> the disconnect privilege.
>
> > [ 240.108292] INFO: task modprobe:1957 blocked for more than 120 seconds.
> > [ 240.108418] Not tainted 4.3.0-rc1+ #74
>
> Why not use v4.3?

I had that already built so just quickly applied the patches and tested. I have
to update the git tree anyway as ACPI is broken.

> > [ 240.108501] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 240.108597] modprobe D 0000001a 0 1957 1950 0x00000000
> > [ 240.108790] ce0fad00 00000086 53881781 0000001a c1525f88 4edbe39c 0000001a 04ac33e5
> > [ 240.109246] 00000000 ccd54000 ffffffff ffffffff d204b280 c139c504 00000000 c104416d
> > [ 240.109699] 00000000 ce0fad00 c1054a45 c151fd8c c151fd8c d204b280 00000000 ccd6d100
> > [ 240.110156] Call Trace:
> > [ 240.110295] [<c139c504>] ? schedule+0x5b/0x67
> > [ 240.110430] [<c104416d>] ? async_synchronize_cookie_domain+0x73/0x9f
> > [ 240.110569] [<c1054a45>] ? abort_exclusive_wait+0x6e/0x6e
> > [ 240.110699] [<c10ac9bc>] ? do_init_module+0xa4/0x1a3
> > [ 240.110824] [<c107ddb5>] ? load_module+0x14de/0x18ca
> > [ 240.110948] [<c107e2a0>] ? SyS_finit_module+0x47/0x56
> > [ 240.111068] [<c139e2c0>] ? sysenter_do_call+0x12/0x12
>
> Not sure what module was being probed here. I presume it was g_NCR5380 or
> g_NCR5380_mmio. Neither of these calls 'scsi_scan_host'. I'm not sure what
> the implications are (?)

It was g_NCR5380 (DTCT-436P card).

> > [ 240.852458] sd 2:0:0:0: [sdb] Read Capacity(10) failed: Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_SENSE
> > [ 240.852620] sd 2:0:0:0: [sdb] Sense Key : Aborted Command [current]
> > [ 240.852760] sd 2:0:0:0: [sdb] Add. Sense: Select or reselect failure
> > [ 272.852471] sd 2:0:0:0: [sdb] Write Protect is off
> > [ 272.852614] sd 2:0:0:0: [sdb] Mode Sense: 00 00 00 00
> > [ 304.084452] sd 2:0:0:0: [sdb] Asking for cache data failed
> > [ 304.084592] sd 2:0:0:0: [sdb] Assuming drive cache: write through
>
> This looks like nonsense to me ... I don't think the target actually
> aborted the reselection phase of a read capacity command. I'm out of ideas
> here. Can anyone else make sense of this?


--
Ondrej Zary
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/