Re: [PATCH rc8-mm1] hotfix libata-scsi corruption

From: Hugh Dickins
Date: Tue Jan 22 2008 - 13:50:30 EST


On Tue, 22 Jan 2008, James Bottomley wrote:
> > --- 2.6.24-rc8-mm1/drivers/ata/libata-scsi.c 2008-01-17 16:49:47.000000000 +0000
> > +++ linux/drivers/ata/libata-scsi.c 2008-01-22 15:45:40.000000000 +0000
> > @@ -826,7 +826,7 @@ static void ata_scsi_sdev_config(struct
> >  	sdev->max_device_blocked = 1;
> >
> >  	/* set the min alignment */
> > -	blk_queue_update_dma_alignment(sdev->request_queue, ATA_DMA_PAD_SZ - 1);
> > +	blk_queue_update_dma_alignment(sdev->request_queue, ATA_SECT_SIZE - 1);
> >  }
> >
> > static void ata_scsi_dev_config(struct scsi_device *sdev,
>
> Unfortunately, that's likely not the entire hot fix ... the implication
> is that we have some mapping error in the way we do direct SG_IO.

Quite possibly, I'm not sure.

> What the fix you propose does is make it far more likely that block will
> copy, perform I/O then uncopy (almost certain, since most smartd data
> transfers are well under ATA_SECT_SIZE, which is 512). However,
> implicating a generic path like this implies that we would get the same
> problem for SCSI commands as well, so the correct hot fix is below.

I've not noticed any problems from the normal activity of the system,
only from smartd's sg_ioctl. My impression was that it's a libata
issue, because it's going through ata_pio_sector, which does

	ap->ops->data_xfer(qc->dev, buf + offset, qc->sect_size, do_write);

referring to sect_size, without considering the possibility of any smaller
I/O size. (Me, I don't even know why it's going PIO rather than DMA:
I'm assuming smartd does things that way, but there's no limit to my
ignorance here.)
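
A minimal sketch of that concern, in plain userspace C rather than libata
code. The names DEMO_SECT_SIZE, overrun_bytes and xfer_clamped are
hypothetical, and the clamping shown is only an illustration of the
failure mode, not the fix proposed in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SECT_SIZE 512u	/* assumption: mirrors ATA_SECT_SIZE */

/*
 * Bytes that a fixed DEMO_SECT_SIZE transfer would write past the end
 * of a destination buffer that is only buf_len bytes long.
 */
size_t overrun_bytes(size_t buf_len)
{
	return buf_len < DEMO_SECT_SIZE ? DEMO_SECT_SIZE - buf_len : 0;
}

/*
 * Copy at most dst_len bytes of one sector into dst, so a request
 * smaller than a sector never touches memory beyond its own buffer.
 * Returns the number of bytes actually copied.
 */
size_t xfer_clamped(unsigned char *dst, size_t dst_len,
		    const unsigned char *sector)
{
	size_t n = dst_len < DEMO_SECT_SIZE ? dst_len : DEMO_SECT_SIZE;

	assert(dst != NULL && sector != NULL);
	memcpy(dst, sector, n);
	return n;
}
```

With a 256-byte destination, overrun_bytes() reports 256 bytes written
beyond the buffer for an unclamped sector-sized transfer.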

> However, I'd like to see if we can track the problem through the SG_IO
> direct path ... how many adjacent page bytes are corrupt? Just a few or
> a large number (I'm wondering if it's an off by one or off by alignment
> type bug)?

I've assumed it's just the one next page: because ata_pio_sector is
doing a data_xfer of sect_size (ATA_SECT_SIZE, 512 bytes) to an offset
above 0xe00 in the smartd stack page. The one time I actually saw
corruption rather than an oops at startup, it was in a tmpfs swap vector
page while running a 64-bit kernel, and I didn't examine any further
pages (I just checked the page before and matched it up to smartd's
stack, already suspecting that).
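
The arithmetic behind "just the one next page" can be sketched as
follows, assuming 4096-byte pages (as on x86). DEMO_PAGE_SIZE and
spill_into_next_page are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Bytes of a len-byte transfer, starting at in-page offset off, that
 * land beyond the end of the page containing the start address.
 */
size_t spill_into_next_page(size_t off, size_t len)
{
	size_t end;

	assert(off < DEMO_PAGE_SIZE);
	end = off + len;
	return end > DEMO_PAGE_SIZE ? end - DEMO_PAGE_SIZE : 0;
}
```

Since 0xe00 + 512 is exactly 4096, a 512-byte transfer starting at 0xe00
just fits, and any higher starting offset spills over. The spill is at
most 511 bytes, which is less than one page, so only the immediately
following page can be hit.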

I don't believe it's an off-by-one at your SCSI end.

Hugh