Re: [PATCH] scsi: ata: Fix a race condition between scsi error handler and ahci interrupt

From: Li Nan
Date: Mon Aug 14 2023 - 09:21:42 EST

Next message: Willy Tarreau: "Re: [PATCH v5] tools/nolibc: fix up size inflate regression"
Previous message: Karol Herbst: "Re: 2b5d1c29f6c4 ("drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts")"
In reply to: Damien Le Moal: "Re: [PATCH] scsi: ata: Fix a race condition between scsi error handler and ahci interrupt"
Next in thread: Damien Le Moal: "Re: [PATCH] scsi: ata: Fix a race condition between scsi error handler and ahci interrupt"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2023/8/14 15:50, Damien Le Moal wrote:

On 8/14/23 15:41, Li Nan wrote:

This is definitely not correct because EH may have been scheduled for a non
fatal action like a device revalidate or to get sense data for successful
commands. With this change, the port will NOT be frozen when a hard error IRQ
comes while EH is waiting to start, that is, while EH waits for all commands to
complete first.
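
The objection above can be illustrated with a toy model. This is only a sketch: it assumes the change amounts to skipping the freeze whenever EH is already pending (the patch itself is not quoted in this message), and every name in it (toy_port, schedule_nonfatal_eh, and the two IRQ handlers) is hypothetical rather than a libata symbol.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not libata's data structures. */
struct toy_port {
	bool eh_pending;	/* EH is scheduled and waiting to start */
	bool frozen;		/* port stays frozen until EH deals with it */
};

/* EH scheduled for a non-fatal action (e.g. revalidate): no freeze. */
static void schedule_nonfatal_eh(struct toy_port *p)
{
	p->eh_pending = true;
}

/* Policy A: a hard error IRQ always freezes the port. */
static void hard_error_irq_always_freeze(struct toy_port *p)
{
	p->frozen = true;
	p->eh_pending = true;
}

/* Policy B (assumed shape of the change): skip the freeze if EH is pending. */
static void hard_error_irq_skip_if_pending(struct toy_port *p)
{
	if (p->eh_pending)
		return;		/* the hard error is not acted on here */
	p->frozen = true;
	p->eh_pending = true;
}
```

Under policy B, a port with a pending non-fatal EH is left unfrozen when the hard error IRQ arrives, which is the case described above.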

Yeah, we should find a better way to fix it. Do you have any suggestions?

Furthermore, if you get an IRQ that requires the port to be frozen, it means
that you had a failed command. In that case, the drive is in error state per
ATA specs and stops all communication until a read log 10h command is issued.
So you should never ever see 2 error IRQs one after the other. If you do, it
very likely means that you have buggy hardware.

How do you get into this situation ? What adapter and disk are you using ?

> How do you get into this situation ?
The first IRQ is an I/O error, the second IRQ is a disk link flash break.

What does "link flash break" mean ?

> What adapter and disk are you using ?
It is a disk developed by our company, but we think the same issue
exists when using other disks.

As I said, I find this situation highly suspect because if the first IRQ was to
signal an IO error that the drive reported, then per ATA specifications, the
drive should be in error mode and should NOT have transmitted any other FIS
after the SDB FIS that signaled the error. Nothing at all should come after that
error SDB FIS, until the host issues a read log 10h to get the drive out of
error state.
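
The expected device behavior described above can be sketched as a small state machine. This is a simplified model of the description in this mail, not spec text: the names (toy_dev and its helpers) are hypothetical, and the FIS traffic of the read log command itself is deliberately left out.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, heavily simplified. */
struct toy_dev {
	bool in_error;	/* device is in error state */
	int fis_sent;	/* FISes the device has put on the link */
};

/* Device tries to send a FIS; returns true if one is transmitted. */
static bool toy_dev_send_fis(struct toy_dev *d)
{
	if (d->in_error)
		return false;	/* silent while in error state */
	d->fis_sent++;
	return true;
}

/* A command fails: the device sends one error SDB FIS, then goes silent. */
static void toy_dev_report_error(struct toy_dev *d)
{
	d->fis_sent++;
	d->in_error = true;
}

/* Host issues read log 10h: the device leaves error state. */
static void toy_host_read_log_10h(struct toy_dev *d)
{
	d->in_error = false;
}
```

In this model a second FIS, and therefore a second error IRQ, cannot follow the error SDB FIS unless the host has issued read log 10h in between.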

If this is a prototype device, I would recommend that you take an ATA bus trace
and verify the FIS traffic. Something fishy is going on with the drive in my
opinion.

Thank you for your patient explanation. I'm sorry I didn't explain the
problem clearly before. After discussing with my colleagues who know
more about drivers, let me re-describe the problem.

The problem occurs when the SATA link is disconnected and reconnected in
quick succession. For example, while the error handling thread is
processing an I/O error, the disk is manually removed and reinserted, and
the AHCI chip reports a hot plug interrupt.

This scenario is not just an NCQ error: the disk is removed and quickly
reinserted before error handling has completed. The disk status then needs
to be restored once error handling is complete.

--
Thanks,
Nan
