Re: [PATCH] scsi: ata: Fix a race condition between scsi error handler and ahci interrupt

From: Li Nan
Date: Mon Aug 14 2023 - 02:42:43 EST



On 2023/8/10 10:49, Damien Le Moal wrote:
On 8/10/23 10:48, linan666@xxxxxxxxxxxxxxx wrote:
From: Li Nan <linan122@xxxxxxxxxx>


Please explain the problem first instead of starting with a function call
timeline, which cannot be analyzed without an explanation.

interrupt                               scsi_eh

ahci_error_intr
  =>ata_port_freeze
    =>__ata_port_freeze
      =>ahci_freeze (turn IRQ off)
    =>ata_port_abort
      =>ata_port_schedule_eh
        =>shost->host_eh_scheduled++;
        host_eh_scheduled = 1
                                        scsi_error_handler
                                          =>ata_scsi_error
                                            =>ata_scsi_port_error_handler
                                              =>ahci_error_handler
                                              . =>sata_pmp_error_handler
                                              .   =>ata_eh_thaw_port
                                              .     =>ahci_thaw (turn IRQ on)
ahci_error_intr                               .
  =>ata_port_freeze                           .
    =>__ata_port_freeze                       .
      =>ahci_freeze (turn IRQ off)            .
    =>ata_port_abort                          .
      =>ata_port_schedule_eh                  .
        =>shost->host_eh_scheduled++;         .
        host_eh_scheduled = 2                 .
                                              =>ata_std_end_eh
                                                =>host->host_eh_scheduled = 0;

'host_eh_scheduled' is now 0, so the SCSI EH thread will not be scheduled
again, and the ATA port remains frozen and will never be re-enabled.
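
To make the interleaving above easier to follow, here is a minimal user-space
C model of the counter handling. It is only a sketch: struct model_port and
the model_* helpers are hypothetical stand-ins that mirror the kernel names in
the timeline, they are not libata code, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; the names only mirror the kernel ones. */
struct model_port {
	bool frozen;           /* stands in for the frozen port (IRQ off) */
	int host_eh_scheduled; /* stands in for shost->host_eh_scheduled */
};

/* Models ahci_error_intr -> ata_port_freeze: turn IRQ off, request EH. */
static void model_error_irq(struct model_port *p)
{
	p->frozen = true;
	p->host_eh_scheduled++;
}

/* Models ata_eh_thaw_port -> ahci_thaw: IRQ is turned back on. */
static void model_thaw(struct model_port *p)
{
	p->frozen = false;
}

/* Models ata_std_end_eh: the counter is cleared unconditionally. */
static void model_end_eh(struct model_port *p)
{
	p->host_eh_scheduled = 0;
}

/* Replays the interleaving from the timeline and returns the final state. */
static struct model_port replay_race(void)
{
	struct model_port p = { .frozen = false, .host_eh_scheduled = 0 };

	model_error_irq(&p);               /* first error IRQ, counter = 1 */
	model_thaw(&p);                    /* EH thaws the port */
	model_error_irq(&p);               /* second error IRQ during EH */
	assert(p.host_eh_scheduled == 2);  /* second EH request is recorded */
	model_end_eh(&p);                  /* ...and wiped when EH ends */
	return p;
}
```

In this model replay_race() ends with the port frozen and the counter at 0,
which is the stuck state described above: nothing is left to wake EH again.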

If the EH thread is already running, there is no need to freeze the port and
schedule EH again.

Reported-by: luojian <luojian5@xxxxxxxxxx>
Signed-off-by: Li Nan <linan122@xxxxxxxxxx>
---
drivers/ata/libahci.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index e2bacedf28ef..0dfb0b807324 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -1840,9 +1840,17 @@ static void ahci_error_intr(struct ata_port *ap, u32 irq_stat)
 	/* okay, let's hand over to EH */
-	if (irq_stat & PORT_IRQ_FREEZE)
+	if (irq_stat & PORT_IRQ_FREEZE) {
+		/*
+		 * EH already running, this may happen if the port is
+		 * thawed in the EH. But we cannot freeze it again
+		 * otherwise the port will never be thawed.
+		 */
+		if (ap->pflags & (ATA_PFLAG_EH_PENDING |
+		    ATA_PFLAG_EH_IN_PROGRESS))
+			return;

This is definitely not correct because EH may have been scheduled for a
non-fatal action like a device revalidate or to get sense data for successful
commands. With this change, the port will NOT be frozen when a hard error IRQ
comes while EH is waiting to start, that is, while EH waits for all commands to
complete first.
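
The case described here can be sketched with a small user-space C model of the
patched branch. This is an illustration only: the MODEL_* flag values and both
helpers are hypothetical and are not the kernel's ATA_PFLAG_* definitions or
the real ahci_error_intr().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only. */
enum {
	MODEL_EH_PENDING     = 1 << 0,
	MODEL_EH_IN_PROGRESS = 1 << 1,
	MODEL_FROZEN         = 1 << 2,
};

/*
 * Models the patched PORT_IRQ_FREEZE branch: skip the freeze when EH is
 * pending or in progress. Returns true if the port was frozen.
 */
static bool model_patched_freeze_irq(unsigned int *pflags)
{
	if (*pflags & (MODEL_EH_PENDING | MODEL_EH_IN_PROGRESS))
		return false;
	*pflags |= MODEL_FROZEN;
	return true;
}

/*
 * EH was scheduled for a non-fatal action (for example a revalidate) and
 * has not started yet; then a hard error IRQ arrives.
 */
static bool hard_error_while_eh_pending(void)
{
	unsigned int pflags = MODEL_EH_PENDING;
	bool frozen = model_patched_freeze_irq(&pflags);

	/* The return value and the flag must agree. */
	assert(frozen == ((pflags & MODEL_FROZEN) != 0));
	return frozen;
}
```

In this model hard_error_while_eh_pending() returns false: the hard error is
seen, but the port is left unfrozen because EH was already pending for an
unrelated, non-fatal reason.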


Yeah, we should find a better way to fix it. Do you have any suggestions?

Furthermore, if you get an IRQ that requires the port to be frozen, it means
that you had a failed command. In that case, the drive is in error state per
ATA specs and stops all communication until a read log 10h command is issued.
So you should never ever see 2 error IRQs one after the other. If you do, it
very likely means that you have buggy hardware.

How do you get into this situation? What adapter and disk are you using?


> How do you get into this situation?
The first IRQ is an I/O error; the second IRQ is caused by a momentary break
of the disk link.

> What adapter and disk are you using?
It is a disk developed by our company, but we think the same issue can occur
with other disks.

 		ata_port_freeze(ap);
-	else if (fbs_need_dec) {
+	} else if (fbs_need_dec) {
 		ata_link_abort(link);
 		ahci_fbs_dec_intr(ap);
 	} else


--
Thanks,
Nan