Re: [PATCH AUTOSEL 4.14 6/6] ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()

From: Niklas Cassel
Date: Thu Sep 28 2023 - 12:00:36 EST


Hello Sasha,

The upstream commit that you are backporting here:
80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()")

Requires that we have:
737dd811a3db ("ata: libahci: clear pending interrupt status")
in stable. This is currently not the case.

See my comment in:
https://lore.kernel.org/stable/ZRWf7Avtdv3DeqCU@x1-carbon/T/#t

Perhaps both:
93c7711494f4 ("ata: ahci: Drop pointless VPRINTK() calls and convert the remaining ones")
and
737dd811a3db ("ata: libahci: clear pending interrupt status")

Could be backported to stable.
That way,
80cc944eca4f ("ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()")
should be safe to backport to stable as well.


Kind regards,
Niklas


On Sun, Sep 24, 2023 at 09:20:49AM -0400, Sasha Levin wrote:
> From: Niklas Cassel <niklas.cassel@xxxxxxx>
>
> [ Upstream commit 80cc944eca4f0baa9c381d0706f3160e491437f2 ]
>
> ata_scsi_port_error_handler() starts off by clearing ATA_PFLAG_EH_PENDING,
> before calling ap->ops->error_handler() (without holding the ap->lock).
>
> If an error IRQ is received while ap->ops->error_handler() is running,
> the irq handler will set ATA_PFLAG_EH_PENDING.
>
> Once ap->ops->error_handler() returns, ata_scsi_port_error_handler()
> checks if ATA_PFLAG_EH_PENDING is set, and if it is, another iteration
> of ATA EH is performed.
>
> The problem is that ATA_PFLAG_EH_PENDING is not only cleared by
> ata_scsi_port_error_handler(), it is also cleared by ata_eh_reset().
>
> ata_eh_reset() is called by ap->ops->error_handler(). This additional
> clearing done by ata_eh_reset() breaks the whole retry logic in
> ata_scsi_port_error_handler(). Thus, if an error IRQ is received while
> ap->ops->error_handler() is running, the port will currently remain
> frozen and will never get re-enabled.
>
> The additional clearing in ata_eh_reset() was introduced in commit
> 1e641060c4b5 ("libata: clear eh_info on reset completion").
>
> Looking at the original error report:
> https://marc.info/?l=linux-ide&m=124765325828495&w=2
>
> We can see the following happening:
> [ 1.074659] ata3: XXX port freeze
> [ 1.074700] ata3: XXX hardresetting link, stopping engine
> [ 1.074746] ata3: XXX flipping SControl
>
> [ 1.411471] ata3: XXX irq_stat=400040 CONN|PHY
> [ 1.411475] ata3: XXX port freeze
>
> [ 1.420049] ata3: XXX starting engine
> [ 1.420096] ata3: XXX rc=0, class=1
> [ 1.420142] ata3: XXX clearing IRQs for thawing
> [ 1.420188] ata3: XXX port thawed
> [ 1.420234] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>
> We are not supposed to be able to receive an error IRQ while the port is
> frozen (PxIE is set to 0, i.e. all IRQs for the port are disabled).
>
> AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) states:
> "Each bit location can be thought of as reporting a '1' if the virtual
> "interrupt line" for that port is indicating it wishes to generate an
> interrupt. That is, if a port has one or more interrupt status bit set,
> and the enables for those status bits are set, then this bit shall be set."
>
> Additionally, AHCI state P:ComInit clearly shows that the state machine
> will only jump to P:ComInitSetIS (which sets IS.IPS(x) to '1'), if PxIE.PCE
> is set to '1'. In our case, PxIE is set to 0, so IS.IPS(x) won't get set.
>
> So IS.IPS(x) only gets set if PxIS and PxIE is set.
>
> AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) also states:
> "The bits in this register are read/write clear. It is set by the level of
> the virtual interrupt line being a set, and cleared by a write of '1' from
> the software."
>
> So if IS.IPS(x) is set, you need to explicitly clear it by writing a 1 to
> IS.IPS(x) for that port.
>
> Since PxIE is cleared, the only way to get an interrupt while the port is
> frozen, is if IS.IPS(x) is set, and the only way IS.IPS(x) can be set when
> the port is frozen, is if it was set before the port was frozen.
>
> However, since commit 737dd811a3db ("ata: libahci: clear pending interrupt
> status"), we clear both PxIS and IS.IPS(x) after freezing the port, but
> before the COMRESET, so the problem that commit 1e641060c4b5 ("libata:
> clear eh_info on reset completion") fixed can no longer happen.
>
> Thus, revert commit 1e641060c4b5 ("libata: clear eh_info on reset
> completion"), so that the retry logic in ata_scsi_port_error_handler()
> works once again. (The retry logic is still needed, since we can still
> get an error IRQ _after_ the port has been thawed, but before
> ata_scsi_port_error_handler() takes the ap->lock in order to check
> if ATA_PFLAG_EH_PENDING is set.)
>
> Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
> drivers/ata/libata-eh.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index cbe9af624a06f..8a789de056807 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -2948,18 +2948,11 @@ int ata_eh_reset(struct ata_link *link, int classify,
> postreset(slave, classes);
> }
>
> - /*
> - * Some controllers can't be frozen very well and may set spurious
> - * error conditions during reset. Clear accumulated error
> - * information and re-thaw the port if frozen. As reset is the
> - * final recovery action and we cross check link onlineness against
> - * device classification later, no hotplug event is lost by this.
> - */
> + /* clear cached SError */
> spin_lock_irqsave(link->ap->lock, flags);
> - memset(&link->eh_info, 0, sizeof(link->eh_info));
> + link->eh_info.serror = 0;
> if (slave)
> - memset(&slave->eh_info, 0, sizeof(link->eh_info));
> - ap->pflags &= ~ATA_PFLAG_EH_PENDING;
> + slave->eh_info.serror = 0;
> spin_unlock_irqrestore(link->ap->lock, flags);
>
> if (ap->pflags & ATA_PFLAG_FROZEN)
> --
> 2.40.1
>