Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

From: Damien Le Moal
Date: Wed Jul 26 2023 - 19:39:18 EST


On 7/26/23 22:47, Thorsten Leemhuis wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> On 26.07.23 13:54, TW wrote:
>> I have been having issues with the 6.x series of kernels resuming from
>> suspend with one of my drives. Far as I can tell it has trouble with the
>> cache on the drive when coming out of s3 sleep. Tried a few different
>> distros (Manjaro, OpenMandriva Rome, EndeavourOS) all that give the same
>> error message. It appears to work fine on the 5.15 kernel just fine
>> however.
>>
>> This is the error or errors that I have been getting and assume has been
>> holding up the system from resuming from suspend.
>>
>> Jul 20 04:13:41 rageworks kernel: ata10.00: device reported invalid CHS sector 0
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command

This sense data is garbage. The issue has been reported before, but it is
hard to deal with because it seems to be caused by drives/adapters not
reporting status bits correctly. So for now, let's ignore these sense codes.

The start/stop unit failure is weird. In another case, I suspect that
this command is causing a delay on resume, but not an error like this.

>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: failed to resume async: error -5
>
> Thx for your report. I CCed a few people, with a bit of luck they have
> an idea. But I doubt it. If no one replies you likely will need a
> bisection to find the root of the problem. But before going down that
> route you want to check if latest mainline kernel (vanilla!) works better.
>
> FWIW, this is not my area of expertise, so the following might be a
> misleading comment, but the problem looks somewhat similar to this one
> that iirc was never solved:
> https://bugzilla.kernel.org/show_bug.cgi?id=216087
>
>> Jul 20 04:12:51 rageworks systemd[1]: nvidia-suspend.service: Deactivated successfully.
>> Jul 20 04:12:51 rageworks systemd[1]: Finished NVIDIA system suspend actions.
>> Jul 20 04:12:51 rageworks systemd[1]: Starting System Suspend...
>
> That sounds like you are using out-of tree drivers which can cause all
> sorts of issues. Please recheck if the problem happens without those as
> well and do not use them in all further tests to debug the issue.

Yes. Please retest with the latest 6.5-rc3.

And can you try this patch to see if it solves your issue?

commit 29e81d11812ee924d19425343ec69acd34af9d35
Author: Damien Le Moal <dlemoal@xxxxxxxxxx>
Date: Mon Jul 24 13:23:14 2023 +0900

ata,scsi: do not issue START STOP UNIT on resume

Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 370d18aca71e..6184c7bcc16c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1100,7 +1100,13 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		}
 	} else {
 		sdev->sector_size = ata_id_logical_sector_size(dev->id);
+		/*
+		 * Stop the drive on suspend but do not issue START STOP UNIT
+		 * on resume as this is not necessary: the port is reset on
+		 * resume, which wakes up the drive.
+		 */
 		sdev->manage_start_stop = 1;
+		sdev->no_start_on_resume = 1;
 	}

/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 68b12afa0721..b8584fe3123e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
-	int ret;
+	int ret = 0;

 	if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */
 		return 0;
@@ -3885,7 +3885,8 @@ static int sd_resume(struct device *dev)
 		return 0;

 	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-	ret = sd_start_stop_device(sdkp, 1);
+	if (!sdkp->device->no_start_on_resume)
+		ret = sd_start_stop_device(sdkp, 1);
 	if (!ret)
 		opal_unlock_from_suspend(sdkp->opal_dev);
 	return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 75b2235b99e2..b9230b6add04 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -194,6 +194,7 @@ struct scsi_device {
 	unsigned no_start_on_add:1; /* do not issue start on add */
 	unsigned allow_restart:1; /* issue START_UNIT in error handler */
 	unsigned manage_start_stop:1; /* Let HLD (sd) manage start/stop */
+	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
 	unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT */
 	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
 	unsigned select_no_atn:1;
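
For anyone who wants to reason about the change without building a kernel,
here is a minimal userspace sketch of the control flow the patch is aiming
for. It is not kernel code: struct model_sdev, model_start_stop_device() and
model_sd_resume() are hypothetical stand-ins, the start_cmd_fails knob only
simulates a drive that rejects the command, and the manage_start_stop check
is assumed to be what sits between the two sd.c hunks. The -5 in the
reported log is -EIO on Linux, which is what the simulated failure returns.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant bits of struct scsi_device. */
struct model_sdev {
	bool manage_start_stop;   /* sd manages START STOP UNIT for this device */
	bool no_start_on_resume;  /* flag added by the patch */
	bool start_cmd_fails;     /* simulation knob: drive rejects the command */
	int start_cmds_issued;    /* number of simulated START STOP UNIT commands */
};

/* Simulates issuing START STOP UNIT with start=1: returns 0 or -EIO. */
static int model_start_stop_device(struct model_sdev *sdev)
{
	sdev->start_cmds_issued++;
	return sdev->start_cmd_fails ? -EIO : 0;
}

/* Mirrors the control flow of sd_resume() with the patch applied. */
static int model_sd_resume(struct model_sdev *sdev)
{
	int ret = 0;

	if (!sdev)
		return 0;
	/* Assumed: the check that precedes the second hunk in sd.c. */
	if (!sdev->manage_start_stop)
		return 0;

	/* With the flag set, the command is skipped and ret stays 0. */
	if (!sdev->no_start_on_resume)
		ret = model_start_stop_device(sdev);
	return ret;
}
```

In this model, a device with start_cmd_fails set makes model_sd_resume()
return -EIO while no_start_on_resume is clear, and 0 with no command issued
once the flag is set. Whether the real drive then wakes up correctly from the
port reset alone is exactly what testing the patch should tell us.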


--
Damien Le Moal
Western Digital Research