Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

From: Bjorn Helgaas
Date: Mon Feb 07 2022 - 11:45:11 EST


On Sat, Feb 05, 2022 at 09:30:07AM -0800, Kenneth R. Crudup wrote:
> > > If you'd like, I could try re-applying the previous problem
> > > commit or your attempted fix on top of Linus' master if you'd
> > > like to see if something was fixed somewhere else in the PCIe
> > > subsystem, but if you think it's not worthwhile I'm satisfied
> > > with the current fix (or probably more-exactly for my particular
> > > machine, lack of regression).
>
> On Sat, 5 Feb 2022, Vidya Sagar wrote:
>
> > That would be a good starting point to understand it better. In fact if the
> > previous problematic patch works fine on master, then, we are sure that
> > something in the sub-system would have fixed the issue.
>
> So this is my report of the regression I'd found with Bjorn's original commit:
> ----
> Date: Fri, 25 Dec 2020 16:38:56
> From: Kenneth R. Crudup <kenny@xxxxxxxxx>
> To: vidyas@xxxxxxxxxx
> Cc: bhelgaas@xxxxxxxxxx
> Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures
>
> I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
> this commit in place, after resuming from hibernate my machine is essentially
> useless, with a torrent of disk I/O errors on my NVMe device (at least, and
> possibly other devices affected) until a reboot.
>
> I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
> on battery.
>
> Let me know if you need more information.
> ----
>
> I just reapplied it on top of Linus' master and not only did it go
> in cleanly(!), NOW I'm not getting any issues after a
> suspend/resume.

So on 12/25/2020 (just before v5.11-rc1), you saw I/O errors after
resume from hibernate, and you apparently went to the trouble to
bisect it to 4257f7e008ea ("PCI/ASPM: Save/restore L1SS Capability for
suspend/resume").

We reverted 4257f7e008ea, and the revert appeared in v5.11-rc7.

I assume you re-applied 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
Capability for suspend/resume") on top of something between v5.17-rc2
and v5.17-rc3, and you don't see those I/O errors.

It's possible something was fixed elsewhere between v5.11-rc1 and
v5.17-rc2, but I'm not really convinced by that theory.

I think it's more likely that something changed in BIOS settings or
other configuration, which means other people may trip over it even if
you don't see it. Do you remember any BIOS updates, BIOS setup
tweaks, hardware changes, kernel parameter changes, etc.?

If the problem really was fixed by some change elsewhere, it *should*
still happen on v5.11-rc1. I think we should verify that and try to
figure out what the other change was. People who want to backport the
L1SS save/restore will need to know that anyway.
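
A rough sketch of how that check and a follow-up "find the fix" bisect
might be scripted. This is only an illustration, not a tested
procedure: the helper names are made up, and it assumes a mainline
clone with release tags and that 4257f7e008ea is re-applied by hand at
any bisect step that no longer contains it.

```python
# Sketch only: builds the git commands for a "find the fix" bisect.
# Assumptions (not from this thread): the tree is a mainline clone with
# release tags, and the tester re-applies 4257f7e008ea by hand wherever
# a checked-out revision no longer carries it.

COMMIT = "4257f7e008ea"   # L1SS save/restore commit discussed above
BROKEN = "v5.11-rc1"      # expected to still show the I/O errors
FIXED = "v5.17-rc2"       # tree where the re-applied patch behaves


def verify_step(tag=BROKEN, commit=COMMIT):
    """Return commands to check out `tag` and test whether it has `commit`."""
    return [
        ["git", "checkout", tag],
        # Exit status 0 means `commit` is an ancestor of `tag`.
        ["git", "merge-base", "--is-ancestor", commit, tag],
    ]


def bisect_plan(broken=BROKEN, fixed=FIXED):
    """Return git commands that bisect for the first *fixed* revision."""
    return [
        # Custom terms keep the direction readable: the search is for
        # the change that fixed things, not for one that broke them.
        ["git", "bisect", "start", "--term-old=broken", "--term-new=fixed"],
        ["git", "bisect", "broken", broken],
        ["git", "bisect", "fixed", fixed],
    ]


if __name__ == "__main__":
    for cmd in verify_step() + bisect_plan():
        print(" ".join(cmd))
```

If v5.11-rc1 no longer reproduces the I/O errors, the bisect is
pointless and the BIOS/configuration theory becomes more likely.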

Bjorn