Re: Bug: After a 'warm' reboot the disk is missing (not detected by the bios) on a HP t640

From: Sean Christopherson
Date: Tue Jan 02 2024 - 12:59:05 EST
On Thu, Dec 28, 2023, Ben Mesman | Spark Narrowcasting wrote:
> > Please don't send private mails.  Kudos for using get_maintainer.pl, but a demerit
> > for not Cc'ing the mailing lists :-)
> >
> > https://people.kernel.org/tglx/notes-about-netiquette
>
> Definitely saving that in my URLs-cache. Might need it again in a few years :-)
> (last time I needed it was about 10 years ago)
>
> > > I recently started upgrading some of my remote managed thin-clients from a
> > > 5.15.x kernel to a 6.1.x kernel. When rebooting with the new(er) kernel, the
> > > HP t640 clients failed. The problem is that after the warm reboot, the BIOS
> > > is unable to locate the internal storage (so it can't boot a valid OS).
> > >
> > > With some digging around I found that adding "reboot=p" will solve the
> > > problem, but because the systems are remote managed, I am unable to add this
> > > boot-parameter in any straightforward way.
> [snip]
> > I'm not familiar with this code (I'm not actually a maintainer/reviewer for this
> > code; by default get_maintainer.pl Cc's people who have recently modified the
> > file in question), but this looks like a hack to work around a bug elsewhere.
> >
> > All of these quirks are obviously workarounds for some kind of bug, but AFAICT
> > the quirks are to workaround hardware or firmware bugs, not kernel bugs.  Since
> > 5.15.x kernels worked, odds are good a bug was introduced between 5.15 and 6.1,
> > i.e. that this is fudging around a kernel bug that can and should be fixed.
> >
> > Are you able to bisect the kernel between 6.1 and 5.15 to try and pinpoint an
> > exact commit that introduced the problem?
>
> That took a few days, but resulted in the following:
>
> 4be33cf187036744b4ed84824e7157cfc09c6f4c is the first bad commit
> commit 4be33cf187036744b4ed84824e7157cfc09c6f4c
> Author: Fred Ai <fred.ai@xxxxxxxxxxxxxx>
> Date: Mon Dec 20 20:09:40 2021 -0800
>
> mmc: sdhci-pci-o2micro: Improve card input timing at SDR104/HS200 mode
>
> Card input timing is margin, need to adjust the hold timing of card input.
>
> Signed-off-by: Fred Ai <fred.ai@xxxxxxxxxxxxxx>
> Link: https://lore.kernel.org/r/20211221040940.484-1-fred.ai@xxxxxxxxxxxxxx
> Signed-off-by: Ulf Hansson <ulf.hansson@xxxxxxxxxx>
>
> drivers/mmc/host/sdhci-pci-o2micro.c | 57 ++++++++++++++++++++++++++++++------
> 1 file changed, 48 insertions(+), 9 deletions(-)
>
> I'm not sure how this code impacts this device, but it does contain a "HS200 MMC card":
> $ dmesg | grep mmc
> [ 1.044708] mmc0: emmc 1.8v flag is set, force 1.8v signaling voltage
> [ 1.044937] mmc0: SDHCI controller on PCI [0000:01:00.0] using ADMA
> [ 2.120632] mmc0: new HS200 MMC card at address 0001
> [ 2.122912] mmcblk0: mmc0:0001 hA8aP> 14.7 GiB
> [ 2.124810] mmcblk0: p1 p2 p3
>
> I can provide more info on the hardware, which is also available in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1056056

Adding the relevant people from that commit, this is waaaaay outside my area of
expertise.