[REGRESSION] suspend to ram fails in 6.1 due to tpm errors

From: Vlastimil Babka
Date: Mon Nov 28 2022 - 03:15:43 EST


Hi,

I've noticed my Lenovo T460 developed a failure to suspend to ram related to
TPM errors. Seems at each suspend/resume cycle there's a chance the errors
develop and then further suspends are blocked by the stuck TPM (or its driver?)
I can say for sure it never happened before 6.1, however I didn't test all 6.1
RCs and suspend/resume enough times to pinpoint it was 6.1-rc1 (which would be
naturally suspicious). Bisecting would also be time consuming and unreliable,
so hopefully this historical account will be sufficient:

The tpm messages on boot are always the same, here from 6.0.0-rc7:

tpm_tis 00:08: 1.2 TPM (device-id 0x1B, rev-id 16)
tpm tpm0:a TPM is disabled/deactivated (0x6)
tpm tpm0: tpm_read_log_acpi: TCPA log area empty

And normally during resume from suspend to ram I can see:

tpm tpm0: TPM is disabled/deactivated (0x6)

With 6.1-rc3 (the first 6.1-rcX I've tried on this laptop) this was still
behaving OK, suspend/resume went fine 4 times until I updated the kernel
and rebooted. Maybe it just wasn't enough cycles to hit the issue.

With 6.1-rc4, there were initially 3 resumes OK, but on 4th resume I saw:

tpm tpm0: tpm_try_transmit: send(): error -5
tpm tpm0: invalid TPM_STS.x 0xff, dumping stack for forensics
CPU: 2 PID: 15299 Comm: systemd-sleep Not tainted 6.1.0-rc4-2.gc03e512-default #1 openSUSE Tumbleweed (unreleased) 232cc11569ae1616983f707f1010e2c19601c7ee
Hardware name: LENOVO 20FMS27W03/20FMS27W03, BIOS R06ET71W (1.45 ) 02/21/2022
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x5c
tpm_tis_status.cold+0x19/0x1e
wait_for_tpm_stat+0x120/0x200
? tpm_tcg_read_bytes+0x8f/0xa0
tpm_tis_send_data+0x8c/0x260
tpm_tis_send_main+0x33/0x130
tpm_transmit+0xd4/0x3e0
tpm_transmit_cmd+0x25/0x90
tpm1_do_selftest+0x88/0x130
? set_next_entity+0xda/0x150
tpm_tis_resume+0x8d/0x1b0
pnp_bus_resume+0x60/0xa0
? pnp_bus_suspend+0x10/0x10
dpm_run_callback+0x47/0x150
device_resume+0x104/0x270
? dpm_show_time.cold+0x62/0x62
dpm_resume+0x191/0x2e0
dpm_resume_end+0xd/0x20
suspend_devices_and_enter+0x1c3/0x880
pm_suspend.cold+0x2d2/0x35e
state_store+0x68/0xd0
kernfs_fop_write_iter+0x11b/0x1f0
vfs_write+0x1f4/0x3b0
ksys_write+0x63/0xe0
do_syscall_64+0x58/0x80
? do_syscall_64+0x67/0x80
? do_syscall_64+0x67/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6197f079d4
Code: ff eb b7 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 80 3d cd 0f 0f 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
RSP: 002b:00007ffdaf9a92d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f6197f079d4
RDX: 0000000000000004 RSI: 00007ffdaf9a93c0 RDI: 0000000000000004
RBP: 00007ffdaf9a93c0 R08: 0000563fe8e33650 R09: 0000000000000073
R10: 00000000ffffffff R11: 0000000000000202 R12: 0000000000000004
R13: 0000563fe8e152d0 R14: 0000000000000004 R15: 00007f6197fe69e0
</TASK>

and a second later

tpm tpm0: tpm_try_transmit: send(): error -62

But that appears not to be the main issue. I've just noticed it while gathering
the info now, and didn't notice it back then as there were another 5 resumes ok
continuing the same kernel boot.

But then on another resume I got:

tpm tpm0: A TPM error (28) occurred continue selftest

And afterwards, many messages scattered in the log:

tpm tpm0: A TPM error (28) occurred attempting get random

And since then, suspend to ram no longer works and I see this:

tpm tpm0: Error (28) sending savestate before suspend
tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
tpm_tis 00:08: PM: failed to suspend: error 28
PM: Some devices failed to suspend, or early wake event detected

After reboot to 6.1-rc6, initially 3 resumes ok, and then again on 4th resume:

tpm tpm0: A TPM error (28) occurred continue selftest

and same story with the errors attempting get random, and suspend failing.
Notably this was without the tpm_try_transmit splat above, so that is
probably indeed not tha main issue. The moment things go wrong is
the "A TPM error (28) occurred continue selftest" during resume.

Dominik on IRC pointed me to commit b006c439d58d ("hwrng: core - start hwrng
kthread also for untrusted sources"), which could make sense if the TPM was not
used at all before and now it's used for randomness. But then it probably "just"
uncovered a pre-existing issue? Maybe there's a race with getting the randomness
and suspend? Could it be exactly what this patch is attempting to fix?
https://lore.kernel.org/all/20221103145450.1409273-2-jsd@xxxxxxxxxxxx/

Thanks,
Vlastimil

#regzbot introduced: v6.0..v6.1-rc4