Re: printk.time causes rare kernel boot hangs

From: YiFei Zhu
Date: Thu Jun 15 2023 - 07:04:37 EST


> FWIW attached is a test program that runs the qemu instances in
> parallel (up to 8 threads), which seems to be a quicker way to hit the
> problem for me. Even on Intel, with this test I can hit the bug in a
> few hundred iterations.

A friend sent me here so I took a look.

I was unable to reproduce with this script after 10000 iterations
on an AMD Gentoo Linux host:

Host kernel: 6.3.3 vanilla
Guest kernel: git commit f31dcb152a3d0816e2f1deab4e64572336da197d
Guest config: Provided full-fat Fedora config + CONFIG_GDB_SCRIPTS
QEMU: 8.0.2 (with kvm_amd)
Hardware: AMD Ryzen 7 PRO 5850U

I wonder if anything on the host side affects this, or if it could be
some sort of race condition.

> So is sleeping in kernel_init_freeable valid? It seems as if it
> wouldn't be an atomic context. And is the fact that the failure looks
> precisely the same coincidence?

I did a quick test on this and was able to reproduce. Attaching gdb
shows the CPU is idling in:
default_idle -> arch_safe_halt -> pv_native_safe_halt
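
(For reference, a rough sketch of the setup I use -- the port is just
QEMU's default: start qemu with "-s" so it listens for gdb on tcp::1234,
then from the build tree:

  $ gdb vmlinux
  (gdb) target remote :1234
  (gdb) bt

"bt" against the stopped vcpu is what shows the call chain above.)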

By booting with an additional "idle=poll" and using gdb to invoke
"p show_state_filter(0)", I was able to dump the stack of kernel_init
at PID 1 (this is before it execve()s the userspace init):

[ 0.322736] task:swapper/0 state:D stack:0 pid:1 ppid:0 flags:0x00004000
[ 0.322736] Call Trace:
[ 0.322736] <TASK>
[ 0.322736] __schedule+0x3eb/0x14c0
[ 0.322736] schedule+0x5e/0xd0
[ 0.322736] schedule_timeout+0x98/0x160
[ 0.322736] ? __pfx_process_timeout+0x10/0x10
[ 0.322736] msleep+0x2d/0x40
[ 0.322736] kernel_init_freeable+0x1f7/0x490
[ 0.322736] ? __pfx_kernel_init+0x10/0x10
[ 0.322736] kernel_init+0x1a/0x1c0
[ 0.322736] ret_from_fork+0x2c/0x50
[ 0.322736] </TASK>

It does not appear that msleep works at this stage. Also interestingly,
the printk timestamp shown does not reflect the time spent idling (a few
seconds); it shows the exact same timestamp as:

[ 0.322736] Freeing SMP alternatives memory: 48K
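
If I'm reading the code right, that timestamp is grabbed with
local_clock() when the record is stored (vprintk_store() in
kernel/printk/printk.c):

	ts_nsec = local_clock();

so an unchanging timestamp would mean sched_clock is not advancing
here either.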

I'm not exactly sure at what point msleep would start working. If it's
just timers, then "init_timers" has already been run by PID 0 at this
point, I believe.

I also tested a busy delay via:

diff --git a/init/main.c b/init/main.c
index bb87b789c543..f6d722917388 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1616,6 +1616,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	cad_pid = get_pid(task_pid(current));
 
+	for (volatile long i = 0; i < 0x100000000ULL; i++);
 	smp_prepare_cpus(setup_max_cpus);
 
 	workqueue_init();
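
(The volatile qualifier is only there to keep the compiler from
optimizing out the empty loop; unlike msleep, this delay does not
depend on timers or the scheduler at all.)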

The kernel does boot and panics (due to the lack of a rootfs), taking
around two seconds longer. The dmesg timestamps printed still do not
reflect that there was a busy wait.

If you can reproduce the original bug (without the msleep or busy wait
patch), could you check whether you can also reproduce it with idle=poll?
If so, can you run "p show_state_filter(0)" so we get a stack trace of
kernel_init, assuming it hits a similar issue as when msleep was added?
If idle=poll does not work, or you can't call functions from within gdb
(some old qemu versions did not support this), see if you can send an
alt-sysrq-w to show the stacks of blocked tasks.
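
A sketch of doing that from the QEMU monitor, assuming the guest has
CONFIG_MAGIC_SYSRQ=y (you may also need "sysrq_always_enabled" on the
guest cmdline):

  (qemu) sendkey alt-sysrq-w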

YiFei Zhu