Re: printk.time causes rare kernel boot hangs

From: Richard W.M. Jones
Date: Thu Jun 15 2023 - 03:50:46 EST


On Thu, Jun 15, 2023 at 09:40:40AM +0200, Alexandre Belloni wrote:
> Hello,
>
> On 14/06/2023 18:34:30+0100, Richard W.M. Jones wrote:
> >
> > FWIW attached is a test program that runs the qemu instances in
> > parallel (up to 8 threads), which seems to be a quicker way to hit the
> > problem for me. Even on Intel, with this test I can hit the bug in a
> > few hundred iterations.
> >
>
> I'm just chiming in to say that we do hit the same issue on the Yocto
> Project CI. We are using qemu 8.0.0 on Intel hardware and a 6.1 kernel.
>
> I see that f31dcb152a3d0816e2f1deab4e64572336da197d hasn't been
> backported so it may not be the culprit. However, this seems to have
> started happening when we switched from 5.15 to 6.1.

I don't know if it's related or not, or even valid, but it was pointed
out to me[1] that you can get the exact same failure this way:

- Linux git @ b6dad5178ceaf23f369c3711062ce1f2afc33644
- Revert f31dcb152a3d0816e2f1deab4e64572336da197d
- Add the following patch:

diff --git a/init/main.c b/init/main.c
index af50044deed5..c2774865a83f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1552,6 +1552,7 @@ static noinline void __init kernel_init_freeable(void)

 	cad_pid = get_pid(task_pid(current));

+	msleep(1);
 	smp_prepare_cpus(setup_max_cpus);

 	workqueue_init();

So is sleeping in kernel_init_freeable() valid? It doesn't look like
an atomic context to me. And is it just a coincidence that the failure
looks precisely the same?

Rich.

[1] https://news.ycombinator.com/item?id=36336059

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top