Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]

From: Joel Fernandes
Date: Wed Mar 13 2024 - 12:02:11 EST


Hello Florian,

On Tue, Mar 12, 2024 at 7:48 PM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
>
> On Tue, Mar 12, 2024 at 02:44:14PM -0700, Linus Torvalds wrote:
> > On Tue, 12 Mar 2024 at 14:34, Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:
> > >
> > > and here is a log where this fails:
> > >
> > > https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8
> >
> > You could try the 'initcall_debug' kernel command line.
> >
>
> Right, that'll be helpful.
>
> Besides I took a look at the config Florian shared, no TASKS_RCU,
> RCU_LAZY or RCU nocb is enabled. So probably the only left changes in
> the PR are around RCU expedited. Florian, could you see if you can build
> and test with CONFIG_PROVE_RCU=y (you need to select
> CONFIG_PROVE_LOCKING for that)? That'll call synchronize_rcu() +
> synchronize_rcu_expedited() before and after we switch
> rcu_scheduler_active, and it may provide more information. Thanks!

Adding to everyone's suggestions, could you also try booting with
"rcupdate.rcu_normal=1" ? This will disable expedited RCU and help us
further confirm that it is indeed expedited RCU (and then we can look
into fixing that).

Also there are 2 additional users of expedited RCU in this release I noticed:

78c3253f27e5 ("net: use synchronize_rcu_expedited in cleanup_net()")
1ebb85f9c03d ("netfilter: conntrack: expedite rcu in
nf_conntrack_cleanup_net_list")

Could you also try reverting those patches as well, and see if the
issue goes away?

thanks,

- Joel

On Tue, Mar 12, 2024 at 7:48 PM Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
>
> On Tue, Mar 12, 2024 at 02:44:14PM -0700, Linus Torvalds wrote:
> > On Tue, 12 Mar 2024 at 14:34, Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:
> > >
> > > and here is a log where this fails:
> > >
> > > https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8
> >
> > You could try the 'initcall_debug' kernel command line.
> >
>
> Right, that'll be helpful.
>
> Besides I took a look at the config Florian shared, no TASKS_RCU,
> RCU_LAZY or RCU nocb is enabled. So probably the only left changes in
> the PR are around RCU expedited. Florian, could you see if you can build
> and test with CONFIG_PROVE_RCU=y (you need to select
> CONFIG_PROVE_LOCKING for that)? That'll call synchronize_rcu() +
> synchronize_rcu_expedited() before and after we switch
> rcu_scheduler_active, and it may provide more information. Thanks!
>
> Regards,
> Boqun
>
> > It will make the above *much* noisier, but it might - thanks to all
> > the new noise - show exactly *what* is being crazy slow to initialize.
> >
> > Because right now it's just radio silence in between those
> >
> > [ 1.926435] bcmgenet f0480000.ethernet: GENET 5.0 EPHY: 0x0000
> > [ 162.148135] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> >
> > things, and that's presumably because some random initcall there just
> > takes forever to time out.
> >
> > Linus