Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

From: Saravana Kannan
Date: Wed Aug 23 2023 - 17:01:34 EST


On Tue, Aug 22, 2023 at 7:10 AM Shreeya Patel
<shreeya.patel@xxxxxxxxxxxxx> wrote:
>
> Hi Robin,
>
> On 21/08/23 18:09, Robin Murphy wrote:
> > On 2023-08-21 12:35, Shreeya Patel wrote:
> >>
> >> On 19/08/23 01:49, Saravana Kannan wrote:
> >>> On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel
> >>> <shreeya.patel@xxxxxxxxxxxxx> wrote:
> >>>> Hi Geert, Saravana,
> >>>>
> >>>> On 18/08/23 00:03, Saravana Kannan wrote:
> >>>>> On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel
> >>>>> <shreeya.patel@xxxxxxxxxxxxx> wrote:
> >>>>>> Hi Greg,
> >>>>>>
> >>>>>> On 16/08/23 20:33, Greg Kroah-Hartman wrote:
> >>>>>>> On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote:
> >>>>>>>> On 13/06/22 15:40, Greg Kroah-Hartman wrote:
> >>>>>>>>> From: Saravana Kannan<saravanak@xxxxxxxxxx>
> >>>>>>>>>
> >>>>>>>>> [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
> >>>>>>>>>
> >>>>>>>>> Mounting NFS rootfs was timing out when deferred_probe_timeout
> >>>>>>>>> was
> >>>>>>>>> non-zero [1]. This was because ip_auto_config() initcall
> >>>>>>>>> times out
> >>>>>>>>> waiting for the network interfaces to show up when
> >>>>>>>>> deferred_probe_timeout was non-zero. While ip_auto_config() calls
> >>>>>>>>> wait_for_device_probe() to make sure any currently running
> >>>>>>>>> deferred
> >>>>>>>>> probe work or asynchronous probe finishes, that wasn't
> >>>>>>>>> sufficient to
> >>>>>>>>> account for devices being deferred until deferred_probe_timeout.
> >>>>>>>>>
> >>>>>>>>> Commit 35a672363ab3 ("driver core: Ensure
> >>>>>>>>> wait_for_device_probe() waits
> >>>>>>>>> until the deferred_probe_timeout fires") tried to fix that by
> >>>>>>>>> making
> >>>>>>>>> sure wait_for_device_probe() waits for deferred_probe_timeout
> >>>>>>>>> to expire
> >>>>>>>>> before returning.
> >>>>>>>>>
> >>>>>>>>> However, if wait_for_device_probe() is called from the
> >>>>>>>>> kernel_init()
> >>>>>>>>> context:
> >>>>>>>>>
> >>>>>>>>> - Before deferred_probe_initcall() [2], it causes the boot
> >>>>>>>>> process to
> >>>>>>>>> hang due to a deadlock.
> >>>>>>>>>
> >>>>>>>>> - After deferred_probe_initcall() [3], it blocks kernel_init()
> >>>>>>>>> from
> >>>>>>>>> continuing till deferred_probe_timeout expires and beats
> >>>>>>>>> the point of
> >>>>>>>>> deferred_probe_timeout that's trying to wait for
> >>>>>>>>> userspace to load
> >>>>>>>>> modules.
> >>>>>>>>>
> >>>>>>>>> Neither of this is good. So revert the changes to
> >>>>>>>>> wait_for_device_probe().
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >>>>>>>>> [2]
> >>>>>>>>> -https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/
> >>>>>>>>>
> >>>>>>>>> [3] -https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@xxxxxxxxxxxxx/
> >>>>>>>> Hi Saravana, Greg,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> KernelCI found this patch causes the
> >>>>>>>> baseline.bootrr.deferred-probe-empty test to fail on
> >>>>>>>> r8a77960-ulcb,
> >>>>>>>> see the following details for more information.
> >>>>>>>>
> >>>>>>>> KernelCI dashboard link:
> >>>>>>>> https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/
> >>>>>>>>
> >>>>>>>> Error messages from the logs :-
> >>>>>>>>
> >>>>>>>> + UUID=11236495_1.5.2.4.5
> >>>>>>>> + set +x
> >>>>>>>> + export
> >>>>>>>> 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin'
> >>>>>>>> + cd /opt/bootrr/libexec/bootrr
> >>>>>>>> + sh helpers/bootrr-auto
> >>>>>>>> e6800000.ethernet
> >>>>>>>> e6700000.dma-controller
> >>>>>>>> e7300000.dma-controller
> >>>>>>>> e7310000.dma-controller
> >>>>>>>> ec700000.dma-controller
> >>>>>>>> ec720000.dma-controller
> >>>>>>>> fea20000.vsp
> >>>>>>>> feb00000.display
> >>>>>>>> fea28000.vsp
> >>>>>>>> fea30000.vsp
> >>>>>>>> fe9a0000.vsp
> >>>>>>>> fe9af000.fcp
> >>>>>>>> fea27000.fcp
> >>>>>>>> fea2f000.fcp
> >>>>>>>> fea37000.fcp
> >>>>>>>> sound
> >>>>>>>> ee100000.mmc
> >>>>>>>> ee140000.mmc
> >>>>>>>> ec500000.sound
> >>>>>>>> /lava-11236495/1/../bin/lava-test-case
> >>>>>>>> <8>[ 17.476741] <LAVA_SIGNAL_TESTCASE
> >>>>>>>> TEST_CASE_ID=deferred-probe-empty RESULT=fail>
> >>>>>>>>
> >>>>>>>> Test case failing :-
> >>>>>>>> Baseline Bootrr deferred-probe-empty test
> >>>>>>>> -https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests
> >>>>>>>>
> >>>>>>>> Regression Reproduced :-
> >>>>>>>>
> >>>>>>>> Lava job after reverting the commit 5ee76c256e92
> >>>>>>>> https://lava.collabora.dev/scheduler/job/11292890
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Bisection report from KernelCI can be found at the bottom of
> >>>>>>>> the email.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Shreeya Patel
> >>>>>>>>
> >>>>>>>> #regzbot introduced: 5ee76c256e92
> >>>>>>>> #regzbot title: KernelCI: Multiple devices deferring on
> >>>>>>>> r8a77960-ulcb
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * **
> >>>>>>>> * If you do send a fix, please include this trailer: *
> >>>>>>>> * Reported-by: "kernelci.org bot" <bot@...> *
> >>>>>>>> * *
> >>>>>>>> * Hope this helps! *
> >>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> >>>>>>>>
> >>>>>>>> stable-rc/linux-5.10.y bisection:
> >>>>>>>> baseline.bootrr.deferred-probe-empty on
> >>>>>>>> r8a77960-ulcb
> >>>>>>> You are testing 5.10.y, yet the subject says 5.17?
> >>>>>>>
> >>>>>>> Which is it here?
> >>>>>> Sorry, I accidentally used the lore link for 5.17 while reporting
> >>>>>> this
> >>>>>> issue,
> >>>>>> but this test does fail on all the stable releases from 5.10
> >>>>>> onwards.
> >>>>>>
> >>>>>> stable 5.15 :-
> >>>>>> https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/
> >>>>>> mainline :-
> >>>>>> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
> >>>>>>
> >>>>> Shreeya, can you try the patch Geert suggested and let us know if it
> >>>>> helps? If not, then I can try to take a closer look.
> >>>> I tried to test the kernel with 9be4cbd09da8 but it didn't change the
> >>>> result.
> >>>> https://lava.collabora.dev/scheduler/job/11311615
> >>>>
> >>>> Also, I am not sure if this can change things but just FYI, KernelCI
> >>>> adds some kernel parameters when running these tests and one of the
> >>>> parameter is deferred_probe_timeout=60.
> >>> Ah this is good to know.
> >>>
> >>>> You can check this in the definition details given in the Lava job. I
> >>>> also tried to remove this parameter and rerun the test but again I got
> >>>> the same result.
> >>> How long does the test wait after boot before checking for the
> >>> deferred devices list?
> >>>
> >>
> >> AFAIK, script for running the tests is immediately ran after the boot
> >> process is complete so there is no wait time.
> >
> > Regardless of what the kernel is doing, it seems like a fundamentally
> > dumb test to specifically ask deferred probe to wait for up to a
> > minute then complain that it hasn't finished after 11 seconds :/
> >
> > If anything, it seems plausible that the "regression" might actually
> > be the correct behaviour, and it was wrong before. I can't manage to
> > pull up a boot log for a pre-5.10 kernel since all the async stuff on
> > the KernelCI dashboard always just times out for me with a helpful
> > "Error while loading data from the server (error code: 0)", but what
> > would be interesting is whether those devices on the list are expected
> > to successfully probe anyway - the mainline log below also shows other
> > stuff failing to probe and CPUs failing to come online, so it's
> > clearly not a very happy platform to begin with.
> >
>
> Sorry about the dashboard issues you are facing, KernelCI team is
> working on a new dashboard which will fix all of these issues. But we
> need to wait for it be ready for some more time.
>
>
> Your point makes sense and that is why we did a test to add some sleep
> time of 60-65 seconds before running the tests and it actually fixed the
> problem. There are no more deferred devices as you can see in the
> following job.
> https://lava.collabora.dev/scheduler/job/11330931
>
> This change, to add deferred_probe_timeout=60 as kernel parameter was
> recently added in KernelCI since there were number of devices failing to
> probe on different platforms, specifically for chromebooks.
> Unfortunately, no one realized to change the start of the test time and
> hence these issues are seen now on different platforms.
>
> Thanks for pointing it out, it will help us to eliminate quite many
> deferred probe test failures that we are seeing.
>
> Saravana, thanks to you as well for asking the valid questions about
> when the test starts running.

Glad we sorted this out. Thanks for maintaining the tests.

-Saravana