Re: next/master boot bisection: next-20190215 on beaglebone-black

From: Mike Rapoport
Date: Wed Mar 06 2019 - 09:06:03 EST


On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
> On 01/03/2019 23:23, Dan Williams wrote:
> > On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
> > <guillaume.tucker@xxxxxxxxxxxxx> wrote:
> >
> > Is there an early-printk facility that can be turned on to see how far
> > we get in the boot?
>
> Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
> earlyprintk in the command line. Here's the result, with the
> commit cherry picked on top of next-20190304:
>
> https://lava.collabora.co.uk/scheduler/job/1526326
>
> [ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
> [ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
> [ 1.404203] pgd = (ptrval)
> [ 1.406971] [77bb4003] *pgd=00000000
> [ 1.410650] Internal error: Oops: 5 [#1] ARM
> [...]
> [ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
> [ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)
>
> It's always failing at that point in the code. Also when
> enabling "debug" on the kernel command line, the issue goes
> away (exact same binaries etc..):
>
> https://lava.collabora.co.uk/scheduler/job/1526327
>
> For the record, here's the branch I've been using:
>
> https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug
>
> The board otherwise boots fine with next-20190304 (SMP=n), and
> also with the patch applied but the shuffle configs set to n.
>
> > Were there any boot *successes* on ARM with shuffling enabled? I.e.
> > clues about what's different about the specific memory setup for
> > beagle-bone-black.
>
> Looking at the KernelCI results from next-20190215, it looks like
> only the BeagleBone Black with SMP=n failed to boot:
>
> https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/
>
> Of course that's not all the ARM boards that exist out there, but
> it's a fairly large coverage already.
>
> As the kernel panic always seems to originate in ti-sysc.c,
> there's a chance it's only visible on that platform... I'm doing
> a KernelCI run now with my test branch to double check that,
> it'll take a few hours so I'll send an update later if I get
> anything useful out of it.
>
> In the meantime, I'm happy to try out other things with more
> debug configs turned on or any potential fixes someone might
> have.

ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. The
failure may be related to that...

Guillaume, can you try this patch:

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce1248..4a04aac 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -58,7 +58,8 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
  * For two pages to be swapped in the shuffle, they must be free (on a
  * 'free_area' lru), have the same order, and have the same migratetype.
  */
-static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
+static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order,
+						  struct zone *z)
 {
 	struct page *page;

@@ -80,6 +81,9 @@ static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
 	if (!PageBuddy(page))
 		return NULL;

+	if (!memmap_valid_within(pfn, page, z))
+		return NULL;
+
 	/*
 	 * ...is the page on the same list as the page we will
 	 * shuffle it with?
@@ -123,7 +127,7 @@ void __meminit __shuffle_zone(struct zone *z)
 		 * page_j randomly selected in the span @zone_start_pfn to
 		 * @spanned_pages.
 		 */
-		page_i = shuffle_valid_page(i, order);
+		page_i = shuffle_valid_page(i, order, z);
 		if (!page_i)
 			continue;

@@ -137,7 +141,7 @@ void __meminit __shuffle_zone(struct zone *z)
 			j = z->zone_start_pfn +
 				ALIGN_DOWN(get_random_long() % z->spanned_pages,
 						order_pages);
-			page_j = shuffle_valid_page(j, order);
+			page_j = shuffle_valid_page(j, order, z);
 			if (page_j && page_j != page_i)
 				break;
 		}
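
To illustrate what the added check is meant to catch, here is a minimal
userspace sketch of the idea behind memmap_valid_within(): with
ARCH_HAS_HOLES_MEMORYMODEL the memmap may have holes inside a zone's span,
so a randomly picked pfn can yield a struct page that does not really
describe that pfn or that zone. The struct layouts and the names
memmap_entry_matches() and self_check() are hypothetical stand-ins for
illustration only, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; NOT the kernel's real struct layouts. */
struct zone {
	const char *name;
};

struct page {
	unsigned long pfn;	/* pfn this memmap entry claims to describe */
	struct zone *zone;	/* zone this memmap entry claims to belong to */
};

/*
 * Models the two comparisons memmap_valid_within() makes: the entry must
 * map back to the pfn we looked up, and must belong to the zone being walked.
 */
bool memmap_entry_matches(unsigned long pfn, const struct page *page,
			  const struct zone *z)
{
	if (page->pfn != pfn)
		return false;
	if (page->zone != z)
		return false;
	return true;
}

/* Small self-check showing which lookups would be accepted or skipped. */
void self_check(void)
{
	struct zone normal = { "Normal" }, other = { "Other" };
	struct page good = { 0x80000, &normal };
	struct page stale = { 0x12345, &other };	/* entry left over in a hole */

	assert(memmap_entry_matches(0x80000, &good, &normal));
	assert(!memmap_entry_matches(0x80100, &stale, &normal));
}
```

In the patch, a candidate that fails this kind of check makes
shuffle_valid_page() return NULL, so the pfn is skipped rather than swapped.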


--
Sincerely yours,
Mike.