Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

From: Dave Hansen
Date: Fri Sep 16 2016 - 12:25:17 EST


On 09/16/2016 06:58 AM, Rui Teng wrote:
> On 9/15/16 12:37 AM, Dave Hansen wrote:
>> On 09/14/2016 09:33 AM, Rui Teng wrote:
>> But, as far as describing the initial problem, can you explain how the
>> tail pages still ended up being PageHuge()? Seems like dissolving the
>> huge page should have cleared that.
>>
> I used the script tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> to test and reproduce this bug, and I printed the pfn range passed to
> dissolve_free_huge_pages(). Each pfn range always spans 4096 pages, and
> the ranges are not contiguous.
> [ 72.362427] start_pfn: 204800, end_pfn: 208896
> [ 72.371677] start_pfn: 2162688, end_pfn: 2166784
> [ 72.373945] start_pfn: 217088, end_pfn: 221184
> [ 72.383218] start_pfn: 2170880, end_pfn: 2174976
> [ 72.385918] start_pfn: 2306048, end_pfn: 2310144
> [ 72.388254] start_pfn: 2326528, end_pfn: 2330624
>
> Sometimes, it will report a failure:
> [ 72.371690] memory offlining [mem 0x2100000000-0x210fffffff] failed
>
> And sometimes, it will report the following:
> [ 72.373956] Offlined Pages 4096
>
> Could the start_pfn and end_pfn passed to dissolve_free_huge_pages() be
> *random*? If so, the range may not include any head page and may start
> at a tail page, right?

That's an interesting data point, but it still doesn't quite explain
what is going on.
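
For reference, the arithmetic behind those ranges can be checked with a
small userspace sketch. It is only an illustration, not kernel code. Two
things in it are assumptions: the 64 KiB base page size, which is inferred
from the log (pfn 2162688 << 16 is 0x2100000000, the start of the range
that failed to offline), and the 16 GiB gigantic page size, which is not
stated anywhere in this thread. The helper names are made up for the
example.

```c
#include <stdio.h>

/* Assumed: 64 KiB base pages (inferred from the log, not stated). */
#define PAGE_SHIFT          16
/* Assumed: 16 GiB gigantic pages (hypothetical, not from the thread). */
#define GIGANTIC_BYTES      (16ULL << 30)
#define GIGANTIC_NR_PAGES   (GIGANTIC_BYTES >> PAGE_SHIFT)

/* The six pfn ranges printed in the report above. */
static const unsigned long long pfn_ranges[][2] = {
	{  204800,  208896 },
	{ 2162688, 2166784 },
	{  217088,  221184 },
	{ 2170880, 2174976 },
	{ 2306048, 2310144 },
	{ 2326528, 2330624 },
};

/* Physical byte address of the first byte of a pfn. */
static unsigned long long pfn_to_phys(unsigned long long pfn)
{
	return pfn << PAGE_SHIFT;
}

/*
 * Offset, in base pages, of a pfn from the previous gigantic-page
 * boundary. A nonzero value means that a naturally aligned gigantic
 * page covering this pfn would have its head page elsewhere.
 */
static unsigned long long offset_in_gigantic(unsigned long long pfn)
{
	return pfn & (GIGANTIC_NR_PAGES - 1);
}

/* Print the physical span and the alignment offset of every range. */
static void print_ranges(void)
{
	size_t i;

	for (i = 0; i < sizeof(pfn_ranges) / sizeof(pfn_ranges[0]); i++) {
		unsigned long long start = pfn_ranges[i][0];
		unsigned long long end = pfn_ranges[i][1];

		printf("[mem %#llx-%#llx] pages: %llu, offset in gigantic page: %llu\n",
		       pfn_to_phys(start), pfn_to_phys(end) - 1,
		       end - start, offset_in_gigantic(start));
	}
}
```

Calling print_ranges() from main() shows that, under those assumptions,
each range covers 256 MiB and none of them starts on a gigantic page
boundary. So a range beginning in the middle of a gigantic page would be
expected, and that alone does not tell us why some tail pages report
PageHuge() and others do not.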

It seems like there might be parts of gigantic pages that have
PageHuge() set on tail pages, while other parts don't. If that's true,
we have another bug and your patch just papers over the issue.

I think you really need to find the root cause before we apply this patch.