Re: [RFC PATCH v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

From: Toshiki Fukasawa
Date: Fri Sep 06 2019 - 06:03:47 EST


Thank you for your feedback.

On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> A kernel panic is observed during reading
>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>> pmem namespace:
>>
>> BUG: unable to handle page fault for address: fffffffffffffffe
>> [ 114.495280] #PF: supervisor read access in kernel mode
>> [ 114.495738] #PF: error_code(0x0000) - not-present page
>> [ 114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>> [ 114.496713] Oops: 0000 [#1] SMP PTI
>> [ 114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>> [ 114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>> [ 114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [ 114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>> [ 114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
>> [ 114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
>> [ 114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
>> [ 114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
>> [ 114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
>> [ 114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
>> [ 114.504530] FS: 00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
>> [ 114.505245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
>> [ 114.506401] Call Trace:
>> [ 114.506660] kpageflags_read+0xb1/0x130
>> [ 114.507051] proc_reg_read+0x39/0x60
>> [ 114.507387] vfs_read+0x8a/0x140
>> [ 114.507686] ksys_pread64+0x61/0xa0
>> [ 114.508021] do_syscall_64+0x5f/0x1a0
>> [ 114.508372] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 114.508844] RIP: 0033:0x7f0266ba426b
>>
>> The first few pages of ZONE_DEVICE expressed as the range
>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>> skipped by struct page initialization. Some pfn walkers like
>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>> struct pages, which causes the error.
>>
>> In previous discussion, Dan seemed to have concern that the struct
>> page area of some pages indicated by vmem_altmap->reserve may not
>> be allocated. (See https://lore.kernel.org/lkml/CAPcyv4i5FjTOnPbXNcTzvt+e6RQYow0JRQwSFuxaa62LSuvzHQ@xxxxxxxxxxxxxx/)
>> However, arch_add_memory() called by devm_memremap_pages() allocates
>> struct page area for pages containing addresses in the range
>> (res.start) to (res.start + resource_size(res)), which include the
>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>> initializes struct pages in the same range.
>> So I think the struct pages should be initialized.>
>
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
>
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

No, in current approach this function is no longer needed.
The reason why we change the approach is that all pfn walkers
have to be aware of the uninitialized struct pages.

As for SECTION_IS_ONLINE, I'm not sure now.
I will look into it next week.

Thanks,
Toshiki Fukasawa

>
>> Signed-off-by: Toshiki Fukasawa <t-fukasawa@xxxxxxxxxxxxx>
>> Cc: stable@xxxxxxxxxxxxxxx
>> ---
>> Changes since rev 1:
>> Instead of avoiding uninitialized pages on the pfn walker side,
>> we initialize struct pages.
>>
>> mm/page_alloc.c | 5 +----
>> 1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9c91949..6d180ae 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>
>> #ifdef CONFIG_ZONE_DEVICE
>> /*
>> - * Honor reservation requested by the driver for this ZONE_DEVICE
>> - * memory. We limit the total number of pages to initialize to just
>> + * We limit the total number of pages to initialize to just
>> * those that might contain the memory mapping. We will defer the
>> * ZONE_DEVICE page initialization until after we have released
>> * the hotplug lock.
>> @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> if (!altmap)
>> return;
>>
>> - if (start_pfn == altmap->base_pfn)
>> - start_pfn += altmap->reserve;
>> end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> }
>> #endif
>>
>
>