Re: [PATCH 2/3] hugetlbfs: close race between MADV_DONTNEED and page fault

From: Mike Kravetz
Date: Mon Oct 02 2023 - 00:40:37 EST


On 09/30/23 20:55, riel@xxxxxxxxxxx wrote:
> From: Rik van Riel <riel@xxxxxxxxxxx>
>
> Malloc libraries, like jemalloc and tcmalloc, take decisions on when
> to call madvise independently from the code in the main application.
>
> This sometimes results in the application page faulting on an address,
> right after the malloc library has shot down the backing memory with
> MADV_DONTNEED.
>
> Usually this is harmless, because we always have some 4kB pages
> sitting around to satisfy a page fault. However, with hugetlbfs
> systems often allocate only the exact number of huge pages that
> the application wants.
>
> Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
> any lock taken on the page fault path, which can open up the following
> race condition:
>
> CPU 1                          CPU 2
>
> MADV_DONTNEED
> unmap page
> shoot down TLB entry
>                                page fault
>                                fail to allocate a huge page
>                                killed with SIGBUS
> free page
>
> Fix that race by pulling the locking from __unmap_hugepage_range_final
> into helper functions called from zap_page_range_single. This ensures
> page faults stay locked out of the MADV_DONTNEED VMA until the
> huge pages have actually been freed.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> ---
> include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++++++--
> mm/hugetlb.c            | 20 +++++++++++---------
> mm/memory.c             | 13 ++++++++-----

Hi Rik,

Something is not right here. I have not looked closely at the patch,
but running the libhugetlbfs test suite hits this NULL deref in
misalign (2M: 32).

[ 51.891236] BUG: kernel NULL pointer dereference, address: 00000000000001c0
[ 51.892420] #PF: supervisor read access in kernel mode
[ 51.893353] #PF: error_code(0x0000) - not-present page
[ 51.894207] PGD 80000001eeac0067 P4D 80000001eeac0067 PUD 1fa577067 PMD 0
[ 51.895299] Oops: 0000 [#1] PREEMPT SMP PTI
[ 51.896010] CPU: 0 PID: 1004 Comm: misalign Not tainted 6.6.0-rc3-next-20230925+ #13
[ 51.897285] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 51.898674] RIP: 0010:__hugetlb_zap_begin+0x76/0x90
[ 51.899488] Code: 06 48 8b 3a 48 39 cf 73 11 48 81 c7 ff ff ff 3f 48 81 e7 00 00 00 c0 48 89 3a 48 89 df e8 42 cd ff ff 48 8b 83 88 00 00 00 5b <48> 8b b8 c0 01 00 00 48 81 c7 28 01 00 00 e9 87 3b 91 00 0f 1f 80
[ 51.902194] RSP: 0018:ffffc9000487bbf0 EFLAGS: 00010246
[ 51.903019] RAX: 0000000000000000 RBX: 00000000f7a00000 RCX: 00000000c0000000
[ 51.904088] RDX: 0000000000440073 RSI: ffffc9000487bc00 RDI: ffff8881fa71dcb8
[ 51.905207] RBP: 00000000f7800000 R08: 00000000f7a00000 R09: 00000000f7a00000
[ 51.906284] R10: ffff8881fb5b8040 R11: ffff8881fb5b89b0 R12: ffff8881fa71dcb8
[ 51.907351] R13: ffffc9000487bd80 R14: ffffc9000487bc78 R15: 0000000000000001
[ 51.908648] FS: 0000000000000000(0000) GS:ffff888277c00000(0063) knlGS:00000000f7c99700
[ 51.910613] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 51.911983] CR2: 00000000000001c0 CR3: 00000001fa412005 CR4: 0000000000370ef0
[ 51.913417] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 51.914535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 51.915602] Call Trace:
[ 51.916069] <TASK>
[ 51.916480] ? __die+0x1f/0x70
[ 51.917057] ? page_fault_oops+0x159/0x450
[ 51.917737] ? do_user_addr_fault+0x65/0x850
[ 51.919360] ? exc_page_fault+0x6d/0x1c0
[ 51.920021] ? asm_exc_page_fault+0x22/0x30
[ 51.920712] ? __hugetlb_zap_begin+0x76/0x90
[ 51.921440] unmap_vmas+0xb3/0x100
[ 51.922057] unmap_region.constprop.0+0xcc/0x140
[ 51.922837] ? lock_release+0x142/0x290
[ 51.923474] ? preempt_count_add+0x47/0xa0
[ 51.924150] mmap_region+0x565/0xab0
[ 51.924809] do_mmap+0x35a/0x520
[ 51.925384] vm_mmap_pgoff+0xdf/0x200
[ 51.926008] ksys_mmap_pgoff+0x18f/0x200
[ 51.926834] ? syscall_enter_from_user_mode_prepare+0x19/0x60
[ 51.928006] __do_fast_syscall_32+0x68/0x100
[ 51.928962] do_fast_syscall_32+0x2f/0x70
[ 51.929896] entry_SYSENTER_compat_after_hwframe+0x7b/0x8d

I think you previously built libhugetlbfs, so hopefully you can recreate.

The stack trace (and test) suggest hugetlbfs_file_mmap returns an error
due to misalignment, and then we unmap the vma we just created. It
looks like the code is now calling hugetlb_zap_begin before
unmap_single_vma.
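
I have not diffed the patch carefully, so this is just my guess at the
new ordering in the zap path (hugetlb_zap_begin is in the trace,
hugetlb_zap_end and the exact call sites are assumptions on my part):

	/*
	 * Sketch only, not the actual patch: the idea seems to be to
	 * take the locks before unmap_single_vma() and drop them only
	 * after the huge pages have actually been freed.
	 */
	hugetlb_zap_begin(vma, &start, &end);	/* i_mmap_lock_write(vma->vm_file->f_mapping)? */
	unmap_single_vma(tlb, vma, start, end, details, false);
	hugetlb_zap_end(vma, details);
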
The code/comment in unmap_single_vma mentions this special cleanup
case:

	if (start != end) {
		if (unlikely(is_vm_hugetlb_page(vma))) {
			/*
			 * It is undesirable to test vma->vm_file as it
			 * should be non-null for valid hugetlb area.
			 * However, vm_file will be NULL in the error
			 * cleanup path of mmap_region. When
			 * hugetlbfs ->mmap method fails,
			 * mmap_region() nullifies vma->vm_file
			 * before calling this function to clean up.
			 * Since no pte has actually been setup, it is
			 * safe to do nothing in this case.
			 */
			if (vma->vm_file) {
				zap_flags_t zap_flags = details ?
						details->zap_flags : 0;
				__unmap_hugepage_range_final(tlb, vma, start, end,
							     NULL, zap_flags);
			}

Looks like vma->vm_file is NULL and __hugetlb_zap_begin is trying to do
i_mmap_lock_write(vma->vm_file->f_mapping).
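
If so, something along the lines of the vma->vm_file check that
unmap_single_vma already does might be enough. Untested sketch, just
to illustrate what I mean:

	/*
	 * Untested sketch, not the actual patch: only take the i_mmap
	 * lock when there is a file backing the vma.  In the
	 * mmap_region() error cleanup path vma->vm_file is already
	 * NULL and no ptes have been set up, so there is nothing to
	 * lock against.
	 */
	void __hugetlb_zap_begin(struct vm_area_struct *vma,
				 unsigned long *start, unsigned long *end)
	{
		/* ... range adjustment / vma locking as in the patch ... */
		if (vma->vm_file)
			i_mmap_lock_write(vma->vm_file->f_mapping);
	}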

Guess I did look closely. :)
--
Mike Kravetz