Re: [PATCH] prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.

From: Adam Litke
Date: Tue Nov 17 2009 - 11:44:23 EST


This looks good to me. The venerable hugetlb_instantiation_mutex already
prevents allocation/instantiation races in hugetlb_cow(), so the extra
spinlock dropping should have no adverse effects with respect to that.

Acked-by: Adam Litke <agl@xxxxxxxxxx>

On Tue, 2009-11-17 at 16:09 +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2009 at 12:12:53PM -0800, Andrew Morton wrote:
> > On Tue, 10 Nov 2009 12:00:56 -0500
> > Larry Woodman <lwoodman@xxxxxxxxxx> wrote:
> >
> > >
> > > hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> > > hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to
> > > an insufficient huge page pool it calls unmap_ref_private() with the
> > > mm->page_table_lock held. unmap_ref_private() then calls
> > > unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
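
A minimal userspace analogy of that self-deadlock: a non-recursive spinlock
taken by a caller and then taken again further down the same call chain. This
is only a sketch; the function names are invented and a pthread spinlock
merely stands in for mm->page_table_lock. Built with gcc -pthread, it simply
hangs on the second lock attempt:

/*
 * Analogy for mm->page_table_lock being re-acquired along the path
 * hugetlb_fault() -> hugetlb_cow() -> unmap_ref_private() ->
 * unmap_hugepage_range(). A non-recursive spinlock spins forever here.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;		/* stands in for mm->page_table_lock */

static void fake_unmap_hugepage_range(void)
{
	pthread_spin_lock(&lock);	/* second acquisition: spins forever */
	/* ... tear down the mapping ... */
	pthread_spin_unlock(&lock);
}

static void fake_cow_failure_path(void)
{
	/* allocation failed; try to reclaim the page from other mappings */
	fake_unmap_hugepage_range();
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	pthread_spin_lock(&lock);	/* taken by the fault path */
	fake_cow_failure_path();	/* never returns */
	pthread_spin_unlock(&lock);

	puts("never reached");
	return 0;
}
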
> > >
> > >
> > > [<ffffffff810928c3>] print_circular_bug_tail+0x80/0x9f
> > > [<ffffffff8109280b>] ? check_noncircular+0xb0/0xe8
> > > [<ffffffff810935e0>] __lock_acquire+0x956/0xc0e
> > > [<ffffffff81093986>] lock_acquire+0xee/0x12e
> > > [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> > > [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> > > [<ffffffff814c348d>] _spin_lock+0x40/0x89
> > > [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> > > [<ffffffff8111afee>] ? alloc_huge_page+0x218/0x318
> > > [<ffffffff8111a7a6>] unmap_hugepage_range+0x3e/0x84
> > > [<ffffffff8111b2d0>] hugetlb_cow+0x1e2/0x3f4
> > > [<ffffffff8111b935>] ? hugetlb_fault+0x453/0x4f6
> > > [<ffffffff8111b962>] hugetlb_fault+0x480/0x4f6
> > > [<ffffffff8111baee>] follow_hugetlb_page+0x116/0x2d9
> > > [<ffffffff814c31a7>] ? _spin_unlock_irq+0x3a/0x5c
> > > [<ffffffff81107b4d>] __get_user_pages+0x2a3/0x427
> > > [<ffffffff81107d0f>] get_user_pages+0x3e/0x54
> > > [<ffffffff81040b8b>] get_user_pages_fast+0x170/0x1b5
> > > [<ffffffff81160352>] dio_get_page+0x64/0x14a
> > > [<ffffffff8116112a>] __blockdev_direct_IO+0x4b7/0xb31
> > > [<ffffffff8115ef91>] blkdev_direct_IO+0x58/0x6e
> > > [<ffffffff8115e0a4>] ? blkdev_get_blocks+0x0/0xb8
> > > [<ffffffff810ed2c5>] generic_file_aio_read+0xdd/0x528
> > > [<ffffffff81219da3>] ? avc_has_perm+0x66/0x8c
> > > [<ffffffff81132842>] do_sync_read+0xf5/0x146
> > > [<ffffffff8107da00>] ? autoremove_wake_function+0x0/0x5a
> > > [<ffffffff81211857>] ? security_file_permission+0x24/0x3a
> > > [<ffffffff81132fd8>] vfs_read+0xb5/0x126
> > > [<ffffffff81133f6b>] ? fget_light+0x5e/0xf8
> > > [<ffffffff81133131>] sys_read+0x54/0x8c
> > > [<ffffffff81011e42>] system_call_fastpath+0x16/0x1b
> >
> > Confused.
> >
> > That code is very old, alloc_huge_page() failures are surely common and
> > afaict the bug will lead to an instantly dead box. So why haven't
> > people hit this bug before now?
> >
>
> Failures like that can happen when the workload is calling fork() with
> MAP_PRIVATE and the child is writing with a small hugepage pool. I reran the
> tests used at the time the code was written and they don't trigger warnings or
> problems in the normal case other than the note that the child got killed. It
> required PROVE_LOCKING to be set which is not set on the default configs I
> normally use when I don't suspect locking problems.
>
> I can confirm that with PROVE_LOCKING set the warnings do trigger, but the
> machine doesn't lock up and I see the following in dmesg. Bit of a surprise
> really; you'd think double-taking a spinlock would result in damage.
>
> [ 111.031795] PID 2876 killed due to inadequate hugepage pool
>
> Applying the patch does fix that problem but there is a very similar
> problem remaining in the same path. More on this later.
>
> > > This can be fixed by dropping the mm->page_table_lock around the call
> > > to unmap_ref_private() if alloc_huge_page() fails; it's dropped right below
> > > in the normal path anyway:
> > >
> >
> > Why is that safe? What is the page_table_lock protecting in here?
> >
>
> The lock has been there since 2005 and was to protect against concurrent
> PTE updates. However, it is probably unnecessary protection as this whole
> path should be protected by the hugetlb_instantiation_mutex. There is locking
> left behind that may be unnecessary after that mutex was introduced, but it
> is left in place for the day someone decides to tackle that mutex.
>
> I don't think this patch solves everything with the page_table_lock in
> that path because it's also possible for the COW path to enter the buddy
> allocator with the spinlock still held, with a setup like
>
> 1. create a mapping of 3 huge pages
> 2. allow the system to dynamically allocate 6 huge pages
> 3. fork() and fault in the child
>
> With preempt and DEBUG_SPINLOCK_SLEEP set, a different warning can still
> trigger.
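
Along the lines of those three steps, a hypothetical reproducer sketch (the
hugetlbfs path, sysctl values and the 2MB huge page size are assumptions for
illustration, not taken from the thread): the parent touches MAP_PRIVATE
hugetlbfs pages, then the child writes to them after fork(), forcing the
hugetlb COW path to allocate from a deliberately small pool.

/*
 * Assumed setup (hypothetical), e.g.:
 *   echo 0 > /proc/sys/vm/nr_hugepages
 *   echo 6 > /proc/sys/vm/nr_overcommit_hugepages
 *   mount -t hugetlbfs none /mnt/huge
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */
#define NR_PAGES	3UL

int main(void)
{
	size_t len = NR_PAGES * HPAGE_SIZE;
	int fd = open("/mnt/huge/repro", O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("open"); return 1; }

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	memset(p, 1, len);		/* parent instantiates the private pages */

	if (fork() == 0) {
		/*
		 * Child writes trigger hugetlb COW; with a small pool the
		 * allocation in hugetlb_cow() can fail and the child may be
		 * killed for lack of huge pages.
		 */
		memset(p, 2, len);
		_exit(0);
	}
	wait(NULL);
	unlink("/mnt/huge/repro");
	return 0;
}
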
>
> Here is an alternative patch below which extends Larry's patch to drop the
> spinlock earlier and retake it as required. Because it involves page table
> updates, I've added Hugh to the cc because he knows all the rules and
> gotchas backwards.
>
> ==== CUT HERE ====
> hugetlb: prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.
>
> hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
> insufficient huge page pool it calls unmap_ref_private() with the
> mm->page_table_lock held. unmap_ref_private() then calls
> unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
>
> [<ffffffff810928c3>] print_circular_bug_tail+0x80/0x9f
> [<ffffffff8109280b>] ? check_noncircular+0xb0/0xe8
> [<ffffffff810935e0>] __lock_acquire+0x956/0xc0e
> [<ffffffff81093986>] lock_acquire+0xee/0x12e
> [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> [<ffffffff814c348d>] _spin_lock+0x40/0x89
> [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
> [<ffffffff8111afee>] ? alloc_huge_page+0x218/0x318
> [<ffffffff8111a7a6>] unmap_hugepage_range+0x3e/0x84
> [<ffffffff8111b2d0>] hugetlb_cow+0x1e2/0x3f4
> [<ffffffff8111b935>] ? hugetlb_fault+0x453/0x4f6
> [<ffffffff8111b962>] hugetlb_fault+0x480/0x4f6
> [<ffffffff8111baee>] follow_hugetlb_page+0x116/0x2d9
> [<ffffffff814c31a7>] ? _spin_unlock_irq+0x3a/0x5c
> [<ffffffff81107b4d>] __get_user_pages+0x2a3/0x427
> [<ffffffff81107d0f>] get_user_pages+0x3e/0x54
> [<ffffffff81040b8b>] get_user_pages_fast+0x170/0x1b5
> [<ffffffff81160352>] dio_get_page+0x64/0x14a
> [<ffffffff8116112a>] __blockdev_direct_IO+0x4b7/0xb31
> [<ffffffff8115ef91>] blkdev_direct_IO+0x58/0x6e
> [<ffffffff8115e0a4>] ? blkdev_get_blocks+0x0/0xb8
> [<ffffffff810ed2c5>] generic_file_aio_read+0xdd/0x528
> [<ffffffff81219da3>] ? avc_has_perm+0x66/0x8c
> [<ffffffff81132842>] do_sync_read+0xf5/0x146
> [<ffffffff8107da00>] ? autoremove_wake_function+0x0/0x5a
> [<ffffffff81211857>] ? security_file_permission+0x24/0x3a
> [<ffffffff81132fd8>] vfs_read+0xb5/0x126
> [<ffffffff81133f6b>] ? fget_light+0x5e/0xf8
> [<ffffffff81133131>] sys_read+0x54/0x8c
> [<ffffffff81011e42>] system_call_fastpath+0x16/0x1b
>
> This can be fixed by dropping the mm->page_table_lock around the call to
> unmap_ref_private() if alloc_huge_page() fails; it's dropped right below in
> the normal path anyway. However, earlier in that function, it's also
> possible to call into the page allocator with the same spinlock held.
>
> What this patch does is drop the spinlock before the page allocator is
> potentially entered. Both the check for page allocation failure and the copy
> of the huge page can be made without the page_table_lock. Even if the PTE
> changes while the spinlock is dropped, the consequence is only that a huge
> page is copied unnecessarily. This resolves both the double taking of the
> lock and sleeping with the spinlock held.
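
For readers following the reasoning rather than the diff, the pattern being
applied can be sketched in userspace terms. This is an illustration only, not
the kernel code: the names are invented and a generation counter stands in
for the pte_same() recheck. As in the patch, the worst case is a needless
copy followed by a retry.

/*
 * Drop the lock before a potentially blocking allocation, retake it, and
 * revalidate the protected state before installing the copy.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

static pthread_spinlock_t table_lock;	/* plays the role of page_table_lock */
static char *mapping;			/* the "old page" */
static uint64_t mapping_gen;		/* bumped whenever mapping is replaced */

/* Called with table_lock held, returns with it held (like hugetlb_cow()). */
static int cow(void)
{
	uint64_t gen;
	char *new_page;

retry:
	gen = mapping_gen;

	pthread_spin_unlock(&table_lock);	/* allocation may block or fail */
	new_page = malloc(PAGE_SZ);
	if (new_page)
		memcpy(new_page, mapping, PAGE_SZ);	/* copy without the lock */
	pthread_spin_lock(&table_lock);		/* caller expects the lock held */

	if (!new_page)
		return -1;

	if (gen != mapping_gen) {		/* like the pte_same() recheck */
		free(new_page);			/* copied needlessly; try again */
		goto retry;
	}

	free(mapping);
	mapping = new_page;			/* "break COW": install the copy */
	mapping_gen++;
	return 0;
}

int main(void)
{
	pthread_spin_init(&table_lock, PTHREAD_PROCESS_PRIVATE);
	mapping = calloc(1, PAGE_SZ);

	pthread_spin_lock(&table_lock);
	printf("cow() returned %d\n", cow());
	pthread_spin_unlock(&table_lock);
	return 0;
}
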
>
> [mel@xxxxxxxxx: Cover also the case where process can sleep with spinlock]
> Signed-off-by: Larry Woodman <lwoodman@xxxxxxxxxx>
> Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
> Cc: Adam Litke <agl@xxxxxxxxxx>
> Cc: Andy Whitcroft <apw@xxxxxxxxxxxx>
> Cc: Lee Schermerhorn <lee.schermerhorn@xxxxxx>
> Cc: Hugh Dickins <hugh.dickins@xxxxxxxxxxxxx>
> Cc: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> ---
> mm/hugetlb.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 450493d..2ef66a2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2293,6 +2293,9 @@ retry_avoidcopy:
> outside_reserve = 1;
>
> page_cache_get(old_page);
> +
> + /* Drop page_table_lock as buddy allocator may be called */
> + spin_unlock(&mm->page_table_lock);
> new_page = alloc_huge_page(vma, address, outside_reserve);
>
> if (IS_ERR(new_page)) {
> @@ -2310,19 +2313,25 @@ retry_avoidcopy:
> if (unmap_ref_private(mm, vma, old_page, address)) {
> BUG_ON(page_count(old_page) != 1);
> BUG_ON(huge_pte_none(pte));
> + spin_lock(&mm->page_table_lock);
> goto retry_avoidcopy;
> }
> WARN_ON_ONCE(1);
> }
>
> + /* Caller expects lock to be held */
> + spin_lock(&mm->page_table_lock);
> return -PTR_ERR(new_page);
> }
>
> - spin_unlock(&mm->page_table_lock);
> copy_huge_page(new_page, old_page, address, vma);
> __SetPageUptodate(new_page);
> - spin_lock(&mm->page_table_lock);
>
> + /*
> + * Retake the page_table_lock to check for racing updates
> + * before the page tables are altered
> + */
> + spin_lock(&mm->page_table_lock);
> ptep = huge_pte_offset(mm, address & huge_page_mask(h));
> if (likely(pte_same(huge_ptep_get(ptep), pte))) {
> /* Break COW */


--
Thanks,
Adam

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/