Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block

From: Kirill A. Shutemov
Date: Fri Dec 15 2017 - 04:33:18 EST


On Fri, Dec 15, 2017 at 10:04:27AM +0530, Anshuman Khandual wrote:
> On 12/15/2017 01:23 AM, Yang Shi wrote:
> > In the current design, khugepaged needs to acquire mmap_sem before scanning
> > an mm, but in some corner cases khugepaged may scan a currently running
> > process that is modifying its memory mappings, so khugepaged may block
> > in uninterruptible state. If the process holds mmap_sem for a long time
> > while modifying a huge memory space, this can trigger the khugepaged
> > hung task report below:
> >
> > INFO: task khugepaged:270 blocked for more than 120 seconds.
> > Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > khugepaged D 0 270 2 0x00000000
> > ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
> > ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
> > 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
> > Call Trace:
> > [<ffffffff817079a5>] ? __schedule+0x235/0x6e0
> > [<ffffffff81707e86>] schedule+0x36/0x80
> > [<ffffffff8170a970>] rwsem_down_read_failed+0xf0/0x150
> > [<ffffffff81384998>] call_rwsem_down_read_failed+0x18/0x30
> > [<ffffffff8170a1c0>] down_read+0x20/0x40
> > [<ffffffff81226836>] khugepaged+0x476/0x11d0
> > [<ffffffff810c9d0e>] ? idle_balance+0x1ce/0x300
> > [<ffffffff810d0850>] ? prepare_to_wait_event+0x100/0x100
> > [<ffffffff812263c0>] ? collapse_shmem+0xbf0/0xbf0
> > [<ffffffff810a8d46>] kthread+0xe6/0x100
> > [<ffffffff810a8c60>] ? kthread_park+0x60/0x60
> > [<ffffffff8170cd15>] ret_from_fork+0x25/0x30

What holds the lock for this long? I think the other side is also worth fixing.

> >
> > So it is pointless for khugepaged to just block waiting for the
> > semaphore. Replace down_read() with down_read_trylock() so that it moves
> > on to scan the next mm quickly instead of blocking on the semaphore, which
> > gives other processes more chances to install THP.
> > Then khugepaged can come back to scan the skipped mm when it finishes the
> > current full_scan round.
>
> That may be too harsh on the process, which now has to wait for a complete
> round of full scan before khugepaged comes back. What if the mmap_sem
> contention caused by VMA changes in the process was just temporary?

It's always temporary. Unless something is very broken. :)

If the mmap_sem is taken for write, it may also mean that the memory layout of
the process is not yet settled and we would just waste time collapsing pages
that are about to go away.

And it's better for khugepaged to do useful work than to just wait for the lock.

Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

--
Kirill A. Shutemov