[PATCH AUTOSEL for 4.9 261/293] mm: thp: use down_read_trylock() in khugepaged to avoid long block

From: Sasha Levin
Date: Sun Apr 08 2018 - 21:48:54 EST


From: Yang Shi <yang.s@xxxxxxxxxxxxxxx>

[ Upstream commit 3b454ad35043dfbd3b5d2bb92b0991d6342afb44 ]

In the current design, khugepaged needs to acquire mmap_sem before
scanning an mm. In some corner cases, khugepaged may scan a process
that is modifying its memory mapping, so khugepaged blocks in
uninterruptible state. The process might hold mmap_sem for a long time
when modifying a huge memory space, which may trigger the khugepaged
hung task warning below:

INFO: task khugepaged:270 blocked for more than 120 seconds.
Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged D 0 270 2 0x00000000
ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
khugepaged+0x476/0x11d0
kthread+0xe6/0x100
ret_from_fork+0x25/0x30

So it is pointless to block khugepaged waiting for the semaphore.
Replace down_read() with down_read_trylock() so that khugepaged moves on
to scan the next mm quickly instead of blocking, which gives other
processes more chances to install THP. khugepaged can come back to scan
the skipped mm after it has finished the current full_scan round.

The change also appears to improve khugepaged efficiency a little bit.

Below are the test results from running LTP on a 24-core, 4GB-memory,
2-node NUMA VM:

                            pristine    w/ trylock
full_scan                        197           187
pages_collapsed                   21            26
thp_fault_alloc                40818         44466
thp_fault_fallback             18413         16679
thp_collapse_alloc                21           150
thp_collapse_alloc_failed         14            16
thp_file_alloc                   369           369

[akpm@xxxxxxxxxxxxxxxxxxxx: coding-style fixes]
[akpm@xxxxxxxxxxxxxxxxxxxx: tweak comment]
[arnd@xxxxxxxx: avoid uninitialized variable use]
Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@xxxxxxxx
Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@xxxxxxxxxxxxxxx
Signed-off-by: Yang Shi <yang.s@xxxxxxxxxxxxxxx>
Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Signed-off-by: Arnd Bergmann <arnd@xxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Sasha Levin <alexander.levin@xxxxxxxxxxxxx>
---
mm/khugepaged.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 898eb26f5dc8..48a39cbdf2d4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1678,10 +1678,14 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		spin_unlock(&khugepaged_mm_lock);

 	mm = mm_slot->mm;
-	down_read(&mm->mmap_sem);
-	if (unlikely(khugepaged_test_exit(mm)))
-		vma = NULL;
-	else
+	/*
+	 * Don't wait for semaphore (to avoid long wait times). Just move to
+	 * the next mm on the list.
+	 */
+	vma = NULL;
+	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+		goto breakouterloop_mmap_sem;
+	if (likely(!khugepaged_test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);

 	progress++;
--
2.15.1