Re: [PATCH] mm: khugepaged: fix potential page state corruption

From: Yang Shi
Date: Wed Mar 25 2020 - 14:42:28 EST




On 3/25/20 4:26 AM, Kirill A. Shutemov wrote:
On Tue, Mar 24, 2020 at 10:17:13AM -0700, Yang Shi wrote:

On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
On 3/18/20 5:55 PM, Yang Shi wrote:
On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
When khugepaged collapses anonymous pages, the base pages would
be freed
via pagevec or free_page_and_swap_cache(). But, the anonymous page may
be added back to LRU, then it might result in the below race:

ÂÂÂÂCPU AÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ CPU B
khugepaged:
ÂÂ unlock page
ÂÂ putback_lru_page
ÂÂÂÂ add to lru
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ page reclaim:
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ isolate this page
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ try_to_unmap
ÂÂ page_remove_rmap <-- corrupt _mapcount

It looks nothing would prevent the pages from isolating by reclaimer.
Hm. Why should it?

try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
protected by ptl. And this particular _mapcount pin is reachable for
reclaim as it's not part of usual page table tree. Basically
try_to_unmap() will never succeeds until we give up the _mapcount on
khugepaged side.
I don't quite get. What does "not part of usual page table tree" means?

How's about try_to_unmap() acquires ptl before khugepaged?
The page table we are dealing with was detached from the process' page
table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
pte.
A follow-up question here. pmdp_collapse_flush() clears pmd entry and does
TLB shootdown on x86. I'm supposed the main purpose is to serialize fast gup
since it doesn't acquire any lock (mmap_sem, ptl ,etc), but disable
interrupt so the TLB shootdown IPI would get blocked. This could guarantee
synchronization on x86, but it looks not all architectures do TLB shootdown
or implement it via IPI, so how they could serialize with fast gup?
The main purpose of pmdp_collapse_flush() is to block access to pages
under collapse, including access via GUP (and its variants).

It's up to architecture to implement it correctly, including TLB flush vs.
GUP_fast serialization. Genetic way works fine for most architectures.
Notable exceptions are Power and S390.

Thanks. I was wondering how Power and S390 serialized it. It looks they didn't deal with it at all.

In addition it looks acquiring pmd lock is not necessary. Before both write
mmap_sem and write anon_vma lock are acquired which could serialize page
fault and rmap walk, so it looks fast gup is the only one which could run
concurrently, but fast gup doesn't acquire ptl at all. It seems the
pmd_lock/unlock could be removed.
This is likely true. And we have a comment there. But taking uncontended
lock is check, so why not.

Yes, it sounds not harmful.