Re: [PATCH] mm: khugepaged: fix potential page state corruption

From: Yang Shi
Date: Fri Mar 20 2020 - 12:34:31 EST




On 3/20/20 4:45 AM, Kirill A. Shutemov wrote:
On Thu, Mar 19, 2020 at 09:57:47AM -0700, Yang Shi wrote:

On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
On 3/18/20 5:55 PM, Yang Shi wrote:
On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
When khugepaged collapses anonymous pages, the base pages would
be freed
via pagevec or free_page_and_swap_cache(). But, the anonymous page may
be added back to LRU, then it might result in the below race:

ÂÂÂÂCPU AÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ CPU B
khugepaged:
ÂÂ unlock page
ÂÂ putback_lru_page
ÂÂÂÂ add to lru
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ page reclaim:
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ isolate this page
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ try_to_unmap
ÂÂ page_remove_rmap <-- corrupt _mapcount

It looks nothing would prevent the pages from isolating by reclaimer.
Hm. Why should it?

try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
protected by ptl. And this particular _mapcount pin is reachable for
reclaim as it's not part of usual page table tree. Basically
try_to_unmap() will never succeeds until we give up the _mapcount on
khugepaged side.
I don't quite get. What does "not part of usual page table tree" means?

How's about try_to_unmap() acquires ptl before khugepaged?
The page table we are dealing with was detached from the process' page
table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
pte.

try_to_unmap() can only reach the ptl if split ptl is disabled
(mm->page_table_lock is used), but it still will not be able to reach pte.
Aha, got it. Thanks for explaining. I definitely missed this point. Yes,
pmdp_collapse_flush() would clear the pmd, then others won't see the page
table.

However, it looks the vmscan would not stop at try_to_unmap() at all,
try_to_unmap() would just return true since pmd_present() should return
false in pvmw. Then it would go all the way down to __remove_mapping(), but
freezing the page would fail since try_to_unmap() doesn't actually drop the
refcount from the pte map.
No. try_to_unmap() checks mapcount at the end and only returns true if
it's zero.

Aha, yes, you are right. It does check mapcount. It would not go that far.


It would not result in any critical problem AFAICT, but suboptimal and it
may causes some unnecessary I/O due to swap.

I don't see the issue right away.

The other problem is the page's active or unevictable flag might be
still set when freeing the page via free_page_and_swap_cache().
So what?
The flags may leak to page free path then kernel may complain if
DEBUG_VM is set.
Could you elaborate on what codepath you are talking about?
__put_page ->
ÂÂÂ __put_single_page ->
ÂÂÂ ÂÂÂ free_unref_page ->
ÂÂÂ ÂÂÂ ÂÂÂ put_unref_page_prepare ->
ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ free_pcp_prepare ->
ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ free_pages_prepare ->
ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ ÂÂÂ free_pages_check

This check would just be run when DEBUG_VM is enabled.
I'm not 100% sure, but I belive these flags will ge cleared on adding into
lru:

release_pte_page()
putback_lru_page()
lru_cache_add()
__lru_cache_add()
__pagevec_lru_add()
__pagevec_lru_add_fn()
__pagevec_lru_add_fn()

No, adding into lru would not clear the flags. But, I finally found where they get cleared. They get cleared by:

__put_single_page() ->
ÂÂÂ __page_cache_release() ->
ÂÂÂ ÂÂÂ if page is lru
ÂÂÂ ÂÂÂ del_page_from_lru_list(page, lruvec, page_off_lru(page));

page_off_lru() would clear active and unevictable flags.

The name (__page_cache_release) sounds a little bit confusing I misunderstood it just would done something for page cache.

So, it looks the code does depend on putting page back to lru to release it. Nothing is wrong, just a little bit unproductive IMHO. Sorry for the rash patch. And thank you for your time again.