Re: [Question]: major faults are still triggered after mlockall when numa balancing

From: Aneesh Kumar K.V
Date: Fri Nov 10 2023 - 13:18:34 EST


"zhangpeng (AS)" <zhangpeng362@xxxxxxxxxx> writes:

> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> The problem can be reproduced on the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in a user-mode process
> to avoid the performance problems caused by major faults.
>
> There is a window during NUMA fault handling in do_numa_page() in which
> the pte is 0: ptep_modify_prot_start() clears vmf->pte, and it stays
> clear until ptep_modify_prot_commit() assigns a value to vmf->pte.
>

A pte lookup does not expect the pte to be 0 once it has been
initialized (we check the pte value without holding the ptl, and if we
find it is 0 we return). So read-modify-write updates to the pte should
make sure they never clear the pte, right? powerpc does that by marking
the pte present but invalid. Can we do something similar for other
architectures? The default implementation of ptep_modify_prot_start(),
which uses ptep_get_and_clear(), can result in a pte lookup seeing the
wrong pte value, as explained in the report, because we don't hold the
ptl and don't recheck when we find pte == 0.
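
To make the difference concrete, here is a minimal user-space sketch of
the two strategies. It is a toy model, not kernel code: the toy_* names,
the bit layout and the use of C11 atomics are all assumptions made for
illustration, and real pte layouts are per-architecture. It only shows
why a lockless "pte == 0" check is fooled by a get-and-clear style start
but not by a mark-invalid style start.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy pte; not any real architecture's. */
#define TOY_PAGE_PRESENT (UINT64_C(1) << 0) /* entry is usable by hardware */
#define TOY_PAGE_INVALID (UINT64_C(1) << 1) /* mapped, but update in flight */

typedef _Atomic uint64_t toy_pte_t;

/* Lockless check, as done by a fault path that does not hold the ptl. */
static bool toy_pte_none(toy_pte_t *ptep)
{
    return atomic_load(ptep) == 0;
}

/* Generic-style start: fetch the old value and leave 0 behind. */
static uint64_t toy_start_get_and_clear(toy_pte_t *ptep)
{
    return atomic_exchange(ptep, 0);
}

/* powerpc-style start: drop PRESENT, set INVALID, keep the other bits. */
static uint64_t toy_start_mark_invalid(toy_pte_t *ptep)
{
    uint64_t old = atomic_load(ptep);
    uint64_t updated;

    do {
        updated = (old & ~TOY_PAGE_PRESENT) | TOY_PAGE_INVALID;
    } while (!atomic_compare_exchange_weak(ptep, &old, updated));

    return old;
}

/* Commit: publish the final value and end the update window. */
static void toy_commit(toy_pte_t *ptep, uint64_t val)
{
    /* In this toy model an update never ends with an empty entry. */
    assert(val != 0);
    atomic_store(ptep, val);
}
```

Between toy_start_get_and_clear() and toy_commit(), toy_pte_none()
returns true and a concurrent reader would take the "missing pte" path.
Between toy_start_mark_invalid() and toy_commit(), the entry is not
usable by hardware but is still nonzero, so the same reader does not
mistake it for an unpopulated entry.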


>
> For the data segment of the user-mode program, the global variable area
> is a private file mapping. After the page cache is populated, a private
> anonymous page is created when COW is triggered. mlockall can lock the
> COW pages (anonymous pages), but the original file pages cannot be
> locked and may be reclaimed. If a global variable (private anon page)
> is accessed while vmf->pte is zero, as set concurrently by the NUMA
> fault, a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1                      task 2
> ------                      ------
> /* scan global variables */
> do_numa_page()
>   spin_lock(vmf->ptl)
>   ptep_modify_prot_start()
>   /* set vmf->pte as null */
>                             /* Access global variables */
>                             handle_pte_fault()
>                               /* no pte lock */
>                               do_pte_missing()
>                                 do_fault()
>                                   do_read_fault()
>   ptep_modify_prot_commit()
>   /* ptep update done */
>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>                                     do_fault_around()
>                                     __do_fault()
>                                       filemap_fault()
>                                         /* page cache is not available
>                                         and a major fault is triggered */
>                                         do_sync_mmap_readahead()
>                                         /* page_not_uptodate and goto
>                                         out_retry. */
>
> Is there any way to avoid such a major fault?
>
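
For anyone who wants to watch for this from user space, a rough sketch
follows. It locks memory with mlockall() as in the report, touches a
stand-in for the global variable area, and reports the change in the
process's major-fault count from getrusage(). The globals array, the
helper names and the 4096-byte page size are assumptions for
illustration, and a single run is not expected to hit the race.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical stand-in for the program's global variable area (.data). */
static volatile char globals[256 * 1024] = { 1 };

/* Major faults taken so far by this process, or -1 on error. */
static long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/* Lock memory, touch every page of globals, return the major-fault delta. */
static long touch_globals_and_count(void)
{
    /* May fail under a low RLIMIT_MEMLOCK; report it and carry on. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        fprintf(stderr, "mlockall: %s (continuing unlocked)\n",
                strerror(errno));

    long before = major_faults();
    for (size_t i = 0; i < sizeof(globals); i += 4096) /* assumed page size */
        globals[i]++;
    long after = major_faults();

    if (before < 0 || after < 0)
        return -1;
    assert(after >= before); /* the counter only grows */
    return after - before;
}
```

Calling touch_globals_and_count() repeatedly in a long-running process,
with NUMA balancing enabled and memory pressure on the page cache, is
where a nonzero delta after mlockall() would be consistent with the
scenario described above.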

-aneesh