Re: [PATCH 5/7] mm: free user PTE page table pages

From: Qi Zheng
Date: Mon Jul 19 2021 - 09:55:15 EST


On 7/19/21 6:01 AM, Kirill A. Shutemov wrote:
On Sun, Jul 18, 2021 at 12:30:31PM +0800, Qi Zheng wrote:
Some malloc libraries (e.g. jemalloc or tcmalloc) usually
allocate a large amount of virtual address space with
mmap() and do not unmap it. When they want to release
physical memory, they use madvise(MADV_DONTNEED). But the
page tables are not freed by madvise(), so a process that
touches an enormous virtual address space can end up with
many page tables.

The following figures are a memory usage snapshot of one
process that actually happened on our server:

VIRT: 55t
RES: 590g
VmPTE: 110g

As we can see, the PTE page tables take up 110g, while the
RES is 590g. In theory, the process only needs about 1.2g
of PTE page tables to map that physical memory. The reason
the PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only empties the PTEs and frees the
physical memory, but does not free the PTE page table
pages themselves. So we can free those empty PTE page
tables to save memory. In the above case, we can save
about 108g of memory (best case). And the larger the
difference between VIRT and RES, the more memory we save.
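As a back-of-the-envelope check (assuming the server uses
4KiB base pages, so each PTE page of 512 entries maps
2MiB), the 1.2g and 108g figures follow from:

```latex
\underbrace{\frac{590\,\text{GiB}}{512}}_{\text{PTE pages needed for RES}} \approx 1.15\,\text{GiB}
\quad\Rightarrow\quad
\underbrace{110\,\text{GiB}}_{\text{VmPTE}} - 1.15\,\text{GiB} \approx 108.85\,\text{GiB saved}
```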

In this patch series, we add a pte_refcount field to the
struct page of a page table page to track how many users
the PTE page table has. Similar to the page refcount
mechanism, a user of a PTE page table should hold a
refcount on it before accessing it. The PTE page table
page is freed when the last refcount is dropped.

The patch is very hard to review.

Could you split the introduction of the new API into a
separate patch, with proper documentation of the API?

Good idea, I will do it.


Why is pte_refcount atomic? It looks like you do
everything under pmd_lock(). Am I missing something?

When we do pte_get_unless_zero(), we hold the pmd lock to
protect against free_pte_table(). But we don't need to
hold the pmd lock when we do pte_get()/pte_put() in the
mapping/unmapping paths, so the refcount must be atomic.


And performance numbers should be included. I don't expect pmd_lock() in
all hotpaths to scale well.


Yeah, so we replace the pmd lock with the RCU read lock in
some paths in a subsequent patch (mm: defer freeing PTE
page table for a grace period).

Thanks,

Qi