Re: [RFC v7 00/11] Support vrange for anonymous page

From: Paul Turner
Date: Tue Mar 12 2013 - 19:17:38 EST


On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> First of all, let's define the term.
> From now on, I'd like to call it vrange (a.k.a. volatile range)
> for anonymous pages. If you have a better name in mind, please suggest it.
>
> This version is still an *RFC* because it's just a quick prototype, so
> it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86.
> Before sorting out further issues, I'd like to post the current direction
> and discuss it. Of course, I'd like to extend this discussion at the
> upcoming LSF/MM.
>
> In this version, I changed lots of things. In particular, I removed the
> vma-based approach because it needs the write-side lock for mmap_sem, which
> would hurt performance on multi-threaded big SMP systems, as KOSAKI pointed
> out. The vma-based approach also makes it hard to meet the requirements of
> the new system call semantics that John Stultz suggested for consistent
> purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with a modified jemalloc allocator. The port was
> led by Jason Evans (the jemalloc author), who was interested in this feature
> and was happy to port his allocator to use the new system call.
> Super Thanks Jason!
>
> The benchmark used for testing is ebizzy. It has been used for testing
> allocator performance, so it's a good fit for me. Again, thanks for
> recommending the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
>
> And it's much improved on a KVM virtual machine.
>
> This patchset is based on v3.9-rc2
>
> - What is sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that the user delivers to the kernel so the kernel can
> *discard* pages in a range at any time. mode is one of VRANGE_VOLATILE
> and VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so
> the kernel can no longer discard any pages in the range, while
> VRANGE_VOLATILE is a memory unpin operation, so the kernel can discard
> pages in the vrange at any time. At the moment, behavior is one of
> VRANGE_FULL and VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once
> it decides to discard a page in a vrange, it should discard all of the
> pages in the vrange selected as the victim. VRANGE_PARTIAL tells the
> kernel that it should discard some of the pages in a vrange. I haven't
> implemented VRANGE_PARTIAL handling yet.
>
> - What happens if the user accesses a page (i.e., a virtual address)
> discarded by the kernel?
>
> The user can encounter SIGBUS.
>
> - What should the user do to avoid SIGBUS?
> He should call vrange(addr, length, VRANGE_NOVOLATILE, behavior) before
> accessing a range on which
> vrange(addr, length, VRANGE_VOLATILE, behavior) was called.
>
> - What happens if the user accesses a page (i.e., a virtual address)
> that wasn't discarded by the kernel?
>
> The user sees the valid data that was there before calling
> vrange(., VRANGE_VOLATILE), without a page fault.
>
> - How does it differ from madvise(DONTNEED)?
>
> System call semantic
>
> DONTNEED makes sure the user always sees zero-filled pages after
> calling madvise, while with vrange the user can see the data or
> encounter SIGBUS.
>
> Internal implementation
>
> madvise(DONTNEED) has to zap all mapped pages in the range, so its
> overhead increases linearly with the number of mapped pages.
> Moreover, if the user then writes to the zapped pages, a page fault +
> page allocation + memset has to happen.
>
> vrange just registers an address range instead of zapping all of the
> ptes in the vma, so it doesn't touch the ptes at all.
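
For reference, the DONTNEED half of that comparison can be checked from
user space with existing calls. A minimal sketch, assuming Linux and a
private anonymous mapping; the helper name dontneed_zero_fills() is made
up for illustration and is not part of the patchset:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one anonymous page, drop it with
 * madvise(MADV_DONTNEED), then read it back.
 * Returns 1 if the page reads back as all zeroes, 0 if not, -1 on error.
 */
int dontneed_zero_fills(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Touch the page so it is really mapped and holds non-zero data. */
    memset(p, 0xAB, len);

    /* The kernel zaps the mapped page here, whether or not memory is tight. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* Reading faults in a fresh zero-filled page; the old data is gone. */
    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

The zap and the later refault are paid on every call, which is the cost
the vrange approach tries to avoid when there is no memory pressure.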
>
> - What's the benefit compared to DONTNEED?
>
> 1. The system call overhead is smaller because vrange just registers
> a range in an interval tree instead of zapping all the pages in the
> range, so it should be really cheap.
>
> 2. It has a chance to eliminate overheads (e.g., zapping pte + page fault
> + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
> severe.
>
> 3. It has the potential to zap all ptes and free the pages if memory
> pressure is severe, so discard scanning overhead could be smaller - TODO
>
> - What is it targeting?
>
> Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
> management of virtual machines like Dalvik. It also comes in handy for
> embedded systems that have no swap device and so can't reclaim anonymous
> pages. By discarding instead of swapping out, it can be used on swapless
> systems.

I think that another potentially useful use-case would be using this
-- or a similar API -- to opportunistically return deep user stack
frames.

This is another place where we strongly care about the time-to-free as
well as the time-to-reallocate in the case of relatively immediate
re-use.
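
To make the intended calling pattern concrete, here is a rough user-space
model of it. This is only a sketch under stated assumptions: it does not
call the proposed vrange(2) at all, every model_* name is hypothetical,
and the "returns 1 if the range was purged" convention on unpin is my
assumption, not something the cover letter specifies. What the data looks
like after unpinning a purged range is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for VRANGE_VOLATILE / VRANGE_NOVOLATILE. */
enum model_mode { MODEL_VOLATILE, MODEL_NOVOLATILE };

/* One tracked range, e.g. the unused deep part of a thread stack. */
struct model_range {
    bool is_volatile; /* unpinned: the kernel may discard it */
    bool purged;      /* discarded while it was volatile */
};

/*
 * Models vrange(addr, length, mode, behavior) for a single range.
 * ASSUMPTION: unpinning reports whether the range was purged (1) or not (0).
 */
int model_vrange(struct model_range *r, enum model_mode mode)
{
    assert(r != NULL);
    if (mode == MODEL_VOLATILE) {
        r->is_volatile = true;
        return 0;
    }
    int was_purged = r->purged ? 1 : 0;
    r->is_volatile = false;
    r->purged = false;
    return was_purged;
}

/* Models reclaim under memory pressure: only volatile ranges are discarded. */
void model_memory_pressure(struct model_range *r)
{
    assert(r != NULL);
    if (r->is_volatile)
        r->purged = true;
}

/* Returns 0 if an access would succeed, -1 where the task would get SIGBUS. */
int model_access(const struct model_range *r)
{
    assert(r != NULL);
    return (r->is_volatile && r->purged) ? -1 : 0;
}
```

For the stack case this would mean marking the frames below the current
stack pointer volatile when unwinding, and unpinning them before growing
back down. If no pressure hit in between, the re-use costs neither a zap
nor a refault.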

>
> Changelog from v6 - There are many changes.
> * Remove vma-based approach
> * Change system call semantic
> * Add more meaningful experiment
>
> Changelog from v5 - There are many changes.
>
> * Support CONFIG_VOLATILE_PAGE
> * Working with THP/KSM
> * Remove vma hacking logic in m[no]volatile system call
> * Discard page without swap cache
> * Kswapd discards volatile pages, so we can discard volatile pages
> even though we don't have swap.
>
> Changelog from v4
>
> * Add new system call mvolatile/mnovolatile
> * Add sigbus when user try to access volatile range
> * Rebased on v3.7
> * Applied bug fix from John Stultz, Thanks!
>
> Changelog from v3
>
> * Removing madvise(addr, length, MADV_NOVOLATILE).
> * add vmstat about the number of discarded volatile pages
> * discard volatile pages without promotion in reclaim path
>
> Minchan Kim (11):
> vrange: enable generic interval tree
> add vrange basic data structure and functions
> add new system call vrange(2)
> add proc/pid/vrange information
> Add purge operation
> send SIGBUS when user try to access purged page
> keep mm_struct to vrange when system call context
> add LRU handling for victim vrange
> Get rid of depenceny that all pages is from a zone in shrink_page_list
> Purging vrange pages without swap
> add purged page information in vmstat
>
> arch/x86/include/asm/pgtable_types.h | 2 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/proc/base.c | 1 +
> fs/proc/internal.h | 6 +
> fs/proc/task_mmu.c | 129 ++++++
> include/asm-generic/pgtable.h | 11 +
> include/linux/mm_types.h | 5 +
> include/linux/rmap.h | 15 +-
> include/linux/swap.h | 1 +
> include/linux/vm_event_item.h | 4 +
> include/linux/vrange.h | 59 +++
> include/uapi/asm-generic/mman-common.h | 5 +
> init/main.c | 2 +
> kernel/fork.c | 3 +
> lib/Makefile | 2 +-
> mm/Makefile | 2 +-
> mm/ksm.c | 2 +-
> mm/memory.c | 24 +-
> mm/rmap.c | 23 +-
> mm/swapfile.c | 36 ++
> mm/vmscan.c | 74 +++-
> mm/vmstat.c | 4 +
> mm/vrange.c | 754 +++++++++++++++++++++++++++++++++
> 23 files changed, 1143 insertions(+), 22 deletions(-)
> create mode 100644 include/linux/vrange.h
> create mode 100644 mm/vrange.c
>
> --
> 1.8.1.1
>