Re: [PATCH 1/3] mm, lru_gen: batch update counters on aging

From: Kairui Song
Date: Mon Dec 25 2023 - 13:06:22 EST


Yu Zhao <yuzhao@xxxxxxxxxx> wrote on Mon, Dec 25, 2023 at 15:29:
>
> On Fri, Dec 22, 2023 at 3:24 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > When lru_gen is aging, it updates mm counters page by page,
> > which adds overhead if aging happens frequently or if many
> > pages in one generation are being moved.
> > Optimize this by updating the counters in batches.
> >
> > Although most __mod_*_state functions have their own caches, the
> > overhead is still observable.
> >
> > Tested in a 4G memcg on an EPYC 7K62 with:
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket \
> > -a 0766 -t 16 -B binary &
> >
> > memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys \
> > --key-minimum=1 --key-maximum=16000000 -d 1024 \
> > --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
> >
> > Average result of 18 test runs:
> >
> > Before: 44017.78 Ops/sec
> > After: 44687.08 Ops/sec (+1.5%)
> >
> > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > ---
> > mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 55 insertions(+), 9 deletions(-)
>
> Usually most reclaim activity happens in kswapd, e.g., from the
> MongoDB benchmark (--duration=900):
> pgscan_kswapd 11294317
> pgscan_direct 128
> And kswapd always has current->reclaim_state->mm_walk. So the
> following should bring the vast majority of the improvement (assuming
> it's not noise) with far less code change:

Hi Yu,

This won't work for the fault path (e.g., the memtier test):
Samples: 30K of event 'cycles', Event count (approx.): 69411674954
  Children      Self  Command    Shared Object      Symbol
-   85.95%     0.69%  memcached  [kernel.vmlinux]   [k] asm_exc_page_fault
   - 85.25% asm_exc_page_fault
      - 85.00% exc_page_fault
         - 84.81% do_user_addr_fault
            - 84.01% handle_mm_fault
               - 83.70% __handle_mm_fault
                  - 82.57% do_swap_page
                     - 61.66% mem_cgroup_swapin_charge_folio
                        - 61.11% charge_memcg
                           - 60.76% try_charge_memcg
                              - 60.68% try_to_free_mem_cgroup_pages
                                   do_try_to_free_pages
                                 - shrink_node
                                    - 60.51% shrink_lruvec
                                       - 60.45% try_to_shrink_lruvec
                                          + 60.42% evict_folios
                     + 10.00% __swap_entry_free
                     + 3.81% swap_read_folio_bdev_sync
                     + 1.49% __pte_offset_map_lock
                     + 0.92% swap_cache_get_folio
                     + 0.80% folio_add_lru
                     + 0.75% vma_alloc_folio
                     + 0.60% swap_read_folio
                  + 0.73% do_anonymous_page
              0.54% lock_vma_under_rcu

And:
sudo cat /sys/kernel/debug/lru_gen_full | grep -A 25 benchmark
memcg 72 /benchmark
 node 0
    218 3283 1x 0x
        0 0 0 0 0 0 0
        1 0 0 0 0 0 0
        2 0 0 0 0 0 0
        3 0 0 0 0 0 0
        0 0 0 0 0 0
    219 2472 2756 0
        0 14775r 303395e 0p 2r 2e 0p
        1 0r 0e 0p 0r 0e 0p
        2 0r 0e 0p 0r 0e 0p
        3 0r 0e 15262p 0r 0e 0p
        0 0 0 0 0 0
    220 1652 456032 22
        0 0 0 0 0 0 0
        1 0 0 0 0 0 0
        2 0 0 0 0 0 0
        3 0 0 0 0 0 0
        0 0 0 0 0 0
    221 808 578570 13
        0 15665R 309071T 0 0R 1T 0
        1 0R 0T 0 0R 0T 0
        2 0R 0T 0 0R 0T 0
        3 0R 15364T 0 0R 0T 0
        9191594L 3532525O 2425411Y 94393N 18515F 10578A

It ages fast.

It's hard to share the code with mm_walk because the next patch
moves the pages in bulk, and mm_walk has no such logic.
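
For reference, the batching idea boils down to something like the
following. This is only a simplified userspace sketch, not the code
from the patch: struct gen_counters, counter_mod() and both
move_pages_*() helpers are made-up names, and counter_mod() merely
stands in for the cost of a __mod_*_state call.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_GENS 4

/* Hypothetical stand-in for per-generation page counters (not a kernel type). */
struct gen_counters {
	long nr_pages[MAX_NR_GENS];
	unsigned long nr_updates;	/* how often the shared counters were touched */
};

/* Stand-in for a relatively expensive shared-counter update. */
static void counter_mod(struct gen_counters *c, int gen, long delta)
{
	c->nr_pages[gen] += delta;
	c->nr_updates++;
}

/* Unbatched: two counter updates for every folio that changes generation. */
static void move_pages_unbatched(struct gen_counters *c, int old_gen,
				 int new_gen, const long *nr, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		counter_mod(c, old_gen, -nr[i]);
		counter_mod(c, new_gen, nr[i]);
	}
}

/* Batched: accumulate the delta locally and flush it once at the end. */
static void move_pages_batched(struct gen_counters *c, int old_gen,
			       int new_gen, const long *nr, size_t n)
{
	long delta = 0;

	for (size_t i = 0; i < n; i++)
		delta += nr[i];

	if (delta) {
		counter_mod(c, old_gen, -delta);
		counter_mod(c, new_gen, delta);
	}
}
```

Both variants leave the counters in the same final state; the batched
one just touches them twice per batch instead of twice per folio.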

Indeed, the improvement is small with this benchmark. I'll update
with some other tests.