Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

From: Kairui Song
Date: Thu Jan 11 2024 - 13:25:01 EST


On Thu, Jan 11, 2024 at 15:02, Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> Could you try the attached patch on the mainline v6.7 and see how it
> compares with the results above? Thanks.

Hi Yu,

Thanks for the patch. It helped to some degree, but it is not as
effective. On that dedicated bare-metal machine, I redid the setup,
rebased onto 6.7 mainline, and reran the test:

Refault distance series:
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL         4224    27030724835.9       0.16 txn/s
------------------------------------------------------------------
  TOTAL               4224    27030724835.9       0.16 txn/s

workingset_nodes 111349
workingset_refault_anon 261331
workingset_refault_file 42862224
workingset_activate_anon 0
workingset_activate_file 13803763
workingset_restore_anon 250743
workingset_restore_file 599031
workingset_nodereclaim 23708

memcg 67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
 node 0
    10     347163     518379     207449
        0        0r        2e        0p    33017r  1726749e        0p
        1        0r        0e        0p     7278r   496268e        0p
        2        0r        0e        0p    19789r    55418e        0p
        3        0r        0e        0p        0r        0e  4747801p
                  0         0         0         0         0         0
    11     283279     154400    4791558
        0         0         0         0         0         0         0
        1         0         0         0         0         0         0
        2         0         0         0         0         0         0
        3         0         0         0         0         0         0
                  0         0         0         0         0         0
    12     158723     431513      37647
        0         0         0         0         0         0         0
        1         0         0         0         0         0         0
        2         0         0         0         0         0         0
        3         0         0         0         0         0         0
                  0         0         0         0         0         0
    13      44775     104986      27258
        0      576R      982T         0  2488768R  5769505T         0
        1        0R        0T         0  2335910R  3357277T         0
        2        0R        0T         0   647398R   753021T         0
        3        0R       20T         0    52725R  4740516T         0
           2819476L    31196O  2551928Y     8298N     5549F     5329A

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             12.81       546.32        39.04         0.00     520178      37171          0
dm-1              0.05         1.10         0.00         0.00       1044          0          0
nvme0n1          13.17       561.99        41.19         0.00     535103      39219          0
nvme1n1        5220.39    227385.96      1028.17         0.00  216505545     978976          0
zram0          2440.61      2856.32      6907.13         0.00    2719644    6576628          0

               total        used        free      shared  buff/cache   available
Mem:           31830       11251         332           0       20246       20144
Swap:          31829        3761       28068

Your attachment:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL         4070    27170023578.4       0.15 txn/s
------------------------------------------------------------------
  TOTAL               4070    27170023578.4       0.15 txn/s

workingset_nodes 121864
workingset_refault_anon 430917
workingset_refault_file 42915675
workingset_activate_anon 100194
workingset_activate_file 21619480
workingset_restore_anon 100194
workingset_restore_file 165054
workingset_nodereclaim 26851

memcg 65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
 node 0
     8     418689      55033     106862
        0       16r       17e        0p  2789768r  6034831e        0p
        1        0r        0e        0p   239664r   490278e        0p
        2        0r        0e        0p    79145r   126408e        0p
        3       23r       23e        0p    23404r    27107e  4736933p
                  0         0         0         0         0         0
     9     322798     237713    4759110
        0         0         0         0         0         0         0
        1         0         0         0         0         0         0
        2         0         0         0         0         0         0
        3         0         0         0         0         0         0
                  0         0         0         0         0         0
    10     182729     942701       5348
        0         0         0         0         0         0         0
        1         0         0         0         0         0         0
        2         0         0         0         0         0         0
        3         0         0         0         0         0         0
                  0         0         0         0         0         0
    11     120287        560        375
        0    25187R    29324T         0  1679308R  4256147T         0
        1        0R        0T         0   153592R   364122T         0
        2        0R        0T         0    51825R    98646T         0
        3      101R     2944T         0    13985R  4743515T         0
           7702245L   865749O  6514831Y    16843N    15088F    14167A

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             11.49       489.97        41.80         0.00     488006      41633          0
dm-1              0.05         1.05         0.00         0.00       1044          0          0
nvme0n1          11.83       504.95        43.86         0.00     502932      43682          0
nvme0n1        5145.44    218803.29       984.46         0.00  217928081     980520          0
zram0          3164.11      4399.55      8257.84         0.00    4381952    8224812          0

               total        used        free      shared  buff/cache   available
Mem:           31830       11583         310           1       19935       19809
Swap:          31829        3710       28119

The refault distance series still has better performance and lower total IO.
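
To put rough numbers on that comparison, here is a small sketch that
computes the relative change of a few counters between the two
bare-metal runs. The figures are copied from the output above; the
dictionary keys and the relative_change() helper are names made up for
this illustration, and "data_disk_kB_read" refers to the busy NVMe row
(nvme1n1 in the first run; the second run's output labels both NVMe
rows nvme0n1).

```python
# Counters copied from the two bare-metal runs quoted above.
refault_distance = {
    "txns": 4224,
    "workingset_refault_anon": 261331,
    "workingset_refault_file": 42862224,
    "data_disk_kB_read": 216505545,  # busy NVMe row, see note in the lead-in
    "zram0_kB_read": 2719644,
    "zram0_kB_wrtn": 6576628,
}
attached_patch = {
    "txns": 4070,
    "workingset_refault_anon": 430917,
    "workingset_refault_file": 42915675,
    "data_disk_kB_read": 217928081,
    "zram0_kB_read": 4381952,
    "zram0_kB_wrtn": 8224812,
}


def relative_change(base, other):
    """Percentage change of each counter in `other` relative to `base`."""
    return {key: (other[key] - base[key]) / base[key] * 100.0 for key in base}


changes = relative_change(refault_distance, attached_patch)

if __name__ == "__main__":
    for key, pct in changes.items():
        print(f"{key:26s} {pct:+8.2f}%")
```

With these inputs the file refault and data disk reads differ by less
than 1%, while anon refault and zram traffic are clearly higher with the
attached patch.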

Similar result on that VM:
==================================================================
Execution Results after 907 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL         1667    27151581934.5       0.06 txn/s
------------------------------------------------------------------
  TOTAL               1667    27151581934.5       0.06 txn/s

While the refault distance series had about 2500 - 2600 txns, mainline
6.7 had about 800 - 900 txns.

Loop test so far:
Using the refault distance series (previous result; it doesn't change much anyway):
STOCK_LEVEL 2605 27120667462.8 0.10 txn/s
STOCK_LEVEL 3000 27106854857.2 0.11 txn/s
STOCK_LEVEL 2925 27066601064.4 0.11 txn/s
STOCK_LEVEL 2757 27035248005.2 0.10 txn/s
STOCK_LEVEL 1325 28053716046.8 0.05 txn/s
STOCK_LEVEL 717 27455091366.3 0.03 txn/s
STOCK_LEVEL 967 27404085208.2 0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716

Using the attached patch:
STOCK_LEVEL 1667 27151581934.5 0.06 txn/s
STOCK_LEVEL 2999 27085125092.3 0.11 txn/s
STOCK_LEVEL 2874 27120635371.2 0.11 txn/s
STOCK_LEVEL 2658 27139142413.9 0.10 txn/s
STOCK_LEVEL 1254 27526009063.7 0.05 txn/s
STOCK_LEVEL 993 28065506801.8 0.04 txn/s
STOCK_LEVEL 954 27226012906.3 0.04 txn/s
Refault stat here:
workingset_refault_anon 383579
workingset_refault_file 205493832

The peak performance is almost equal, but it still starts slow, and the
refault count is higher too. The file refault count might be skewed by
some IO layer issue, but the anon refault count is always accurate.
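
The loop results above can be summarized with a short sketch. The
numbers are copied from the two lists; summarize() and the variable
names are hypothetical helpers for this illustration only.

```python
# Per-loop STOCK_LEVEL transaction counts, copied from the lists above.
refault_distance_txns = [2605, 3000, 2925, 2757, 1325, 717, 967]
attached_patch_txns = [1667, 2999, 2874, 2658, 1254, 993, 954]

# Refault counters reported after each loop test.
refaults = {
    "refault_distance": {"anon": 109337, "file": 191249716},
    "attached_patch": {"anon": 383579, "file": 205493832},
}


def summarize(txns):
    """First-loop, peak and total transaction counts for one loop test."""
    return {"first": txns[0], "peak": max(txns), "total": sum(txns)}


rd_summary = summarize(refault_distance_txns)
ap_summary = summarize(attached_patch_txns)

# How many times more refaults the attached patch saw.
anon_ratio = refaults["attached_patch"]["anon"] / refaults["refault_distance"]["anon"]
file_ratio = refaults["attached_patch"]["file"] / refaults["refault_distance"]["file"]

if __name__ == "__main__":
    print("refault distance:", rd_summary)
    print("attached patch:  ", ap_summary)
    print(f"anon refault ratio: {anon_ratio:.2f}x, file refault ratio: {file_ratio:.2f}x")
```

The peaks are 3000 vs 2999, the first loop is 2605 vs 1667, and the
anon refault count is roughly 3.5 times higher with the attached patch.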

I see the improvements you made in the attached patch. I think they
actually do not conflict with the refault distance series. Maybe they
can be combined for an even better result.

Refault distance (which was originally used by the active/inactive LRU)
is used here to give evicted pages priorities based on their eviction
distance and to add extra feedback to PID and gen, while the PID info
recorded in page flags/shadow represents the page's access pattern
before eviction, and all the checks and logic around it can also be
improved.

One critical effect of the refault distance series that boosts the
MongoDB startup (and I haven't seen any negative effect of it on other
tests / workloads / benchmarks yet, except the overhead of the memcg
statistics itself) is that it prevents overprotecting of tier 0 pages:
that is, a tier 0 page that was evicted but refaulted very quickly
(refault distance < LRU / MAX_NR_GEN; this value may be worth some more
adjustment, but with LRU / MAX_NR_GEN it can be imagined as having a
small shadow gen holding these page shadows...) will be categorised as
tier 1 and get protected. Otherwise, if I got everything right, when
most pages are stuck in tier 0 and keep refaulting, tier 0 will have a
very high refault rate and no pages will be protected, until randomness
causes quick repeated reads of some pages, so they get promoted to tier
3 and get protected.
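
The tier 0 rule described above can be written down as a toy model. This
is only a sketch of the rule as stated in this mail, not the code of the
series: refault_tier() and its parameters are hypothetical names, the
LRU size is taken as a plain page count, and the divisor 4 is the
kernel's MAX_NR_GENS value (written as MAX_NR_GEN above).

```python
MAX_NR_GENS = 4  # kernel constant; an assumption that this is the divisor meant above


def refault_tier(evicted_tier, refault_distance, lru_size, max_nr_gens=MAX_NR_GENS):
    """Tier a refaulted page is treated as, under the rule sketched in the mail.

    A page evicted from tier 0 that comes back within lru_size / max_nr_gens
    evictions is treated as tier 1 (and so becomes eligible for protection).
    Every other page keeps the tier it had when it was evicted.
    """
    threshold = lru_size // max_nr_gens
    if evicted_tier == 0 and refault_distance < threshold:
        return 1
    return evicted_tier


if __name__ == "__main__":
    lru_size = 1000  # made-up LRU size in pages
    for distance in (100, 249, 250, 900):
        print(distance, refault_tier(0, distance, lru_size))
```

With an LRU of 1000 pages the threshold is 250, so a tier 0 page
refaulting at distance 100 is treated as tier 1, while one refaulting at
distance 900 stays in tier 0.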

Now min_seq contains lower tier pages, and new pages will be added to
min_seq too, so min_seq will stay for a long time, while min_seq + 1
holds protected full-ref tier 3 pages, and they stay long enough to get
promoted to tier 3 again, so they will always be kept in memory.
At that point MongoDB will perform well even without the refault
distance series, but reaching it may take a long time (~15 min for the
MongoDB test on a SATA SSD, which is based on a real workload), long
enough to cause real issues.

And this also means PID won't react to workload changes fast enough.

Also, the refs value of refaulted anon pages is adjusted by refault
distance in the series too. It tries to split the whole LRU into at
least two gens for refaulted pages (only pages with refault distance <
LRU / MIN_NR_GEN will have full refs set; the others will have refs - 1
set as a penalty for pages that were evicted long ago and unused, which
complies with LRU's nature). This actually seems to have decreased the
refault of anon pages.
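
The anon refs adjustment can be modelled the same way. Again this is a
sketch of the rule as worded in this mail, not the series' code:
anon_refault_refs() and full_refs are hypothetical names, the divisor 2
is the kernel's MIN_NR_GENS value (written as MIN_NR_GEN above), and
clamping at zero is an assumption added so the toy model never returns
a negative count.

```python
MIN_NR_GENS = 2  # kernel constant; an assumption that this is the divisor meant above


def anon_refault_refs(full_refs, refault_distance, lru_size, min_nr_gens=MIN_NR_GENS):
    """refs value given to a refaulted anon page under the rule sketched in the mail.

    Pages that refault within lru_size / min_nr_gens evictions keep the full
    refs value; pages that were evicted longer ago get one less as a penalty.
    """
    threshold = lru_size // min_nr_gens
    if refault_distance < threshold:
        return full_refs
    return max(full_refs - 1, 0)  # clamp is an assumption, not from the mail


if __name__ == "__main__":
    lru_size = 1000  # made-up LRU size in pages
    for distance in (100, 499, 500, 900):
        print(distance, anon_refault_refs(3, distance, lru_size))
```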

There are some other issues that the refault distance series is trying
to solve too, e.g. if there is a user agent forcing MGLRU to age
periodically for proactive memory reclaim, or MGLRU simply ages fast,
min_seq will grow periodically and PID won't catch enough feedback
using the previous logic.
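
For reference, such a user agent would typically drive aging through the
MGLRU debugfs file. The sketch below only builds the command string,
following the "+ memcg_id node_id max_gen_nr [can_swap [force_scan]]"
format from the kernel's Multi-Gen LRU admin guide as I understand it;
aging_command() is a hypothetical helper, and the ids in the example
(memcg 67, node 0, max gen 13) are taken from the first dump above.

```python
def aging_command(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    """Build the string an agent would write to /sys/kernel/debug/lru_gen to age.

    can_swap and force_scan are optional trailing fields; force_scan can only
    be given together with can_swap because the fields are positional.
    """
    if force_scan is not None and can_swap is None:
        raise ValueError("force_scan requires can_swap to be set")
    fields = ["+", str(memcg_id), str(node_id), str(max_gen_nr)]
    if can_swap is not None:
        fields.append(str(int(can_swap)))
    if force_scan is not None:
        fields.append(str(int(force_scan)))
    return " ".join(fields)


if __name__ == "__main__":
    # Printing only; actually writing this needs root and a mounted debugfs.
    print(aging_command(67, 0, 13))
    print(aging_command(67, 0, 13, can_swap=True, force_scan=False))
```

An agent issuing this periodically is the case described above, where
min_seq keeps advancing regardless of the feedback PID has collected.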