Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

From: Yosry Ahmed
Date: Thu Apr 13 2023 - 06:46:26 EST


On Thu, Apr 13, 2023 at 3:40 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages. For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
>
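For context: before this patch, freeing paths outside vmscan credit
freed pages to the current task's reclaim_state, and vmscan later folds
that count into its own total. A minimal sketch of the pattern,
simplified from the mm/slub.c and mm/vmscan.c code this patch touches
(the helper names here are illustrative, not the kernel's):

	/* Freeing paths outside vmscan credit the current task, e.g. slab: */
	static void account_freed_slab_pages(unsigned int nr_pages)
	{
		if (current->reclaim_state)
			current->reclaim_state->reclaimed_slab += nr_pages;
	}

	/* vmscan later folds the count into its total (the hunks below): */
	static void fold_into_nr_reclaimed(struct scan_control *sc)
	{
		sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
		current->reclaim_state->reclaimed_slab = 0;
	}
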
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim. The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
>
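To make the partial-charging problem concrete: with 64-byte objects,
one 4 KiB slab page can hold objects belonging to up to 64 different
memcgs, since commit f2fe7b09a52b charges slab objects to memcgs
individually. If the memcg under reclaim owned a single object on a
page a shrinker just freed, crediting it with the whole page overstates
its reclaim by a factor of 64.
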
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more
> accurate, and may help reduce unnecessary reclaim retries during memcg
> charging. Writing to memory.reclaim on the root memcg is considered
> cgroup_reclaim(), but in this case we want to include any freed pages,
> so use the global_reclaim() check instead of !cgroup_reclaim().
>
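For reference, the two predicates contrasted above are defined in
mm/vmscan.c along these lines at the time of this patch (paraphrased;
the CONFIG_MEMCG=n stubs are omitted):

	/* Reclaim is on behalf of some memcg, possibly the root memcg. */
	static bool cgroup_reclaim(struct scan_control *sc)
	{
		return sc->target_mem_cgroup;
	}

	/* No target memcg, or the target is root: system-wide reclaim. */
	static bool global_reclaim(struct scan_control *sc)
	{
		return !sc->target_mem_cgroup ||
		       mem_cgroup_is_root(sc->target_mem_cgroup);
	}
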
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases
> (e.g. when a freed slab page was mostly charged to the memcg under
> reclaim), the return value of try_to_free_mem_cgroup_pages() can be
> underestimated, but this should be fine. The freed pages will be
> uncharged anyway, and we can charge the memcg the next time around,
> as we usually do memcg reclaim in a retry loop.
>
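The retry loop mentioned above is the one in the memcg charge path in
mm/memcontrol.c. A heavily simplified outline of it (the real
try_charge_memcg() also handles the high/max limit distinction,
throttling, and OOM; try_charge_sketch() is a made-up name):

	static int try_charge_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
				     unsigned int nr_pages)
	{
		unsigned int nr_retries = MAX_RECLAIM_RETRIES; /* 16 */
		struct page_counter *counter;

		while (!page_counter_try_charge(&memcg->memory, nr_pages,
						&counter)) {
			if (!nr_retries--)
				return -ENOMEM;
			/*
			 * Even if the return value underestimates what was
			 * freed, the freed pages were uncharged regardless,
			 * so the next try_charge attempt sees the headroom.
			 */
			try_to_free_mem_cgroup_pages(memcg, nr_pages,
						     gfp_mask, 0);
		}
		return 0;
	}
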
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")


Andrew, I removed the CC: stable tag as you were skeptical about the
need for a backport, but left the Fixes tag so that it's easy to
identify where to backport it if you and/or the stable maintainers
decide otherwise.

>
>
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> ---
> mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..be657832be48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> }
> #endif
>
> +/*
> + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> + * scan_control->nr_reclaimed.
> + */
> +static void flush_reclaim_state(struct scan_control *sc)
> +{
> + /*
> + * Currently, reclaim_state->reclaimed_slab includes three types of pages
> + * freed outside of vmscan:
> + * (1) Slab pages.
> + * (2) Clean file pages from pruned inodes (on highmem systems).
> + * (3) XFS freed buffer pages.
> + *
> + * For all of these cases, we cannot universally link the pages to a
> + * single memcg. For example, a memcg-aware shrinker can free one object
> + * charged to the target memcg, causing an entire page to be freed.
> + * If we count the entire page as reclaimed from the memcg, we end up
> + * overestimating the reclaimed amount (potentially under-reclaiming).
> + *
> + * Only count such pages for global reclaim to prevent under-reclaiming
> + * from the target memcg, which prevents unnecessary retries during
> + * memcg charging and false positives from proactive reclaim.
> + *
> + * For uncommon cases where the freed pages were actually mostly
> + * charged to the target memcg, we end up underestimating the reclaimed
> + * amount. This should be fine. The freed pages will be uncharged
> + * anyway, even if they are not counted here properly, and we will be
> + * able to make forward progress in charging (which is usually in a
> + * retry loop).
> + *
> + * We can go one step further, and report the uncharged objcg pages in
> + * memcg reclaim, to make reporting more accurate and reduce
> + * underestimation, but it's probably not worth the complexity for now.
> + */
> + if (current->reclaim_state && global_reclaim(sc)) {
> + sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> + current->reclaim_state->reclaimed_slab = 0;
> + }
> +}
> +
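For context, the reclaim_state being flushed here is the per-task
structure declared in include/linux/swap.h, which at this point in the
series still has its historical field name (simplified):

	struct reclaim_state {
		/* pages freed outside of LRU-based reclaim, despite the name */
		unsigned long reclaimed_slab;
	#ifdef CONFIG_LRU_GEN
		/* per-thread mm walk data for MGLRU */
		struct lru_gen_mm_walk *mm_walk;
	#endif
	};
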
> static long xchg_nr_deferred(struct shrinker *shrinker,
> struct shrink_control *sc)
> {
> @@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
> sc->nr_reclaimed - reclaimed);
>
> - sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> - current->reclaim_state->reclaimed_slab = 0;
> + flush_reclaim_state(sc);
>
> return success ? MEMCG_LRU_YOUNG : 0;
> }
> @@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
> static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> {
> - struct reclaim_state *reclaim_state = current->reclaim_state;
> unsigned long nr_reclaimed, nr_scanned;
> struct lruvec *target_lruvec;
> bool reclaimable = false;
> @@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
> shrink_node_memcgs(pgdat, sc);
>
> - if (reclaim_state) {
> - sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> - reclaim_state->reclaimed_slab = 0;
> - }
> + flush_reclaim_state(sc);
>
> /* Record the subtree's reclaim efficiency */
> if (!sc->proactive)
> --
> 2.40.0.577.gac1e443424-goog
>