Re: [PATCH v3 5/9] mm/workingset: use the node counter if memcg is the root memcg

From: Joonsoo Kim
Date: Thu Mar 19 2020 - 02:24:06 EST


2020ë 3ì 19ì (ë) ìì 4:18, Johannes Weiner <hannes@xxxxxxxxxxx>ëì ìì:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@xxxxxxxxx wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. Swap cache's node is usually allocated by kswapd and it
> > isn't charged by kmemcg since it is from the kernel thread. So the swap
> > cache's shadow node is managed by the node list of the list_lru rather
> > than the memcg specific one.
> >
> > If counting the shadow node on the root memcg happens to reclaim the slab
> > object, the shadow node count returns the number of the shadow node on
> > the node list of the list_lru since root memcg has the kmem_cache_id, -1.
> >
> > However, the size of pages on the LRU is calculated by using the specific
> > memcg, so mismatch happens. This causes the number of shadow node not to
> > be increased to the enough size and, therefore, workingset detection
> > cannot work correctly. This patch fixes this bug by checking if the memcg
> > is the root memcg or not. If it is the root memcg, instead of using
> > the memcg-specific LRU, the system-wide LRU is used to calculate proper
> > size of the shadow node so that the number of the shadow node can grow
> > as expected.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > ---
> > mm/workingset.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> > */
> > #ifdef CONFIG_MEMCG
> > - if (sc->memcg) {
> > + /*
> > + * Kernel allocation on root memcg isn't regarded as allocation of
> > + * specific memcg. So, if sc->memcg is the root memcg, we need to
> > + * use the count for the node rather than one for the specific
> > + * memcg.
> > + */
> > + if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and only
> some host software runs in the root, yet that tiny root group will
> grow shadow entries in proportion to the entire RAM.

Okay.

> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
> not accounted to anyone. In that case, we should either
> 1a. Reclaim them to a fixed threshold, regardless of cgroup, or
> 1b. Not reclaim them at all. Or
> 2. We account all nodes to the groups for which they are allocated.
> Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> page_ref_add(page, nr);
> SetPageSwapCache(page);
>
> + memalloc_use_memcg(page_memcg(page));
> do {
> xas_lock_irq(&xas);
> xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> unlock:
> xas_unlock_irq(&xas);
> } while (xas_nomem(&xas, gfp));
> + memalloc_unuse_memcg();
>
> if (!xas_error(&xas))
> return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> return -ENOMEM;
> for (i = 0; i < nr; i++) {
> space = spaces + i;
> - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> + xa_init_flags(&space->i_pages,
> + XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> atomic_set(&space->i_mmap_writable, 0);
> space->a_ops = &swap_aops;
> /* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user. And
> the size is usually in the GB range. On my system, radix_tree_node is
> ~583 bytes, so a for a 16G swapfile, the swapcache xarray should max
> out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.

40M / 16G, 0.25% of the amount of the used swap looks okay to me.
I will rework the patch on that way.

Thanks.