Re: [patch v2] mm: memcontrol: convert reclaim iterator to simple css refcounting

From: Johannes Weiner
Date: Wed Sep 24 2014 - 13:17:07 EST


On Wed, Sep 24, 2014 at 06:47:39PM +0200, Michal Hocko wrote:
> On Fri 19-09-14 17:28:43, Johannes Weiner wrote:
> > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Fri, 19 Sep 2014 12:39:18 -0400
> > Subject: [patch v2] mm: memcontrol: convert reclaim iterator to simple css
> > refcounting
> >
> > The memcg reclaim iterators use a complicated weak reference scheme to
> > prevent pinning cgroups indefinitely in the absence of memory pressure.
> >
> > However, during the ongoing cgroup core rework, css lifetime has been
> > decoupled such that a pinned css no longer interferes with removal of
> > the user-visible cgroup, and all this complexity is now unnecessary.
>
> I very much welcome simplification in this area but I would also very much
> appreciate more details why some checks are no longer needed. Why don't
> we need ->generation or (next_css->flags & CSS_ONLINE) check anymore?

Vladimir pointed out that the generation was still needed. I added it
back and will submit version 2 after the lockless counters have been
sorted out.

Argh, I thought CSS_ONLINE was an artifact obsoleted by the
css_tryget_online() conversion. That's quite the hand grenade.

Tejun, should the iterators maybe not return a css before it has
CSS_ONLINE set? It seems odd to have memcg reach into cgroup like
that to check whether published objects are actually fully initialized.
Background is this patch:

commit d8ad30559715ce97afb7d1a93a12fd90e8fff312
Author: Hugh Dickins <hughd@xxxxxxxxxx>
Date: Thu Jan 23 15:53:32 2014 -0800

mm/memcg: iteration skip memcgs not yet fully initialized

It is surprising that the mem_cgroup iterator can return memcgs which
have not yet been fully initialized. By accident (or trial and error?)
this appears not to present an actual problem; but it may be better to
prevent such surprises, by skipping memcgs not yet online.

Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> [3.12+]
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
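
For reference, the check that commit describes boils down to something
like the sketch below. It is a simplified userspace illustration, not
the kernel code: a flat linked list stands in for the cgroup hierarchy
walk, memory ordering around the flag is ignored, and struct node,
NODE_ONLINE, next_online() and demo_skip_offline() are hypothetical
names modelled on the css flag check.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_ONLINE 0x1u	/* hypothetical flag, modelled on CSS_ONLINE */

struct node {
	unsigned int flags;
	struct node *next;
};

/*
 * Return the first online node after @prev, or starting from @head
 * when @prev is NULL.  Nodes that are published on the list but not
 * yet marked online are skipped.
 */
struct node *next_online(struct node *head, struct node *prev)
{
	struct node *n = prev ? prev->next : head;

	while (n && !(n->flags & NODE_ONLINE))
		n = n->next;
	return n;
}

/* Walk a -> b -> c where b is visible but not yet online. */
int demo_skip_offline(void)
{
	struct node c = { NODE_ONLINE, NULL };
	struct node b = { 0, &c };
	struct node a = { NODE_ONLINE, &b };

	assert(next_online(&a, NULL) == &a);
	assert(next_online(&a, &a) == &c);	/* b is skipped */
	assert(next_online(&a, &c) == NULL);
	return 1;
}
```

The point of the question to Tejun is where this filter should live: in
the caller, as above, or inside the iterator itself.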

> > rcu_read_lock();
> > - while (!memcg) {
> > - struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> > - int uninitialized_var(seq);
> >
> > - if (reclaim) {
> > - struct mem_cgroup_per_zone *mz;
> > + if (reclaim) {
> > + mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
> > + priority = reclaim->priority;
> >
> > - mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
> > - iter = &mz->reclaim_iter[reclaim->priority];
> > - if (prev && reclaim->generation != iter->generation) {
> > - iter->last_visited = NULL;
> > - goto out_unlock;
> > - }
> > -
> > - last_visited = mem_cgroup_iter_load(iter, root, &seq);
> > - }
> > -
> > - memcg = __mem_cgroup_iter_next(root, last_visited);
> > + do {
> > + pos = ACCESS_ONCE(mz->reclaim_iter[priority]);
> > + } while (pos && !css_tryget(&pos->css));
>
> This is a bit confusing. AFAIU css_tryget fails only when the current
> ref count is zero already. When do we keep cached memcg with zero count
> behind? We always do css_get after cmpxchg.
>
> Hmm, there is a small window between cmpxchg and css_get when we store
> the current memcg into the reclaim_iter[priority]. If the current memcg
> is root then we do not take any css reference before cmpxchg and so it
> might drop down to zero in the mean time so other CPU might see zero I
> guess. But I do not see how css_get after cmpxchg on such css works.
> I guess I should go and check the css reference counting again.

It's not about root or the newly stored memcg. You might read the
position right before someone else replaces it and does the css_put()
on it, at which point the css_tryget() may fail and you have to reload
the position.

I'll add a comment.
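
To illustrate the reload loop, here is a minimal userspace sketch, not
the kernel code. It assumes C11 atomics in place of css refcounting, and
struct obj, obj_tryget(), obj_put(), cached_pos, load_position() and
demo_stale_read() are all hypothetical names. The objects are static so
the stale pointer stays valid; in the patch the loop runs under
rcu_read_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a css: just a reference count. */
struct obj {
	atomic_int refcnt;
};

/* Like css_tryget(): take a reference unless the count already hit zero. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcnt, 1);

	assert(old > 0);
}

/* Shared cached position, standing in for mz->reclaim_iter[priority]. */
static _Atomic(struct obj *) cached_pos;

/* Reload the position until the tryget succeeds or the slot is empty. */
static struct obj *load_position(void)
{
	struct obj *pos;

	do {
		pos = atomic_load(&cached_pos);
	} while (pos && !obj_tryget(pos));

	return pos;
}

/* Replay the race in a single thread; returns 1 if it went as described. */
int demo_stale_read(void)
{
	static struct obj a, b;
	struct obj *stale, *pos;
	bool got;

	atomic_store(&a.refcnt, 1);	/* reference held by the cached slot */
	atomic_store(&b.refcnt, 1);
	atomic_store(&cached_pos, &a);

	stale = atomic_load(&cached_pos);	/* reader sees a ... */
	atomic_store(&cached_pos, &b);		/* ... a walker replaces it ... */
	obj_put(&a);				/* ... and drops the old ref */

	got = obj_tryget(stale);	/* fails: the count is already zero */
	pos = load_position();		/* the reload finds b and pins it */

	return !got && pos == &b && atomic_load(&b.refcnt) == 2;
}
```

Calling demo_stale_read() walks through exactly the window in question:
the first tryget on the stale pointer fails, and the reload picks up the
replacement.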