Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock

From: Michal Hocko
Date: Tue Aug 09 2011 - 03:26:22 EST


On Tue 09-08-11 01:19:12, Johannes Weiner wrote:
> On Mon, Aug 08, 2011 at 11:47:04PM +0200, Michal Hocko wrote:
> > On Mon 08-08-11 20:47:38, Johannes Weiner wrote:
> > [...]
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
> > > > #define FLUSHING_CACHED_CHARGE (0)
> > > > };
> > > > static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > > > -static DEFINE_MUTEX(percpu_charge_mutex);
> > > >
> > > > /*
> > > > * Try to consume stocked charge on this cpu. If success, one page is consumed
> > > > @@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> > > >
> > > > for_each_online_cpu(cpu) {
> > > > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > > > - if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > + if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > > > + test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > flush_work(&stock->work);
> > > > }
> > > > out:
> > >
> > > This hunk triggers a crash for me, as the draining is already done and
> > > stock->cached reset to NULL when dereferenced here. Oops is attached.
> >
> > Thanks for catching this. We are racing synchronous drain from
> > force_empty and async drain from reclaim, I guess.
>
> It's at the end of a benchmark where several tasks delete the cgroups.
> There is no reclaim going on anymore, it must be several sync drains
> from force_empty of different memcgs.

I am afraid it is worse than that. Sync. drainer can kick itself really
trivial just by doing local draining. Then stock->cached is guaranteed
to be NULL when we reach this...

[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4ec4e7..626c916 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2197,8 +2197,10 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >
> > for_each_online_cpu(cpu) {
> > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > - if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > - test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > + struct mem_cgroup *mem = stock->cached;
> > + if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
> > + mem && mem_cgroup_same_or_subtree(root_mem, mem)
> > + )
> > flush_work(&stock->work);
> > }
> > out:
>
> This ordering makes sure that mem is a sensible pointer, but it still
> does not pin the object, *mem, which could go away the femtosecond
> after the test_bit succeeds.

Yes there are possible races
CPU0 CPU1 CPU2
mem = stock->cached
stock->cached = NULL
clear_bit
test_and_set_bit
test_bit()
<preempted> ...
mem_cgroup_destroy
use after free

The race is not very probable because we are doing quite a bunch of work
before we can deallocate mem. mem_cgroup_destroy is called after
synchronize_rcu so we can close the race by rcu_read_lock, right?


[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4ec4e7..eca46141 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2179,17 +2179,23 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> > struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > struct mem_cgroup *mem;
> >
> > + /* Try to lock the cache */
> > + if(test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > + continue;
> > +
> > mem = stock->cached;
> > if (!mem || !stock->nr_pages)
> > - continue;
> > + goto unlock_cache;
> > if (!mem_cgroup_same_or_subtree(root_mem, mem))
> > - continue;
> > - if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > - if (cpu == curcpu)
> > - drain_local_stock(&stock->work);
> > - else
> > - schedule_work_on(cpu, &stock->work);
> > - }
> > + goto unlock_cache;
>
> So one thread locks the cache, recognizes stock->cached is not in the
> right hierarchy and unlocks again. While the cache was locked, a
> concurrent drainer with the right hierarchy skipped the stock because
> it was locked. That doesn't sound right.

You are right. We have to make it less parallel.

>
> But yes, we probably need exclusive access to the stock state.
>
> > +
> > + if (cpu == curcpu)
> > + drain_local_stock(&stock->work);
> > + else
> > + schedule_work_on(cpu, &stock->work);
> > + continue;
> > +unlock_cache:
> > + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> >
> > ^^^^^
> > need a barrier?
> > }
> >
> > if (!sync)
> >
> > > Without the mutex serializing this code, can't there be a concurrent
> > > execution that leads to stock->cached being drained, becoming empty
> > > and freed by someone else between the stock->nr_pages check and the
> > > ancestor check, resulting in use after free?
> > >
> > > What makes stock->cached safe to dereference?
> >
> > We are using FLUSHING_CACHED_CHARGE as a lock for local draining. I
> > guess it should be sufficient.
> >
> > mutex which was used previously caused that async draining was exclusive
> > so a root_mem that has potentially many relevant caches has to back off
> > because other mem wants to clear the cache on the same CPU.
>
> It's now replaced by what is essentially a per-stock bit-spinlock that
> is always trylocked.

Yes.

>
> Would it make sense to promote it to a real spinlock? Draining a
> stock is pretty fast, there should be minimal lock hold times, but we
> would still avoid that tiny race window where we would skip otherwise
> correct stocks just because they are locked.

Yes, per-stock spinlock will be easiest way because if tried other games
with atomics we would endup with a more complicated refill_stock which
can race with draining as well.

> > I will think about this tomorrow (with fresh eyes). I think we should be
> > able to be without mutex.
>
> The problem is that we have a multi-op atomic section, so we can not
> go lockless. We can read the stock state just fine, and order
> accesses to different members so that we get a coherent image. But
> there is still nothing that pins the charge to the memcg, and thus
> nothing that stabilizes *stock->cached.
>
> I agree that we can probably do better than a global lock, though.

Fully agreed. I will send a patch after I give it some testing.

Thanks!
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/