Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal

From: Glauber Costa
Date: Fri Mar 16 2012 - 06:22:54 EST




I thought about this a little yesterday. Currently my idea is to apply the
following rules to res_counter.

1. All res_counters are hierarchical, but the behavior should be optimized.

2. If the parent res_counter has an UNLIMITED limit, 'usage' will not be propagated
to its parent at _charge_ time.

That doesn't seem to make much sense. If you are unlimited but your parent is
limited, it has a lot more interest in knowing about the charge than you do.


Sorry, I should have written 'if all ancestors are unlimited'.
If the parent is limited, the children should be treated as limited.

So the
logic should rather be the opposite: Don't go around getting locks and
all that if you are unlimited. Your parent might, though.

I am trying to experiment a bit with billing to percpu counters for
unlimited res_counters. But their inexact nature is giving me quite a
headache.



Personally, I think a percpu counter is not the best choice. Yes, it will work,
but because of the nature of its error range, it has a scalability problem.
Consider a tree like

/A/B/Guest0/tasks
     Guest1/tasks
     Guest2/tasks
     Guest4/tasks
     Guest5/tasks
     ......

A percpu res_counter may scale well at the GuestX level but will contend at
level B. And I don't want to think about what happens on a 256-CPU system; the
error at B will be very big.

Usually the recommendation in this case is to follow percpu_counter_read() with
percpu_counter_sum().

If we additionally wrap the updaters in rcu_read_lock(), we can sum it up when needed with relative confidence. It does have performance scalability problems in big systems, but no bigger than we have today.

The advantage of percpu counters is that we don't need to think in terms of "unlimited", but rather in terms of "close enough to the limit".

Here is an excerpt of a patch I am experimenting with:

	rcu_read_lock();
	usage = percpu_counter_read(&c->usage_pcp);

	if (usage + val <
	    c->limit + num_online_cpus() * percpu_counter_batch) {
		percpu_counter_add(&c->usage_pcp, val);
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();

	raw_spin_lock(&c->usage_pcp.lock);
	usage = __percpu_counter_sum_locked(&c->usage_pcp);

	if (usage + val > c->limit) {
		c->failcnt++;
		ret = -ENOMEM;
		goto out;
	}

	usage += val;
	c->usage_pcp.count = usage;
	if (usage > c->usage_pcp.max)
		c->usage_pcp.max = usage;

out:
	raw_spin_unlock(&c->usage_pcp.lock);
	return ret;


So we probe the current counter, and if we are unlimited, or not close to the limits, we update it per-cpu, and let it go.

If we believe we're in a suspicious area, we take the lock (as a quick hack, I am holding the percpu lock directly), and then do everything under the lock.

In the unlimited case, we'll always be writing to the percpu storage.

Note, also, that this is always either under a spinlock, or rcu marked area.

This can also possibly be improved by an infrequent slow-path update: once we start reading from percpu_counter_sum(), we flip a res_counter bit to mark that, and from then on we deal with usage_pcp.count directly, without any percpu. This would work exactly like res_counters do today, so it is kind of a fallback mode.

For readers, a combination of synchronize_rcu() + spin_lock() on the reader side should be enough to guarantee that the result of percpu_counter_sum() is reliable enough.

Also please note that our read side these days is not that good either: because of the way we do caching in memcg, we can end up in the *exact* situation as this proposal. If the same memcg is being updated on all cpus, we can have up to 32 * nr_cpus() pages in the cache that are not really used by memcg.

I actually have patches to force a drain_all_caches_sync() before reads, but I am holding them waiting for something to happen in this discussion.



Another idea is to lend resources from memcg to the tasks, i.e. have per-task
caching of charges. But it has two problems: draining unused resources is
difficult, and the precise usage is unknown.

IMHO, hard-limited resource counter itself may be a problem ;)
Yes, it is.

Indeed, if percpu charging as I described above is still not acceptable, our way out of this may be heavier caching. Maybe we need a per-level cache, or something like that.
Per-task may get tricky, especially when we start having charges that can't be directly attributed to a task, like slab objects.

So an idea comes to my mind: 'if all ancestors are unlimited, don't propagate
charges.' With this, people who use resources in a flat layout (but with a
hierarchical cgroup tree) will not see any performance problem.



3. If a res_counter has an UNLIMITED limit, then reading its usage must visit
all children and return their sum.

Then,
/cgroup/
    memory/  (unlimited)
        libvirt/ (unlimited)
            qemu/    (unlimited)
                guest/   (limited)

All dirs can show hierarchical usage, and the guest will not have
any lock contention at runtime.

If we are okay with summing it up at read time, we may as well
keep everything in percpu counters at all times.



If all ancestors are unlimited, we don't need to propagate usage upwards at
charge time. If one of the ancestors is limited, we need to propagate and
check usage at charge time.

The only problem is that "one of ancestors is limited" is expected to be quite a common case. So by optimizing for unlimited, I think we're tricking ourselves into believing we're solving anything, when in reality we're not.




- memory.use_hierarchy should become obsolete?
If we're going fully hierarchical, yes.


Another big problem is 'when' we should make this change...
Maybe this 'hierarchical' problem will be a good topic at the MM summit.

we're in no shortage of topics! =)