Re: Cgroup memory barrier usage and call frequency from scheduler

From: Tejun Heo
Date: Thu Apr 09 2020 - 13:56:28 EST


Hello, Mel.

On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:
> Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
> cgroup_rstat_updated() race window") introduced two full memory
> barriers to close a race. The one in cgroup_rstat_updated can be
> called at a high frequency from the scheduler from update_curr ->
> cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
> not sure how closely this was looked at. cgroup_rstat_updated shows up
> in profiles of netperf UDP_STREAM accounting for about 1% of overhead

Oops, that's pretty high.

> which doesn't sound a lot but that's about the same weight as some of
> the critical network paths. I have three questions about the patch
>
> 1. Why were full barriers used?

Given

  CPU 0 (updater)         CPU 1 (flusher)
  A: store                C: store
  --- full barrier ---    --- full barrier ---
  B: load                 D: load

the code is trying to guarantee that either B sees C or D sees A, so it does
need full ordering.
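
To make that concrete, here's a minimal userspace C11 sketch of the same
pattern (illustrative names, not the actual rstat code; seq_cst fences
stand in for the kernel's smp_mb()):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int stat_val;     /* the per-cpu stat being updated  */
static atomic_int on_list = 1;  /* "already on the updated list"   */
static int r_updater, r_flusher;

static void *updater(void *arg)
{
        /* A: publish the stat update */
        atomic_store_explicit(&stat_val, 1, memory_order_relaxed);
        /* full barrier, like the smp_mb() in cgroup_rstat_updated() */
        atomic_thread_fence(memory_order_seq_cst);
        /* B: check whether the flusher took us off the list */
        r_updater = atomic_load_explicit(&on_list, memory_order_relaxed);
        return NULL;
}

static void *flusher(void *arg)
{
        /* C: take the group off the updated list */
        atomic_store_explicit(&on_list, 0, memory_order_relaxed);
        /* full barrier, like the smp_mb() on the flush side */
        atomic_thread_fence(memory_order_seq_cst);
        /* D: read the stat value being flushed */
        r_flusher = atomic_load_explicit(&stat_val, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, updater, NULL);
        pthread_create(&t2, NULL, flusher, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /*
         * With the seq_cst fences, the both-miss outcome
         * (r_updater == 1 && r_flusher == 0) is forbidden: either
         * B sees C or D sees A, so the update can't be lost by
         * both sides at once.
         */
        printf("updater saw on_list=%d, flusher saw stat=%d\n",
               r_updater, r_flusher);
        return 0;
}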

> 2. Why was it important that the data race be closed when the inaccuracy
> is temporary?

There was a pending patchset which converted memcg to use rstat, and the
conversion included event counters which needed to be synchronous (e.g. for
things like oom kill counts). That patchset didn't make it through due to its
percpu memory overhead at the time. The memory overhead issue can be resolved
now, but in the meantime memcg got improved in a different way which made the
rstat conversion less pressing, so this fell through the cracks. In
retrospect, the patch shouldn't have been committed on its own, or at least
the synchronous and the pure state-update paths should have been kept
separate.

> 3. Why is it called from the context of update_curr()?

It's just being called from the path which updates sched statistics.

> For 1, the use of a full barrier seems unnecessary when it appears that
> you could have used a read barrier and a write barrier. The following
> patch drops the profile overhead to 0.1%

I'm not sure this is correct but that's kinda irrelevant.
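
FWIW, the reason I doubt it: both sides are a store followed by a load, and
a write barrier only orders stores against stores while a read barrier only
orders loads against loads, so neither constrains that store->load pair. In
the sketch above, the closest C11 analogues would be drop-in replacements
for updater()/flusher() (again illustrative, not your actual patch):

static void *updater_weak(void *arg)
{
        atomic_store_explicit(&stat_val, 1, memory_order_relaxed);
        /*
         * Roughly smp_wmb(): orders stores against stores, but does
         * not order the store above against the load below.
         */
        atomic_thread_fence(memory_order_release);
        r_updater = atomic_load_explicit(&on_list, memory_order_relaxed);
        return NULL;
}

static void *flusher_weak(void *arg)
{
        atomic_store_explicit(&on_list, 0, memory_order_relaxed);
        /*
         * Roughly smp_rmb(): orders loads against loads, likewise no
         * store->load ordering.
         */
        atomic_thread_fence(memory_order_acquire);
        r_flusher = atomic_load_explicit(&stat_val, memory_order_relaxed);
        return NULL;
}

With those fences, both loads may be satisfied before either store becomes
visible (the classic store-buffering pattern), so the both-miss outcome
becomes legal again. But as I said, it doesn't matter much given the revert
below.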

> For 2, the changelog says the barriers are necessary because "we plan to use
> rstat to track counters which need to be accurate". That is a bit vague.
> Under what circumstances is a transient inaccuracy a serious enough
> problem to justify additional barriers in the scheduler?

Hope this is explained now.

> For 3, update_curr() is called from a lot of places, some of which are
> quite hot -- e.g. task enqueue/dequeue. This is necessary information from
> the runqueue needs to be preserved. However, it's less clear that the cpu
> accounting information needs to be up to date on this granularity although
> it might be related to question 2. Why was the delta_exec not similarly
> accumulated in cpuacct_change() and defer the hierarchical update to
> be called from somewhere like entity_tick()? It would need tracking the
> CPU time at the last update as delta_exec would be lost so it's not very
> trivial but it does not look like it would be overly complicated.

Most likely historic. The code has been there for a long time and the only
recent changes were plumbing around them. Nothing in cpuacct needs to be
per-scheduling-event accurate, so yeah, for the longer term, it'd be a good
idea to move them out of the hot path.
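
Roughly, the shape would be something like this userspace sketch (made-up
names, not the actual cpuacct API): the hot path only bumps a per-CPU
pending counter, and a periodic path folds the pending deltas into the
shared total.

#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 8

struct group_stats {
        /* written only by the owning CPU on the hot path */
        uint64_t pending[NR_CPUS];
        /* hierarchical total, updated only on the slow path */
        atomic_uint_least64_t total;
};

/* hot path, per scheduling event: local accumulation only */
static void account_delta(struct group_stats *gs, int cpu,
                          uint64_t delta_exec)
{
        gs->pending[cpu] += delta_exec;
}

/*
 * Slow path, e.g. once per tick: fold pending into the total.  In
 * the kernel this would run on the owning CPU or use this_cpu ops;
 * the cross-CPU races are glossed over here.
 */
static void flush_cpu(struct group_stats *gs, int cpu)
{
        uint64_t delta = gs->pending[cpu];

        if (!delta)
                return;
        gs->pending[cpu] = 0;
        atomic_fetch_add_explicit(&gs->total, delta,
                                  memory_order_relaxed);
}

As you say, delta_exec needs to be folded into the pending counter on every
update so nothing is lost between ticks, but the hierarchy walk drops out
of update_curr() entirely.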

For now, I'll revert the patch. Nothing in tree needs that right now. If the
need for synchronous counting comes back later, I'll make that a separate
path.

Thanks.

--
tejun