On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
memory.stat and other files already consider subtrees in their output,
and we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting
with cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0"
at non-leaf memcg nodes, which frequently causes people to misdiagnose
breach behaviour. The same confusion applies to other counters in this
file when debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
I agree with the consistency argument (matching cgroup.events, ...),
and it's definitely looks better for oom* events, but at the same time it feels
like a API break.
Just for example, let's say you have a delegated sub-tree with memory.max
set. Earlier, getting memory.high/max event meant that the whole sub-tree
is tight on memory, and, for example, led to shutdown of some parts of the tree.
After your change, it might mean that some sub-cgroup has reached its limit,
and probably doesn't matter on the top level.
Maybe it's still ok, but we definitely need to document it better. It feels
bad that different versions of the kernel will handle it differently, so
the userspace has to workaround it to actually use these events.
Also, please, make sure that it doesn't break memcg kselftests.
We don't have memory.events file for the root cgroup, so we can stop earlier.