Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()

From: Yosry Ahmed
Date: Mon Nov 06 2023 - 15:08:11 EST


On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> the updated tree in a way to pop out the leaf nodes first before
> their parents. This can cause traversal of multiple nodes before a
> leaf node can be found and popped out. IOW, a given node in the tree
> can be visited multiple times before the whole operation is done. So
> it is not very efficient and the code can be hard to read.
>
> With the introduction of cgroup_rstat_updated_list() to build a list
> of cgroups to be flushed first before any flushing operation is being
> done, we can optimize the way the updated tree nodes are being popped
> by pushing the parents first to the tail end of the list before their
> children. In this way, most updated tree nodes will be visited only
> once with the exception of the subtree root as we still need to go
> back to its parent and popped it out of its updated_children list.
> This also makes the code easier to read.
>
> A parallel kernel build on a 2-socket x86-64 server is used as the
> benchmarking tool for measuring the lock hold time. Below were the lock
> hold time frequency distribution before and after the patch:
>
> Hold time Before patch After patch
> --------- ------------ -----------
> 0-01 us 13,738,708 14,594,545
> 01-05 us 1,177,194 439,926
> 05-10 us 4,984 5,960
> 10-15 us 3,562 3,543
> 15-20 us 1,314 1,397
> 20-25 us 18 25
> 25-30 us 12 12
>
> It can be seen that the patch pushes the lock hold time towards the
> lower end.
>
> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> ---

I don't know why git decided to show this diff in the most confusing
way possible.

> kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
> 1 file changed, 70 insertions(+), 62 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 1f300bf4dc40..d2b709cfeb2a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> }
>
> /**
> - * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
> - * @pos: current position
> - * @root: root of the tree to traversal
> + * cgroup_rstat_push_children - push children cgroups into the given list
> + * @head: current head of the list (= parent cgroup)
> + * @prstatc: cgroup_rstat_cpu of the parent cgroup
> * @cpu: target cpu
> + * Return: A new singly linked list of cgroups to be flush
> *
> - * Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts
> - * the traversal and %NULL return indicates the end. During traversal,
> - * each returned cgroup is unlinked from the tree. Must be called with the
> - * matching cgroup_rstat_cpu_lock held.
> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
> + * parent first before its children. The parent is pushed by the caller.

I think it might be useful here (and elsewhere in the patch) where
"push" is being used to elaborate that we push to the beginning in a
stack-like fashion.

> + * The recursion depth is the depth of the current updated tree.
> + */
> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
> + struct cgroup_rstat_cpu *prstatc, int cpu)
> +{
> + struct cgroup *child, *parent;
> + struct cgroup_rstat_cpu *crstatc;
> +
> + parent = head;
> + child = prstatc->updated_children;
> + prstatc->updated_children = parent;
> +
> + /* updated_next is parent cgroup terminated */
> + while (child != parent) {
> + child->rstat_flush_next = head;
> + head = child;
> + crstatc = cgroup_rstat_cpu(child, cpu);
> + if (crstatc->updated_children != parent)

I think cgroup->updated_children is set to the cgroup itself if it's
empty, right? Shouldn't this be crstatc->updated_children != child?

> + head = cgroup_rstat_push_children(head, crstatc, cpu);
> + child = crstatc->updated_next;
> + crstatc->updated_next = NULL;
> + }
> + return head;
> +}
> +
> +/**
> + * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
> + * @root: root of the cgroup subtree to traverse
> + * @cpu: target cpu
> + * Return: A singly linked list of cgroups to be flushed
> + *
> + * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
> + * each returned cgroup is unlinked from the updated tree. Must be called
> + * with the matching cgroup_rstat_cpu_lock held.

This function takes care of holding the lock actually. I think that
sentence should be applied to cgroup_rstat_push_children() above?

> *
> * The only ordering guarantee is that, for a parent and a child pair
> - * covered by a given traversal, if a child is visited, its parent is
> - * guaranteed to be visited afterwards.
> + * covered by a given traversal, the child is before its parent in
> + * the list.
> + *
> + * Note that updated_children is self terminated while updated_next is
> + * parent cgroup terminated except the cgroup root which can be self
> + * terminated.

IIUC updated_children and updated_next is the same list.
updated_children is the head, and updated_next is how the list items
are linked. This comment makes it seem like they are two different
lists.

I am actually wondering if it's worth using the singly linked list
here. We are saving 8 bytes percpu, but the semantics are fairly
confusing. Wouldn't this be easier to reason about if you just use
list_head?

updated_children would be replaced with LIST_HEAD (or similar), and
the list would be NULL terminated instead of terminated by self/parent
cgroup. IIUC the reason it's not NULL-terminated now is because we use
cgroup->updated_next to check quickly if a cgroup is on the list or
not. If we use list_heads, we can just use list_emtpy() IIUC.

We can also simplify the semantics of unlinking @root from the updated
tree below, it would just be list_del() IIUC, which is actually more
performant as well. It seems like overall we would simplify a lot of
things. When forming the updated_list, we can just walk the tree and
splice the lists in the correct order.

It seems to me that saving 8 bytes percpu is not worth the complexity
of the custom list semantics here. Am I missing something here?