On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@xxxxxxxxxx> wrote:Maybe, but I prefer to keep them separate for now.
On 11/6/23 15:07, Yosry Ahmed wrote:Would the diff be simpler if patches 1 & 2 were squashed?
On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@xxxxxxxxxx> wrote:I agree. The diff is really hard to look at. It will be easier to apply
The current design of cgroup_rstat_cpu_pop_updated() is to traverseI don't know why git decided to show this diff in the most confusing
the updated tree in a way to pop out the leaf nodes first before
their parents. This can cause traversal of multiple nodes before a
leaf node can be found and popped out. IOW, a given node in the tree
can be visited multiple times before the whole operation is done. So
it is not very efficient and the code can be hard to read.
With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed first before any flushing operation is being
done, we can optimize the way the updated tree nodes are being popped
by pushing the parents first to the tail end of the list before their
children. In this way, most updated tree nodes will be visited only
once with the exception of the subtree root as we still need to go
back to its parent and popped it out of its updated_children list.
This also makes the code easier to read.
A parallel kernel build on a 2-socket x86-64 server is used as the
benchmarking tool for measuring the lock hold time. Below were the lock
hold time frequency distribution before and after the patch:
Hold time Before patch After patch
--------- ------------ -----------
0-01 us 13,738,708 14,594,545
01-05 us 1,177,194 439,926
05-10 us 4,984 5,960
10-15 us 3,562 3,543
15-20 us 1,314 1,397
20-25 us 18 25
25-30 us 12 12
It can be seen that the patch pushes the lock hold time towards the
lower end.
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
way possible.
the patch & looks at the actual rstat.c file.
[..]
Yes you are right. It's definitely not free, but it's also not superThanks for the comment. I will rework the comment to clarify that a bit*IIUC updated_children and updated_next is the same list.
* The only ordering guarantee is that, for a parent and a child pair
- * covered by a given traversal, if a child is visited, its parent is
- * guaranteed to be visited afterwards.
+ * covered by a given traversal, the child is before its parent in
+ * the list.
+ *
+ * Note that updated_children is self terminated while updated_next is
+ * parent cgroup terminated except the cgroup root which can be self
+ * terminated.
updated_children is the head, and updated_next is how the list items
are linked. This comment makes it seem like they are two different
lists.
more.
I am actually wondering if it's worth using the singly linked listIt will cost an additional 16 bytes of percpu memory if converted to
here. We are saving 8 bytes percpu, but the semantics are fairly
confusing. Wouldn't this be easier to reason about if you just use
list_head?
updated_children would be replaced with LIST_HEAD (or similar), and
the list would be NULL terminated instead of terminated by self/parent
cgroup. IIUC the reason it's not NULL-terminated now is because we use
cgroup->updated_next to check quickly if a cgroup is on the list or
not. If we use list_heads, we can just use list_emtpy() IIUC.
We can also simplify the semantics of unlinking @root from the updated
tree below, it would just be list_del() IIUC, which is actually more
performant as well. It seems like overall we would simplify a lot of
things. When forming the updated_list, we can just walk the tree and
splice the lists in the correct order.
It seems to me that saving 8 bytes percpu is not worth the complexity
of the custom list semantics here. Am I missing something here?
list_heads. Like other lists, there will be sibling and children
list_heads. There are also 2 pointers to update instead of one. Anyway,
I don't have an objection to convert them to list_heads if agreed by Tejun.
costly. It's just that every time I look at the rstat code I need to
remind myself of how updated_next and updated_children work. I will
let Tejun decide.