Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

From: Waiman Long
Date: Tue Jul 12 2016 - 14:51:45 EST


On 07/12/2016 10:27 AM, Tejun Heo wrote:
Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
The percpu APIs are extensively used in the Linux kernel to reduce
cacheline contention and improve performance. For some use cases, the
percpu APIs may be too fine-grained for distributed resources whereas
a per-node based allocation may be too coarse as we can have dozens
of CPUs in a NUMA node in some high-end systems.

This patch introduces simple per-subnode APIs where each of the
distributed resources will be shared by only a handful of CPUs within
a NUMA node. The per-subnode APIs are built on top of the percpu APIs
and hence require the same amount of memory as if the percpu APIs
were used. However, they help to reduce the total number of separate
resources that need to be managed. As a result, they can speed up code
that needs to iterate over all the resources compared with using the
percpu APIs. Cacheline contention, however, will increase slightly as
each resource is shared by more than one CPU. As long as the number of
CPUs in each subnode is small, the performance impact won't be
significant.

In this patch, at most 2 sibling groups can be put into a subnode. For
an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
and 2 when it is not.

I understand that there's a trade-off between local access and global
traversal and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary. What's the use case? What are the
numbers? Why are global traversals frequent enough to matter so much?

The last 2 RFC patches were created in response to Andi's comment asking for a coarser granularity than per-cpu. In this particular use case, I don't think global list traversals are frequent enough to have any noticeable performance impact, so I don't have any benchmark numbers to support this change. However, that may not be true for other future use cases.

These 2 patches were created to gauge whether using a per-subnode API for this use case is a good idea. I am perfectly happy to keep it as per-cpu and scrap the last 2 RFC patches. My main goal is to make this patchset acceptable enough to move forward instead of staying in limbo.

Cheers,
Longman