Re: [PATCH v5 08/24] sched: Introduce per memory space current virtual cpu id

From: Mathieu Desnoyers
Date: Tue Nov 08 2022 - 15:07:41 EST


On 2022-11-08 08:04, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:

> > The credit goes to Paul Turner (Google) for the vcpu_id idea. This
> > feature is implemented based on the discussions with Paul Turner and
> > Peter Oskolkov (Google), but I took the liberty to implement scheduler
> > fast-path optimizations and my own NUMA-awareness scheme. Rumor has it
> > that Google has been running an rseq vcpu_id extension internally in
> > production for a year. The tcmalloc source code indeed has comments
> > hinting at a vcpu_id prototype extension to the rseq system call [1].

> Re NUMA thing -- that means that on a 512 node system a single threaded
> task can still observe 512 separate vcpu-ids, right?

Yes, that's correct.


> Also, said space won't be dense.

Indeed, this can be inefficient if the data structure within the single-threaded task is not NUMA-aware *and* the task is free to bounce across all 512 NUMA nodes.


> The main selling point of the whole vcpu-id scheme was that the id space
> is dense and not larger than min(nr_cpus, nr_threads), which then gives
> useful properties.
> 
> But I'm not at all seeing how the NUMA thing preserves that.

If a userspace per-vcpu data structure is implemented with NUMA-local allocations, then it becomes really valuable, for performance reasons, to guarantee that per-vcpu-id accesses are always NUMA-local.
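As a rough sketch (assuming a NUMA-aware vcpu id such as the vm_numa_vcpu_id field proposed below, and a purely illustrative id-to-node mapping of id % nr_nodes; the real mapping would be whatever the kernel's NUMA-aware allocation scheme guarantees), such a structure could look like:

#include <numa.h>
#include <stdlib.h>

struct per_vcpu_data {
	long counter;
	char pad[64 - sizeof(long)];	/* pad to a cache line */
};

static struct per_vcpu_data **vcpu_table;

/* Back each vcpu-id slot with memory on that id's home node. */
static int init_vcpu_table(int max_vcpus, int nr_nodes)
{
	vcpu_table = calloc(max_vcpus, sizeof(*vcpu_table));
	if (!vcpu_table)
		return -1;
	for (int id = 0; id < max_vcpus; id++) {
		int node = id % nr_nodes;	/* illustrative mapping */

		vcpu_table[id] = numa_alloc_onnode(sizeof(**vcpu_table),
						   node);
		if (!vcpu_table[id])
			return -1;
	}
	return 0;
}

Indexing this table with a NUMA-aware vcpu id then always hits node-local memory; indexing it with a flat vcpu id gives no such guarantee.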

If a userspace per-vcpu data structure is not NUMA-aware, then we have two scenarios:

A) The cpuset/sched affinity under which it runs pins it to a set of cores belonging to a specific NUMA node. In this case, even with NUMA-aware vcpu id allocation, the ids will stay as densely packed near 0 as they would with non-NUMA-aware allocation.

B) No specific cpuset/sched affinity is set, which means the task is free to bounce all over. In this case I agree that making the indexing NUMA-aware while the per-vcpu data structure is not NUMA-aware is inefficient.

I wonder whether a 512-node system running containers that use few cores, but without cpusets/sched affinity pinning the workload to specific NUMA nodes, is a workload we should optimize for? The lack of NUMA locality there looks like a userspace configuration issue: nothing restricts the set of allowed cores.
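For what it's worth, such pinning is cheap to do from userspace, e.g. with numactl --cpunodebind=0, or programmatically with libnuma:

#include <numa.h>
#include <stdio.h>

static void pin_to_node0(void)
{
	/* Restrict the current task to node 0's CPUs, so the
	 * NUMA-aware vcpu ids it observes stay densely packed
	 * (scenario A above). */
	if (numa_run_on_node(0) < 0)
		perror("numa_run_on_node");
}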

We must also keep in mind that a single task can load a mix of executables/shared libraries where some pieces are NUMA-aware and others are not. This means we should ideally support both a NUMA-aware and a non-NUMA-aware vcpu-id allocation scheme within the same task.

This could be achieved by exposing two struct rseq fields rather than one, e.g.:

vm_vcpu_id -> flat indexing, not NUMA-aware.
vm_numa_vcpu_id -> NUMA-aware vcpu id indexing.

This would allow data structures that are inherently NUMA-aware to benefit from NUMA locality, without hurting non-NUMA-aware data structures.
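Concretely, the extended struct rseq could look as follows (field names as proposed above; the layout is only illustrative, not an ABI commitment):

#include <linux/types.h>

struct rseq {
	__u32 cpu_id_start;
	__u32 cpu_id;
	__u64 rseq_cs;		/* simplified; a union in the UAPI header */
	__u32 flags;
	__u32 vm_vcpu_id;	/* flat indexing, dense, not NUMA-aware */
	__u32 vm_numa_vcpu_id;	/* NUMA-aware vcpu id indexing */
} __attribute__((aligned(4 * sizeof(__u64))));

A NUMA-aware library reads vm_numa_vcpu_id, everything else keeps using vm_vcpu_id, and both can coexist within the same task.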


> Also; given the utter mind-bendiness of the NUMA thing; should it go
> into its own patch; introduce the regular plain old vcpu first, and
> then add things to it -- that also allows pushing those weird cpumask
> ops you've created later into the series.

Good idea. I can do that once we agree on the way forward for the flat vs NUMA-aware vcpu-id rseq fields.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com