Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket

From: Suravee Suthikulpanit
Date: Tue Jun 27 2017 - 16:26:34 EST


On 6/28/17 00:44, Borislav Petkov wrote:
> So let's try to discuss this without using DIE sched-domain, CCX, etc,
> and let's start simple.
>
> So in that die graphic:
>
>     ----------------------------
> C0  | T0 T1 |    ||    | T0 T1 | C4
>     --------|    ||    |--------
> C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>     --------|    ||    |--------
> C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>     --------|    ||    |--------
> C3  | T0 T1 |    ||    | T0 T1 | C7
>     ----------------------------
>
> you want all those threads to belong to a single scheduling group.
> Correct?

Actually, let's be a bit more specific here, since sched-group and sched-domain mean different things:

(From: Documentation/scheduler/sched-domains.txt)
---- begin snippet ----
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.

Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
---- end snippet ----

So, from the definition above, we would like all those 16 threads to be in the same sched-domain, where threads from C0,1,2,3 are in the same sched-group, and threads in C4,5,6,7 are in another sched-group.
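
As a rough illustration of the two MUST rules in the snippet (the groups cover the domain's span and do not overlap), here is a small Python sketch of the layout described above. It is not kernel code: the helper names are made up, and the CPU numbering (core Cn owning threads 2n and 2n+1) is an assumption chosen only to keep the example short.

```python
# Illustration only: plain Python sets, not kernel data structures.
# Assumed (hypothetical) numbering: core Cn owns threads 2n and 2n+1.

def core_threads(core):
    """Return the two SMT thread ids of a core under the assumed numbering."""
    return {2 * core, 2 * core + 1}

def union_of(cores):
    """Return the set of all thread ids belonging to the given cores."""
    cpus = set()
    for core in cores:
        cpus |= core_threads(core)
    return cpus

def check_groups(span, groups):
    """Check that the groups cover the span exactly and are pairwise disjoint."""
    covered = set().union(*groups)
    if covered != span:
        return False
    # If any CPU sat in two groups, the sizes would add up to more than the union.
    return sum(len(g) for g in groups) == len(covered)

die_span = union_of(range(8))      # all 16 threads of the die
group0 = union_of([0, 1, 2, 3])    # threads behind L3 #0
group1 = union_of([4, 5, 6, 7])    # threads behind L3 #1
die_groups = [group0, group1]

print(len(die_span), check_groups(die_span, die_groups))
```

With two groups of 8 threads each, both rules hold for a 16-thread domain.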

> Now that thing has a memory controller attached to it, correct?

Yes

> If so, why is this thing not a logical NUMA node, as described in
> SRAT/SLIT?

Yes, this thing is a logical NUMA node and is represented correctly in the SRAT/SLIT.

> Now, SRAT should contain the assignment which core belongs to which
> node. Why is that not sufficient?

Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell the scheduler which cpus are within a NUMA node.

However, looking at the current sched-domains below, notice that there is no sched-domain with 16 threads to represent a NUMA node:

cpu0
  domain0 00000000,00000001,00000000,00000001 (SMT)
  domain1 00000000,0000000f,00000000,0000000f (MC)
  domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
  domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
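
To make the masks above easier to read, here is a small Python sketch that decodes them into CPU lists. It assumes the usual cpumask print layout (comma-separated 32-bit hex words, zero padded, most significant word first); the helper names are made up for this example.

```python
def mask_to_cpus(mask):
    """Decode a comma-separated hex cpumask (most significant word first)."""
    # Assumes every word is zero padded to 8 hex digits, as in the dump above.
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]

def ranges(cpus):
    """Collapse a sorted, non-empty CPU list into a string such as '0-3,64-67'."""
    out = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu != prev + 1:
            out.append((start, prev))
            start = cpu
        prev = cpu
    out.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in out)

domains = {
    "domain0 (SMT)": "00000000,00000001,00000000,00000001",
    "domain1 (MC)": "00000000,0000000f,00000000,0000000f",
    "domain2 (NUMA)": "00000000,ffffffff,00000000,ffffffff",
    "domain3 (NUMA)": "ffffffff,ffffffff,ffffffff,ffffffff",
}

counts = {}
for name, mask in domains.items():
    cpus = mask_to_cpus(mask)
    counts[name] = len(cpus)
    print(f"{name}: {len(cpus)} cpus, {ranges(cpus)}")
```

The decoded spans hold 2, 8, 64 and 128 cpus respectively, so none of the levels spans exactly 16 threads.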

sched-domain2 (which represents a sched-domain containing all cpus within a socket) would have 8 sched-groups (based on the cpumasks from domain1). According to the documentation snippet above regarding balancing within a sched-domain, the scheduler will try to do (NUMA) load balancing between those 8 groups (spanning 4 NUMA nodes). Here, IINM, it would be more beneficial if the scheduler tried to load balance between the two groups within the same NUMA node first, before going across NUMA nodes, in order to minimize memory latency. This would require another sched-domain between domain1 and domain2 that represents all 16 threads within a NUMA node (i.e. a die sched-domain). It would allow the scheduler to load balance within the NUMA node first, before going across NUMA nodes.
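
The group counts can be sanity-checked with a short Python sketch. It uses a simplified model in which each group of a domain is exactly the span of the level below it, so the number of groups is the ratio of the two span sizes. The 16-thread die level in the second hierarchy is the hypothetical extra domain discussed above, not something in the current dump.

```python
def groups_per_level(span_sizes):
    """Return, for each level above the lowest, how many child spans fit in it."""
    # Simplified model: groups are the non-overlapping spans of the level below.
    return [parent // child for child, parent in zip(span_sizes, span_sizes[1:])]

# Span sizes in cpus, lowest level first.
current = [2, 8, 64, 128]        # SMT, MC, NUMA (socket), NUMA (system)
with_die = [2, 8, 16, 64, 128]   # same, plus an assumed 16-thread die level

current_groups = groups_per_level(current)
with_die_groups = groups_per_level(with_die)

print("current: ", current_groups)
print("with die:", with_die_groups)
```

In this model the 64-cpu domain balances between 8 groups today. With a die level in between, balancing would first happen between 2 groups inside the node, and only then between 4 node-sized groups.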

However, since the current code decides that x86_has_numa_in_package is true, it omits the die sched-domain. To avoid this, we are proposing to represent cpuinfo_x86.phys_proc_id using the NUMA node ID (i.e. die ID). This is the main point of the patch series.

Thanks,
Suravee