RE: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket

From: Ghannam, Yazen
Date: Tue Jun 27 2017 - 14:32:48 EST


> -----Original Message-----
> From: linux-kernel-owner@xxxxxxxxxxxxxxx [mailto:linux-kernel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Borislav Petkov
> Sent: Tuesday, June 27, 2017 1:44 PM
> To: Suthikulpanit, Suravee <Suravee.Suthikulpanit@xxxxxxx>
> Cc: x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Duran, Leo
> <leo.duran@xxxxxxx>; Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>;
> Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, Jun 27, 2017 at 11:54:12PM +0700, Suravee Suthikulpanit wrote:
> > The 8 threads sharing each L3 are already in the same sched-domain1
> > (MC CCX). So, cpu0 is in the same sched-domain1 as
> > cpu1,2,3,64,65,66,67. Here, we need the DIE sched-domain because it
> > represents all cpus that are in the same NUMA node (since we have one
> > memory controller per DIE).
>
> So this is still confusing. Please drop the "DIE sched-domain" as that is
> something you're trying to define and I'm trying to parse what you're trying to
> define and why.
>
> > IIUC, for Zen, w/o the DIE sched-domain, the scheduler could try to
> > re-balance the tasks from one CCX (schedule group) to another CCX
> > across NUMA node, and
>
> CCX, schedule group, NUMA node, ... now my head is spinning. Do you see
> what I mean with agreeing on the nomenclature and proper term definitions
> first?
>
> > potentially causing unnecessary performance due to remote memory
> > access.
> >
> > Please note also that SRAT/SLIT information are used to derive the
> > NUMA sched-domains, while the DIE sched-domain is non-NUMA
> > sched-domain (derived from CPUID topology extension which is available on
> > newer families).
>
> So let's try to discuss this without using DIE sched-domain, CCX, etc, and let's
> start simple.
>
> So in that die graphic:
>
>    ----------------------------
> C0 | T0 T1 |    ||    | T0 T1 | C4
>    --------|    ||    |--------
> C1 | T0 T1 | L3 || L3 | T0 T1 | C5
>    --------|    ||    |--------
> C2 | T0 T1 | #0 || #1 | T0 T1 | C6
>    --------|    ||    |--------
> C3 | T0 T1 |    ||    | T0 T1 | C7
>    ----------------------------
>
> you want all those threads to belong to a single scheduling group.
> Correct?
>
> Now that thing has a memory controller attached to it, correct?
>
> If so, why is this thing not a logical NUMA node, as described in SRAT/SLIT?
>
> If not, what does a NUMA node entail on Zen as described by SRAT/SLIT?
> I.e., what is the difference between the two things? I.e., how many dies as
> above are in a NUMA node?
>
> Now, SRAT should contain the assignment which core belongs to which node.
> Why is that not sufficient?
>
> Ok, that should be enough questions for now. Let's start with them.
>

This group is a NUMA node. It is the "identity" NUMA node. Linux skips the
identity NUMA node when finding the NUMA levels. This is fine as long as the
MC domain is equivalent to the identity NUMA node. However, this is not the
case on Zen systems.

We could patch sched/topology.c to not skip the identity NUMA node, though
this would affect all systems, not just AMD.

Something like this:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..98d856c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1103,6 +1103,8 @@ void sched_init_numa(void)
 	 * node_distance(i,j) in order to avoid cubic time.
 	 */
 	next_distance = curr_distance;
+	sched_domains_numa_distance[level++] = next_distance;
+	sched_domains_numa_levels = level;
 	for (i = 0; i < nr_node_ids; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
 			for (k = 0; k < nr_node_ids; k++) {

Thanks,
Yazen