Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Daniel Jordan
Date: Wed May 03 2023 - 15:42:10 EST


On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > AMD EPYC 7J13 64-Core Processor (NPS1)
> > 2 sockets * 64 cores * 2 threads = 256 CPUs
> >
> >                      update_load_avg profile%   update_cfs_group profile%
> > affinity nr_threads  base  test  diff           base  test  diff
> > unbound          96   0.7   0.6  -0.1            0.3   0.6   0.4
> > unbound         128   0.8   0.7   0.0            0.3   0.7   0.4
> > unbound         160   2.4   1.7  -0.7            1.2   2.3   1.1
> > unbound         192   2.3   1.7  -0.6            0.9   2.4   1.5
> > unbound         224   0.9   0.9   0.0            0.3   0.6   0.3
> > unbound         256   0.4   0.4   0.0            0.1   0.2   0.1
>
> Is it possible to show per-node profile for the two functions? I wonder
> how the per-node profile changes with and without this patch on Milan.
> And for vanilla kernel, it would be good to know on which node the struct
> task_group is allocated. I used below script to fetch this info:
> kretfunc:sched_create_group
> {
>     $root = kaddr("root_task_group");
>     if (args->parent == $root) {
>         return;
>     }
>
>     printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
>            retval, str(args->parent->css.cgroup->kn->name));
> }

That's helpful; the nid columns below come from this script. The node happened
to differ between the base and test kernels on both machines, so that's one
less way the experiment is controlled. But in the unbound case, where tasks are
presumably spread fairly evenly, I'm not sure how much it matters, especially
since the per-node profile numbers are fairly close to each other.


Data below, same parameters and times as the last mail.

> BTW, is the score(transactions) of the workload stable? If so, how the
> score change when the patch is applied?

Transactions seem to be mostly stable but unfortunately regress overall on both
machines.

FWIW, the t-test compares the two sets of ten iterations apiece. The higher the
percentage, the higher the confidence that the difference between base and test
is significant.


AMD EPYC 7J13 64-Core Processor (NPS1)
2 sockets * 64 cores * 2 threads = 256 CPUs

transactions per second

                           diff                 base                  test
                     ----------------   ------------------   ------------------
                       tps
affinity nr_threads  (%diff)  (t-test)      tps  std%  nid       tps  std%  nid
unbound          96    -0.8%      100%  128,450    0%    1   127,433    0%    0
unbound         128    -1.0%      100%  138,471    0%    1   137,099    0%    0
unbound         160    -1.2%      100%  136,829    0%    1   135,170    0%    0
unbound         192     0.4%       95%  152,767    0%    1   153,336    0%    0
unbound         224    -0.2%       81%  179,946    0%    1   179,620    0%    0
unbound         256    -0.2%       71%  203,920    0%    1   203,583    0%    0
node0            48     0.1%       46%   69,635    0%    0    69,719    0%    0
node0            64    -0.1%       69%   75,213    0%    0    75,163    0%    0
node0            80    -0.4%      100%   72,520    0%    0    72,217    0%    0
node0            96    -0.2%       89%   81,345    0%    0    81,210    0%    0
node0           112    -0.3%       98%   96,174    0%    0    95,855    0%    0
node0           128    -0.7%      100%  111,813    0%    0   111,045    0%    0
node1            48     0.3%       78%   69,985    1%    1    70,200    1%    1
node1            64     0.6%      100%   75,770    0%    1    76,231    0%    1
node1            80     0.3%      100%   73,329    0%    1    73,567    0%    1
node1            96     0.4%       99%   82,222    0%    1    82,556    0%    1
node1           112     0.1%       62%   96,573    0%    1    96,689    0%    1
node1           128    -0.2%       69%  111,614    0%    1   111,435    0%    1

update_load_avg profile%

                         all_nodes           node0              node1
                     ----------------   ----------------   ----------------
affinity nr_threads  base  test  diff   base  test  diff   base  test  diff
unbound          96   0.7   0.6  -0.1    0.7   0.6  -0.1    0.7   0.6  -0.1
unbound         128   0.8   0.7  -0.1    0.8   0.7  -0.1    0.8   0.7  -0.1
unbound         160   2.3   1.7  -0.7    2.5   1.7  -0.8    2.2   1.6  -0.5
unbound         192   2.2   1.6  -0.6    2.5   1.8  -0.7    2.0   1.4  -0.6
unbound         224   0.9   0.8  -0.1    1.1   0.7  -0.3    0.8   0.8   0.0
unbound         256   0.4   0.4   0.0    0.4   0.4   0.0    0.4   0.4   0.0
node0            48   0.7   0.6  -0.1
node0            64   0.8   0.7  -0.2
node0            80   2.0   1.4  -0.7
node0            96   2.3   1.4  -0.9
node0           112   1.0   0.8  -0.2
node0           128   0.5   0.4   0.0
node1            48   0.7   0.6  -0.1
node1            64   0.8   0.6  -0.1
node1            80   1.4   1.2  -0.2
node1            96   1.5   1.3  -0.2
node1           112   0.8   0.7  -0.1
node1           128   0.4   0.4  -0.1

update_cfs_group profile%

                         all_nodes           node0              node1
                     ----------------   ----------------   ----------------
affinity nr_threads  base  test  diff   base  test  diff   base  test  diff
unbound          96   0.3   0.6   0.3    0.3   0.6   0.3    0.3   0.6   0.3
unbound         128   0.3   0.6   0.3    0.3   0.6   0.3    0.3   0.7   0.4
unbound         160   1.1   2.5   1.4    1.3   2.2   0.9    0.9   2.8   1.9
unbound         192   0.9   2.6   1.7    1.1   2.4   1.3    0.7   2.8   2.1
unbound         224   0.3   0.8   0.5    0.4   0.6   0.3    0.2   0.9   0.6
unbound         256   0.1   0.2   0.1    0.1   0.2   0.1    0.1   0.2   0.1
node0            48   0.4   0.6   0.2
node0            64   0.3   0.6   0.3
node0            80   0.7   0.6  -0.1
node0            96   0.6   0.6   0.0
node0           112   0.3   0.4   0.1
node0           128   0.1   0.2   0.1
node1            48   0.3   0.6   0.3
node1            64   0.3   0.6   0.3
node1            80   0.3   0.6   0.3
node1            96   0.3   0.6   0.3
node1           112   0.2   0.3   0.2
node1           128   0.1   0.2   0.1


Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 threads = 128 CPUs

transactions per second

                           diff                 base                  test
                     ----------------   ------------------   ------------------
                       tps
affinity nr_threads  (%diff)  (t-test)      tps  std%  nid       tps  std%  nid
unbound          48    -0.9%      100%   75,500    0%    1    74,834    0%    0
unbound          64    -0.4%      100%   81,687    0%    1    81,368    0%    0
unbound          80    -0.4%      100%   78,620    0%    1    78,281    0%    0
unbound          96    -0.5%       74%   78,949    1%    1    78,580    1%    0
unbound         112    -2.9%       87%   94,189    3%    1    91,458    5%    0
unbound         128    -1.4%      100%  117,557    0%    1   115,921    0%    0
node0            24    -0.7%      100%   38,601    0%    0    38,333    0%    0
node0            32    -1.2%      100%   41,539    0%    0    41,038    0%    0
node0            40    -1.6%      100%   42,325    0%    0    41,662    0%    0
node0            48    -1.3%      100%   41,956    0%    0    41,404    0%    0
node0            56    -1.3%      100%   42,115    0%    0    41,569    0%    0
node0            64    -1.0%      100%   62,431    0%    0    61,784    0%    0
node1            24     0.0%        1%   38,752    0%    1    38,752    0%    1
node1            32     0.9%      100%   42,568    0%    1    42,943    0%    1
node1            40    -0.2%       87%   43,452    0%    1    43,358    0%    1
node1            48    -0.5%      100%   43,047    0%    1    42,831    0%    1
node1            56    -0.5%      100%   43,464    0%    1    43,259    0%    1
node1            64     0.5%      100%   64,111    0%    1    64,450    0%    1

update_load_avg profile%

                         all_nodes           node0              node1
                     ----------------   ----------------   ----------------
affinity nr_threads  base  test  diff   base  test  diff   base  test  diff
unbound          48   0.5   0.5   0.0    0.5   0.5   0.0    0.4   0.5   0.0
unbound          64   0.5   0.5   0.0    0.5   0.5   0.0    0.5   0.5   0.0
unbound          80   2.0   1.8  -0.3    2.0   1.7  -0.3    2.0   1.8  -0.2
unbound          96   3.4   2.8  -0.6    3.4   2.8  -0.6    3.4   2.9  -0.5
unbound         112   2.5   2.3  -0.1    4.5   3.8  -0.8    0.5   0.9   0.5
unbound         128   0.4   0.5   0.0    0.4   0.4   0.0    0.5   0.5   0.1
node0            24   0.4   0.5   0.0
node0            32   0.5   0.5   0.0
node0            40   1.0   1.1   0.1
node0            48   1.5   1.6   0.1
node0            56   1.8   1.9   0.1
node0            64   0.4   0.4   0.0
node1            24   0.5   0.4   0.0
node1            32   0.5   0.4   0.0
node1            40   1.0   1.1   0.0
node1            48   1.6   1.6   0.1
node1            56   1.9   1.9   0.0
node1            64   0.4   0.4  -0.1


update_cfs_group profile%

                         all_nodes           node0              node1
                     ----------------   ----------------   ----------------
affinity nr_threads  base  test  diff   base  test  diff   base  test  diff
unbound          48   0.3   0.5   0.2    0.3   0.5   0.2    0.3   0.5   0.2
unbound          64   0.5   0.6   0.1    0.5   0.6   0.1    0.5   0.6   0.1
unbound          80   2.8   2.5  -0.3    2.6   2.4  -0.2    2.9   2.5  -0.5
unbound          96   3.7   3.3  -0.4    3.5   3.3  -0.2    3.9   3.3  -0.6
unbound         112   4.2   3.2  -1.0    4.1   3.3  -0.7    4.4   3.1  -1.2
unbound         128   0.4   0.5   0.1    0.4   0.5   0.1    0.4   0.5   0.1
node0            24   0.3   0.5   0.2
node0            32   0.3   0.4   0.1
node0            40   0.7   0.8   0.1
node0            48   0.8   0.9   0.1
node0            56   0.8   0.9   0.1
node0            64   0.2   0.4   0.1
node1            24   0.3   0.5   0.2
node1            32   0.3   0.5   0.2
node1            40   0.8   0.9   0.1
node1            48   0.8   0.9   0.1
node1            56   0.9   0.9   0.1
node1            64   0.2   0.4   0.1


There doesn't seem to be much of a pattern in the per-node breakdown.
Sometimes there's a bit more overhead on the node remote to the task_group
allocation than on the node local to it, as I'd expect, and sometimes it's the
opposite. Generally it's pretty even.

> > node0            48   0.7   0.6  -0.1   0.3   0.6   0.3
> > node0            64   0.7   0.7  -0.1   0.3   0.6   0.3
> > node0            80   1.4   1.3  -0.1   0.3   0.6   0.3
> > node0            96   1.5   1.4  -0.1   0.3   0.6   0.3
> > node0           112   0.8   0.8   0.0   0.2   0.4   0.2
> > node0           128   0.4   0.4   0.0   0.1   0.2   0.1
> > node1            48   0.7   0.6  -0.1   0.3   0.6   0.3
> > node1            64   0.7   0.6  -0.1   0.3   0.6   0.3
> > node1            80   1.4   1.2  -0.1   0.3   0.6   0.3
> > node1            96   1.4   1.3  -0.2   0.3   0.6   0.3
> > node1           112   0.8   0.7  -0.1   0.2   0.3   0.2
> > node1           128   0.4   0.4   0.0   0.1   0.2   0.1
>
> I can see why the cost of update_cfs_group() slightly increased since
> now there is no cross node access to tg->load_avg and the patched kernel
> doesn't provide any benefit but only incur some overhead due to indirect
> access to tg->load_avg, but why update_load_avg()'s cost dropped? I
> expect it to be roughly the same after patched or slightly increased.

Yeah, that's not immediately obvious, especially since the Intel machine
doesn't show the same drop.