Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

From: Abel Wu
Date: Mon Jun 20 2022 - 10:23:37 EST



On 6/20/22 7:20 PM, K Prateek Nayak wrote:
Hello Tim,

Thank you for looking into this.

On 6/17/2022 10:20 PM, Tim Chen wrote:
On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:


--
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9f3dc6dcbf4..97a3895416ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
return sd;
}
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
+DECLARE_PER_CPU(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
-DECLARE_PER_CPU(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
--

The System.map of each kernel is as follows:

- On "tip"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc

- On "tip + Patch 1 only" and "tip + both patches"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_cluster <-----
0000000000020538 D sd_llc_shared
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_share_id <-----
0000000000020544 D sd_llc_id
0000000000020548 D sd_llc_size
0000000000020550 D sd_llc


- On "tip + both patches (Move declaration to top)"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc
0000000000020548 D sd_share_id <-----
0000000000020550 D sd_cluster <-----

Wonder if it will help to try keeping sd_llc and sd_llc_size on the same
cache line. They are both used in the wakeup path.

We are still evaluating which set of variables to keep on the same cache
line to get the best results.
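For example, one arrangement we could experiment with (purely an untested
sketch with made-up names, not part of either patch) is to group the fields
read together in the wakeup path into a single cacheline-aligned per-CPU
struct, so they stay on one cache line by construction rather than by link
order:

/*
 * Hypothetical grouping for illustration only: keep the wakeup-path
 * data together in one cacheline-aligned per-CPU structure.
 */
struct sd_wakeup_info {
	struct sched_domain __rcu	*sd_llc;
	int				sd_llc_size;
	int				sd_llc_id;
	struct sched_domain_shared __rcu *sd_llc_shared;
} ____cacheline_aligned;

DEFINE_PER_CPU(struct sd_wakeup_info, sd_wakeup_info);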

I would have expected the two kernel variants - "tip" and
"tip + both patches (Move declaration to top)" - to give similar results,
as the System.map entries for all the old variables remain the same and
the added "sd_share_id" and "sd_cluster" fit in the gap after "sd_llc".
However, we now see a regression at higher client counts.

This probably hints that the access to the "sd_cluster" variable added in
Patch 2, which pulls in the extra cache line, could be responsible for the
regression we see with "tip + both patches (Move declaration to top)".



0000000000020548 D sd_share_id <-----
0000000000020550 D sd_cluster <-----

Or change the layout a bit to see if there's any difference,
like:

DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);

I need to look into this further and run some tests on an SMT machine. Would
you mind sharing the kernel config as well? I'd like to compare the configs.

I've attached the kernel config used to build the test kernel
to this mail.

Thanks,
Yicong

We are trying to debug the issue using perf and to find an arrangement of
the per-CPU declarations that keeps the data used in the wakeup path on the
same 64B cache line.

Comparing the perf c2c profiles of tip and the "move new declarations to
the top" case could be useful. It may give some additional clues about
possible false-sharing issues.
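For example, roughly (exact workload and options are up to you):

  perf c2c record -a -- <benchmark command>
  perf c2c report --stats

on both kernels, and then compare the shared-cacheline/HITM statistics
between the two.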

Thank you for the suggestion. We are currently looking at perf counter
data to see how the cache efficiency has changed between the two kernels.
We suspect that needing the data from the additional cache line in the
wakeup path is resulting in higher cache misses at the levels closer to
the core.

I don't think it is a false-sharing problem, as these per-CPU variables are
set when the system first boots and are only changed again during a CPU
hotplug operation. However, it might be beneficial to look at the c2c
profile if the perf counters don't indicate anything out of the ordinary.


Could it be that some other frequently-written variables share the same
cacheline with these per-CPU data, causing false sharing? What about making
all of these sd_* variables DEFINE_PER_CPU_READ_MOSTLY?
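That is, roughly something like the below in kernel/sched/topology.c
(untested; READ_MOSTLY also moves these variables into the read-mostly
per-CPU section, so they would no longer sit next to frequently-written
per-CPU data):

DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_size);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_id);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_share_id);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_cluster);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_asym_cpucapacity);

with the matching DECLARE_PER_CPU_READ_MOSTLY() changes in
kernel/sched/sched.h.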