Re: [PATCH v3] RISC-V: Probe misaligned access speed in parallel

From: Evan Green
Date: Tue Nov 07 2023 - 12:26:46 EST


On Tue, Nov 7, 2023 at 12:34 AM Sebastian Andrzej Siewior
<bigeasy@xxxxxxxxxxxxx> wrote:
>
> On 2023-11-06 14:58:55 [-0800], Evan Green wrote:
> > Probing for misaligned access speed takes about 0.06 seconds. On a
> > system with 64 cores, doing this in smp_callin() means it's done
> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> >
> > Instead of measuring each CPU serially, let's do the measurements on
> > all CPUs in parallel. If we disable preemption on all CPUs, the
> > jiffies stop ticking, so we can do this in stages of 1) everybody
> > except core 0, then 2) core 0. The allocations are all done outside of
> > on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
> >
> > For hotplugged CPUs that come in after the boot time measurement,
> > register CPU hotplug callbacks, and do the measurement there. Interrupts
> > are enabled in those callbacks, so they're fine to do alloc_pages() in.
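
In call-flow terms, the staged approach described above boils down to
the two calls in the patch below:

	/* Stage 1: every online CPU except 0; CPU0 keeps jiffies ticking. */
	on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);

	/* Stage 2: CPU0 takes its turn once the others are done. */
	smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
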
>
> I think this is dragged out of proportion. I would do this (if the need
> can't be identified by CPU-ID or so) on the boot CPU only. If there is
> evidence/proof/blessing from the high RISC-V council that different
> types of CPU cores are mixed together then this could be extended.
> You brought big.LITTLE up in the other thread. This is actually known.
> Same as with hyper-threads on x86, you know which CPU is the core and
> which hyper-thread (CPU) belongs to it.
> So in terms of big.LITTLE you _could_ limit this to one big and one
> little core instead of running it on all.

Doing it on one per cluster might also happen to work, but I still see
nothing that prevents variety within a cluster, so I'm not comfortable
with that assumption. It also doesn't buy much. I'm not sure what kind
of guidance RVI is providing on integrating multiple CPUs into a
system. I haven't seen any myself, but am happy to reassess if there's
documentation banning the scenarios I'm imagining.

>
> But this is just my view on this. From PREEMPT_RT's point of view, the
> way you restructured the memory allocation should work now.

Thanks!

>
> > Reported-by: Jisheng Zhang <jszhang@xxxxxxxxxx>
> > Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
> > Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> > Signed-off-by: Evan Green <evan@xxxxxxxxxxxx>
> >
> >
> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> > index 6a01ded615cd..fe59e18dbd5b 100644
> > --- a/arch/riscv/kernel/cpufeature.c
> > +++ b/arch/riscv/kernel/cpufeature.c
>
> >
> > -static int __init check_unaligned_access_boot_cpu(void)
> > +/* Measure unaligned access on all CPUs present at boot in parallel. */
> > +static int check_unaligned_access_all_cpus(void)
> > {
> > - check_unaligned_access(0);
> > + unsigned int cpu;
> > + unsigned int cpu_count = num_possible_cpus();
> > + struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
> > + GFP_KERNEL);
>
> kcalloc(). For beauty reasons you could try a reverse xmas tree.
>
> > +
> > + if (!bufs) {
> > + pr_warn("Allocation failure, not measuring misaligned performance\n");
> > + return 0;
> > + }
> > +
> > + /*
> > + * Allocate separate buffers for each CPU so there's no fighting over
> > + * cache lines.
> > + */
> > + for_each_cpu(cpu, cpu_online_mask) {
> > + bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
> > + if (!bufs[cpu]) {
> > + pr_warn("Allocation failure, not measuring misaligned performance\n");
> > + goto out;
> > + }
> > + }
> > +
> > + /* Check everybody except 0, who stays behind to tend jiffies. */
> > + on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);
>
> Comments! _HOW_ do you ensure that CPU0 is left out? You don't. CPU0
> does this and then leaves, which is a waste. Using on_each_cpu_cond()
> could deal with this. And you have the check within the wrapper
> (check_unaligned_access_nonboot_cpu()) anyway.
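
For reference, a minimal sketch of the on_each_cpu_cond() variant being
suggested; the predicate name here is illustrative, not from the patch:

	static bool dont_check_boot_cpu(int cpu, void *info)
	{
		/* Leave CPU0 out so it can keep jiffies ticking. */
		return cpu != 0;
	}

	on_each_cpu_cond(dont_check_boot_cpu,
			 check_unaligned_access_nonboot_cpu, bufs, 1);
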
>
> > + /* Check core 0. */
> > + smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
>
> Now that comment is obvious. If you want to add a comment, why not state
> why CPU0 has to be done last?
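
Something like this, perhaps (the wording is mine, not from the patch):

	/*
	 * Check CPU0 last: while the other CPUs measure with preemption
	 * disabled, CPU0 stays behind to keep jiffies ticking. Once they
	 * finish, CPU0 can take its own turn.
	 */
	smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
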
>
> > +
> > + /* Setup hotplug callback for any new CPUs that come online. */
> > + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
> > + riscv_online_cpu, NULL);
> Instead of riscv:online you could use riscv:unaligned_check or
> something that pinpoints the callback to something obvious. This is
> exported via sysfs.
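
That is, something along these lines, with the name from the suggestion
above:

	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
				  "riscv:unaligned_check",
				  riscv_online_cpu, NULL);
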
>
> Again, the comment is obvious. For that to make sense would require
> RISC-V to support physical hotplug. For a KVM-like environment (where
> you can plug in CPUs later) this probably doesn't make sense at all.
> Why not? Because
>
> - without explicit CPU pinning your slow/fast CPU mapping (host <->
> guest) could change if the scheduler on the host moves the threads
> around.

Taking a system with non-identical cores and allowing vcpus to bounce
between them sounds like a hypervisor configuration issue to me,
regardless of this patch.

>
> - without explicit task offload and resource partitioning on the host
> your guest thread might get interrupted during the measurement. This
> is done during boot so chances are high that it runs 100% of its time
> slice and will be preempted once other tasks on the host ask for CPU
> run time.

The measurement takes the best (lowest time) iteration. So unless
every iteration gets interrupted, I should get a good read in there
somewhere.
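
For reference, the probe keeps the best of several timed iterations,
roughly along these lines; the names here are illustrative, not the
actual cpufeature.c code:

	u64 best = U64_MAX;
	int i;

	for (i = 0; i < PROBE_ITERATIONS; i++) {
		u64 start = get_cycles();

		do_misaligned_copy(buf);
		best = min(best, (u64)(get_cycles() - start));
	}

A single interruption only spoils the iterations it lands in, not the
final result.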
-Evan