Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

From: Paul E. McKenney
Date: Tue Jan 22 2019 - 22:23:05 EST


On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
> Thanks for your quick reply! Paul
>
> On 1/22/19 12:01 PM, Paul E. McKenney wrote:
> >On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
> >>Hi, guys
> >> While running rcutorture tests with "onoff_interval", some tests
> >>failed and the results look like this:
> >>
> >>=====================================================================
> >>[ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
> >>nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
> >>test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1
> >>fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
> >>test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
> >>stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0
> >>n_barrier_cbs=0 onoff_interval=3 onoff_holdoff=0
> >>====================================================================
> >>
> >>I am wondering about the meaning of "RCU_HOTPLUG". Is it expected because
> >>CPU hotplug is enabled in the test? Or does it represent another type of
> >>failure?
> >
> >This says that at least one CPU hotplug operation failed, that is,
> >the CPU didn't actually come online or go offline as requested. If you
> >are introducing CPU hotplug to an architecture, this usually indicates
> >that you have bugs in your CPU-hotplug code. Or it might be that
>
> It should be the former case, since there are no RCU CPU stall warnings.
>
> >RCU grace periods failed to progress -- though this would normally
> >also result in RCU CPU stall warnings.
> >
> >There should be lines containing "ver:" in your console output. What
> >does one of the later ones say?
> >
>
> The line says:
> ======================================================================
> [ 318.850175] busted_srcud-torture: rtc: (null) ver:
> 27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
> 0 rtbre: 0 rtbf: 0 rtb: 0
> nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
> (HZ=1000) barrier: 0/0:0

Yes, the "onoff:" field shows 5310 offline attempts but only 2640
successes, which is why RCU_HOTPLUG was printed.
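
For reference, here is a minimal user-space sketch of how that field
can be checked by hand. It assumes the first four numbers after
"onoff:" are online successes/attempts followed by offline
successes/attempts, which matches the line quoted above. The struct and
function names are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Counters from the "onoff:" field (assumed layout, see above). */
struct onoff_counts {
	long online_ok, online_tries;
	long offline_ok, offline_tries;
};

/*
 * Locate "onoff:" in a console line and parse its four leading
 * counters.  Returns false if the field is missing or malformed.
 */
static bool parse_onoff(const char *line, struct onoff_counts *c)
{
	const char *p;

	assert(line != NULL && c != NULL);
	p = strstr(line, "onoff:");
	if (!p)
		return false;
	return sscanf(p, "onoff: %ld/%ld:%ld/%ld",
		      &c->online_ok, &c->online_tries,
		      &c->offline_ok, &c->offline_tries) == 4;
}

/* True if any online or offline attempt did not succeed. */
static bool hotplug_failed(const struct onoff_counts *c)
{
	assert(c != NULL);
	return c->online_ok != c->online_tries ||
	       c->offline_ok != c->offline_tries;
}
```

Feeding it the statistics line quoted above gives 2639/2639 for online
and 2640/5310 for offline, so hotplug_failed() reports a failure.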

> =====================================================================
>
> And here are useful errors:
> =====================================================================
> kern :info : [ 135.379693] KVM setup async PF for cpu 1
> kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
> kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff

Just so you know, busted_srcud can sometimes fail by design. Hence
the "busted" in the name. But that failure didn't happen this time.

> task: onlined 1
> kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff
> task: offlining 1
> kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
> kern :info : [ 135.427940] smpboot: CPU 1 is now offline
> kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff
> task: offlined 1
> kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff
> task: offlining 0
> kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff
> task: offline 0 failed: errno -16
> kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff
> task: offlining 0
> kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff
> task: offline 0 failed: errno -16
>
>
> =====================================================================
> There are only two CPUs on the VM. The torture test tries to offline
> the last one, but -EBUSY occurred.
>
> I spent time to understand kernel/torture.c.
> There is torture_onoff():
>
>     while (!torture_must_stop()) {
>         cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
>         if (!torture_offline(cpu,
>                              &n_offline_attempts, &n_offline_successes,
>                              &sum_offline, &min_offline, &max_offline))
>             torture_online(cpu,
>                            &n_online_attempts, &n_online_successes,
>                            &sum_online, &min_online, &max_online);
>         schedule_timeout_interruptible(onoff_interval);
>     }
>
> torture_offline() and torture_online() don't check in advance whether
> the current CPU is the only usable one.

That does appear to be the case, and that would be a problem with
the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.

Good catch!
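
To make the failure mode concrete, here is a rough user-space model of
the loop quoted above. Everything in it (struct sim, sim_offline(),
sim_online(), sim_run(), the two-CPU setup, and the explicit list of
CPU picks in place of torture_random()) is hypothetical and only mimics
the counting. It is a sketch, not the kernel implementation. The
"guard" flag models skipping the request when only one CPU is online,
before the attempt is counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 2	/* Two CPUs, as on the VM described above. */

struct sim {
	bool online[NCPUS];
	long offline_tries, offline_ok;
	long online_tries, online_ok;
};

static int num_online(const struct sim *s)
{
	int i, n = 0;

	for (i = 0; i < NCPUS; i++)
		n += s->online[i];
	return n;
}

/* Try to take cpu offline.  Returns true only if it went offline. */
static bool sim_offline(struct sim *s, int cpu, bool guard)
{
	if (!s->online[cpu])
		return false;
	if (guard && num_online(s) <= 1)
		return false;	/* Skipped, so no attempt is counted. */
	s->offline_tries++;
	if (num_online(s) <= 1)
		return false;	/* Models the -EBUSY failure. */
	s->online[cpu] = false;
	s->offline_ok++;
	return true;
}

/* Bring cpu online if it is offline.  Always succeeds in this model. */
static bool sim_online(struct sim *s, int cpu)
{
	if (s->online[cpu])
		return false;
	s->online_tries++;
	s->online[cpu] = true;
	s->online_ok++;
	return true;
}

/* Replay a fixed sequence of CPU picks through the on/off loop. */
static struct sim sim_run(const int *cpus, size_t n, bool guard)
{
	struct sim s = { .online = { true, true } };
	size_t i;

	for (i = 0; i < n; i++) {
		assert(cpus[i] >= 0 && cpus[i] < NCPUS);
		if (!sim_offline(&s, cpus[i], guard))
			sim_online(&s, cpus[i]);
	}
	return s;
}
```

With the picks 1, 0, 0, 1 (the same order as in the console log above),
the unguarded run counts three offline attempts and one success, while
the guarded run counts one attempt and one success.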

> Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
> there is only one online, hotpluggable CPU, then
> n_offline_successes != n_offline_attempts, which causes "End of test:
> RCU_HOTPLUG".
>
> Did I misunderstand something above? Feel free to correct me.

Does the following patch help?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/torture.c b/kernel/torture.c
index a03ff722352b..2b6700ca2a43 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,

 	if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
 		return false;
+	if (num_online_cpus() <= 1)
+		return false;  /* Can't offline the last CPU. */
 
 	if (verbose > 1)
 		pr_alert("%s" TORTURE_FLAG