Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

From: Su Yue
Date: Thu Jan 24 2019 - 01:52:08 EST




On 1/23/19 11:22 AM, Paul E. McKenney wrote:
On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
Thanks for your quick reply! Paul

On 1/22/19 12:01 PM, Paul E. McKenney wrote:
On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
Hi, guys
While running rcutorture tests with "onoff_interval", some tests
failed and results show like:

=====================================================================
[ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\
s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\
rrier_cbs=0 onoff_interval=3 onoff_holdoff=0
====================================================================

I am wondering that meaning of "RCU_HOTPLUG". Is it expected because
cpu hotplug is enabled in the test? Or just represents another type of
failure?

This says that at least one CPU hotplug operation failed, that is,
the CPU didn't actually come online or go offline as requested. If you
are introducing CPU hotplug to an architecture, this usually indicates
that you have bugs in your CPU-hotplug code. Or it nmight be that

It should hit the case since there is no RCU CPU stall warnings.

RCU grace periods failed to progress -- though this would normally
also result in RCU CPU stall warnings.

There should be lines containing "ver:" in your console output. What
does one of the later one of these say?


The line says:
======================================================================
[ 318.850175] busted_srcud-torture: rtc: (null) ver:
27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
0 rtbre: 0 rtbf: 0 rtb: 0 \
nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
(HZ=1000) barrier: 0/0:0

Yes, you have many more offline attempts than successes, which is
why RCU_HOTPLUG was printed.

=====================================================================

And here are useful errors:
=====================================================================
kern :info : [ 135.379693] KVM setup async PF for cpu 1
kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff

Just so your know, busted_srcud can sometimes fail by design. Hence
the "busted" in the name. But failure didn't happen this time.


Yes..The corner case I mentioned actually happened in every "onoff"
tests whatever the torture_type is.

task: onlined 1
kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff
task: offlining 1
kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
kern :info : [ 135.427940] smpboot: CPU 1 is now offline
kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff
task: offlined 1
kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff
task: offlining 0
kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff
task: offline 0 failed: errno -16
kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff
task: offlining 0
kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff
task: offline 0 failed: errno -16


=====================================================================
There are only two CPUs on the VM. Torture try to offline the last one
but -EBUSY occured.

I spent time to understand kernel/torture.c.
There is torture_onoff():

225 while (!torture_must_stop()) {
226 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
227 if (!torture_offline(cpu,
228 &n_offline_attempts,
&n_offline_successes,
229 &sum_offline, &min_offline,
&max_offline))
230 torture_online(cpu,
231 &n_online_attempts,
&n_online_successes,
232 &sum_online, &min_online,
&max_online);
233 schedule_timeout_interruptible(onoff_interval);
234 }
235

torture_offline() and torture_offline() don't pre judge if the current
cpu is only one usable.

That does appear to be the case, and that would be a problem with
the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.

Good catch!

Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
there are only one oneline and hotplugable cpux, then
n_offline_successes != n_offline_attempts which caused "End of test:
RCU_HOTPLUG".

Does I misunderstand something above? Feel free to correct me.

Does the following patch help?


Yes, no more "errnor: -16" in dmesg and "End of test: SUCCESS" is in
the end.

Thanks for your patch.
If the patch is to be sent in format, you can add:

Tested-By: Su Yue <suy.fnst@xxxxxxxxxxxxxx>


---
Su
Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/torture.c b/kernel/torture.c
index a03ff722352b..2b6700ca2a43 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
return false;
+ if (num_online_cpus() <= 1)
+ return false; /* Can't offline the last CPU. */
if (verbose > 1)
pr_alert("%s" TORTURE_FLAG