Re: [PATCH 1/5] rcutorture: Fix stuttering races and other issues

From: Joel Fernandes
Date: Thu Jul 27 2023 - 00:03:45 EST


On 7/27/23 00:01, Paul E. McKenney wrote:
On Wed, Jul 26, 2023 at 11:01:40PM -0400, Joel Fernandes wrote:
On Wed, Jul 26, 2023 at 4:59 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:

On Tue, Jul 25, 2023 at 11:29:06PM +0000, Joel Fernandes (Google) wrote:
The stuttering code isn't functioning as expected. Ideally, it should
pause the torture threads for a designated period before resuming. Yet,
it fails to halt the test for the correct duration. Additionally, a race
condition exists, potentially causing the stuttering code to pause for
an extended period if the 'spt' variable is non-zero due to the stutter
orchestration thread's inadequate CPU time.

Moreover, over-stuttering can hinder RCU's progress on TREE07 kernels.
This happens as the stuttering code may run within a softirq due to RCU
callbacks. Consequently, ksoftirqd keeps a CPU busy for several seconds,
thus obstructing RCU's progress. This situation triggers a warning
message in the logs:

[ 2169.481783] rcu_torture_writer: rtort_pipe_count: 9

This warning suggests that an RCU torture object, although invisible to
RCU readers, couldn't make it past the pipe array and be freed -- a
strong indication that there weren't enough grace periods during the
stutter interval.

To address these issues, this patch sets the "stutter end" time to an
absolute point in the future set by the main stutter thread. This is
then used for waiting in stutter_wait(). While the stutter thread still
defines this absolute time, the waiters' waiting logic doesn't rely on
the stutter thread receiving sufficient CPU time to halt the stuttering
as the halting is now self-controlled.

Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
---
kernel/torture.c | 46 +++++++++++++---------------------------------
1 file changed, 13 insertions(+), 33 deletions(-)

diff --git a/kernel/torture.c b/kernel/torture.c
index 68dba4ecab5c..63f8f2a7d960 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -719,7 +719,7 @@ static void torture_shutdown_cleanup(void)
* suddenly applied to or removed from the system.
*/
static struct task_struct *stutter_task;
-static int stutter_pause_test;
+static ktime_t stutter_till_abs_time;
static int stutter;
static int stutter_gap;

@@ -729,30 +729,17 @@ static int stutter_gap;
*/
bool stutter_wait(const char *title)
{
- unsigned int i = 0;
bool ret = false;
- int spt;
+ ktime_t now_ns, till_ns;

cond_resched_tasks_rcu_qs();
- spt = READ_ONCE(stutter_pause_test);
- for (; spt; spt = READ_ONCE(stutter_pause_test)) {
- if (!ret && !rt_task(current)) {
- sched_set_normal(current, MAX_NICE);
- ret = true;
- }
- if (spt == 1) {
- torture_hrtimeout_jiffies(1, NULL);
- } else if (spt == 2) {
- while (READ_ONCE(stutter_pause_test)) {
- if (!(i++ & 0xffff))
- torture_hrtimeout_us(10, 0, NULL);
- cond_resched();
- }
- } else {
- torture_hrtimeout_jiffies(round_jiffies_relative(HZ), NULL);
- }
- torture_shutdown_absorb(title);
+ now_ns = ktime_get();
+ till_ns = READ_ONCE(stutter_till_abs_time);
+ if (till_ns && ktime_before(now_ns, till_ns)) {
+ torture_hrtimeout_ns(ktime_sub(till_ns, now_ns), 0, NULL);

This ktime_sub() is roughly cancelled out by a ktime_add_safe() in
__hrtimer_start_range_ns().

Yes, functionally it is the same but your suggestion is more robust I think.

Perhaps torture_hrtimeout_ns() needs to
take a mode argument as in the patch at the end of this email, allowing
you to ditch that ktime_sub() in favor of HRTIMER_MODE_ABS.

Sure, or we can add a new API and keep the default as relative?

Or have 2 APIs:
torture_hrtimeout_relative_ns();

and:
torture_hrtimeout_absolute_ns();

That makes it more readable IMHO.

Also, do you want me to make both changes (API and usage) in the same
patch? Or were you planning to have a separate patch yourself in -dev
which I can use? Let me know either way, and then I'll refresh the
patch.

I queued the patch on the -rcu tree's "dev" branch. It turns out that
torture_hrtimeout_ns() isn't called very many times, so adding the
parameter was straightforward. Plus the compiler might well optimize
it away anyway.

Ok sounds good, I will make use of it in this patch and send it again after testing.

thanks,

- Joel