Re: [BUG] Re: Linux 6.4.4

From: Joel Fernandes
Date: Fri Jul 21 2023 - 08:13:55 EST


On 7/20/23 15:47, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 03:32:35PM -0400, Joel Fernandes wrote:
On 7/20/23 15:04, Paul E. McKenney wrote:
On Thu, Jul 20, 2023 at 12:31:13PM -0400, Joel Fernandes wrote:
Hi Paul,

On Thu, Jul 20, 2023 at 11:55 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:

On Thu, Jul 20, 2023 at 01:27:14PM +0000, Joel Fernandes wrote:
[...]

So likely RCU boosting is failing:

The full TREE03 splat:
[ 54.243588] ------------[ cut here ]------------
[ 54.244547] rcu-torture: rcu_torture_boost started
[...]
[ 54.300499] RIP: 0010:rcu_torture_stats_print+0x5b2/0x620
[...]
[ 2169.481783] rcu_torture_writer: rtort_pipe_count: 9

However, if we are to believe the '9', it appears the object made it quite some
way toward the end of the pipe array, but not all the way to the free pool.
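
For reference, the element layout in play here, paraphrased from
kernel/rcu/rcutorture.c (not verbatim; exact fields may differ across versions):

	/* Paraphrased from kernel/rcu/rcutorture.c; not verbatim. */
	struct rcu_torture {
		struct rcu_head rtort_rcu;	/* handed to call_rcu() for the deferred free */
		int rtort_pipe_count;		/* grace-period stages seen since retirement */
		struct list_head rtort_free;	/* on rcu_torture_freelist when idle */
		int rtort_mbtest;		/* nonzero while used for ordering checks */
	};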

This is from this if/for statement, correct?

	stutter_waited = stutter_wait("rcu_torture_writer");
	if (stutter_waited &&
	    !atomic_read(&rcu_fwd_cb_nodelay) &&
	    !cur_ops->slow_gps &&
	    !torture_must_stop() &&
	    boot_ended)
		for (i = 0; i < ARRAY_SIZE(rcu_tortures); i++)
			if (list_empty(&rcu_tortures[i].rtort_free) &&
			    rcu_access_pointer(rcu_torture_current) !=
			    &rcu_tortures[i]) {
				tracing_off();
				show_rcu_gp_kthreads();
				WARN(1, "%s: rtort_pipe_count: %d\n", __func__, rcu_tortures[i].rtort_pipe_count);
				rcu_ftrace_dump(DUMP_ALL);
			}

Yes, that's right.

If so, this happens when there was a stutter wait, but RCU grace
periods failed to clear out the backlog during the several seconds that
rcutorture was forced idle. This might be related to the RCU priority
boosting failure, in which a preempted reader persisted across the
stutter interval.

When RCU is operating normally, shouldn't the
"list_empty(&rcu_tortures[i].rtort_free)" check remain true only until the
preempted reader unblocks and exits its RCU read-side critical section?

Yes, but not just "until", but rather "long after". If RCU is doing
grace periods correctly, an active reader on a given rcu_tortures[]
element will prevent .rtort_pipe_count from exceeding the value 2.

Ah ok, so rtort_pipe_count being 9 is a sign that RCU isn't making progress,
which is what keeps the object off the free list.

Yes, though RCU is -just- -barely- too slow, as one more grace period
would have done it.

The element will not be put back on the free list until .rtort_pipe_count is
equal to RCU_TORTURE_PIPE_LEN, which is 10.
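
For the curious, that advance happens one stage per grace period, along the
lines of this sketch (paraphrasing rcutorture's rcu_torture_pipe_update_one();
not the verbatim code):

	/* Sketch only; called once per grace period for each element in flight. */
	static bool pipe_update_one_sketch(struct rcu_torture *rp)
	{
		int i = READ_ONCE(rp->rtort_pipe_count);

		if (i > RCU_TORTURE_PIPE_LEN)
			i = RCU_TORTURE_PIPE_LEN;
		WRITE_ONCE(rp->rtort_pipe_count, i + 1);
		if (i + 1 >= RCU_TORTURE_PIPE_LEN) {
			rp->rtort_mbtest = 0;
			return true;	/* caller returns the element to the free list */
		}
		return false;		/* still in the pipe, so off the free list */
	}

On that model, a count of 9 means the element was exactly one grace period
short of being freed when the post-stutter check ran, matching "one more grace
period would have done it" above.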

This warning usually appears when something is holding up the grace-period
kthread. Historically, this has included deadlocks, missed timers,
and whatever else can prevent the grace-period kthread from running.

Makes sense.

One thing that confuses me: in the case of
"cur_ops->deferred_free(old_rp);", the earlier do-while loop may exit
before the async callbacks have finished. So what prevents the
"list_empty(&rcu_tortures[i].rtort_free)" check from running before the
needed grace periods have elapsed? Thanks for any clarification.

We only enter this code if the stutter_wait() actually waited, and by
default this function will wait about five seconds. Since the rcutorture
testing goes idle during this time period (or is supposed to!), if things
are working properly, knocking off ten grace periods during that time
should be pretty much a given.
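
As a mental model of that wait (the real stutter_wait() in kernel/torture.c
is more involved, and the flag below is made up):

	/* Mental model only; not the kernel/torture.c implementation.  Every
	 * torture thread parks here, mostly asleep, for the multi-second
	 * stutter interval, so a healthy RCU should easily retire the
	 * RCU_TORTURE_PIPE_LEN grace periods needed to drain the pipe.
	 */
	static bool stutter_wait_model(const char *title)
	{
		bool waited = false;

		while (READ_ONCE(stutter_gate_closed)) {	/* hypothetical flag */
			waited = true;
			schedule_timeout_interruptible(HZ / 20);
		}
		return waited;	/* the writer applies its checks only if it really idled */
	}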

Sure, makes sense. And this is not Lazy-RCU, so 5 seconds should be plenty
;). I think I was subconsciously expecting an rcu_barrier() somewhere in the
code before those checks, but that's not needed since, as you pointed out,
the stutter should give RCU enough time to make progress.

And there might need to be a call_rcu_hurry() in there somewhere,
now that you mention it. Which would pretty much defeat any sort of
lazy-RCU-callback testing in rcutorture, but testing of laziness might
need to be separate anyway.
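
If it came to that, the change would presumably be a one-liner in rcutorture's
deferred-free path, something like this untested sketch (call_rcu_hurry()
being the non-lazy variant that goes with CONFIG_RCU_LAZY):

	/* Untested sketch: bypass callback laziness for the deferred free so
	 * a CONFIG_RCU_LAZY kernel cannot sit on the callback for seconds
	 * and trip the post-stutter check.
	 */
	static void rcu_torture_deferred_free(struct rcu_torture *p)
	{
		call_rcu_hurry(&p->rtort_rcu, rcu_torture_cb);
	}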

So hmm, the count being 9 means that not enough RCU grace periods have
passed for the rcu_torture object in question, thus keeping it allocated.
The GP kthread not getting CPU time can indeed do that, or perhaps
something else stalling RCU, such as a preempted reader, lengthy preemption
disabling on a CPU, and so forth. I'll try to collect a trace when it
happens.

Looking forward to seeing what you come up with!

So far I found this: before the crash, GPs took about 50ms; during the crash, the GP ran for 5 seconds before the warning, which aligns with what you mentioned about stutter.


The GP that never completed is at this line:

[ 2816.041082] <...>-13 13d.... 1237016139us : rcu_grace_period: rcu_sched 144681 start

And the fqs loop got a "dti" for CPUs 1, 2, and 12.

And I see cpuqs for CPUs 13, 6, 10, 5, 4, and 11.

No idea what happened to CPUs 3, 8, 9, 14, and 15. Maybe the "dti" for those did not show up in the trace?

However, I see that CPU 7 did this:
[ 2816.205277] <idle>-0 7d.... 1237016284us : sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_torture_rea next_pid=149 next_prio=139


and then did this about 3 seconds later:
[ 2819.394111] rcu_tort-149 7dNs.. 1237025699us : rcu_grace_period: rcu_sched 144677 cpuend
[ 2819.412745] rcu_tort-149 7dNs.. 1237025699us : rcu_grace_period: rcu_sched 144681 cpustart


Which means it never switched out of the CPU all this while. Neither did it report a cpuqs. OK, it did notice the new GP started; welcome to the party, mate ;)

That points to the issue, I think. I wonder if the emergency provisions for forcing quiescent states on NOHZ_FULL CPUs kicked in. ISTR we had issues with that in the past, where we had to fix the tick path to report a QS.
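
Roughly the mechanism I mean, as a simplified model (none of the identifiers
below are the real kernel/rcu/tree.c names):

	/* Simplified model of the escape hatch; NOT the actual tree-RCU code.
	 * Once a grace period is overdue, the force-quiescent-state loop
	 * flags the holdout CPU; that CPU's next scheduler-clock tick should
	 * then either report a QS or push the running reader off the CPU so
	 * the resulting context switch reports one.
	 */
	static void sched_tick_qs_model(void)
	{
		if (!this_cpu_read(urgent_qs_requested))	/* hypothetical flag */
			return;
		if (!reader_active_on_this_cpu())		/* hypothetical check */
			report_cpu_quiescent_state();		/* hypothetical report */
		else
			set_tsk_need_resched(current);		/* real helper; forces a switch */
	}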

I'll add some more traces, especially around preempted readers and the emergency provisions for forcing a QS, and see if I can dig up more information.

thanks,

- Joel