BUG : PowerPC RCU: torture test failed with __stack_chk_fail

From: Zhouyi Zhou
Date: Sat Apr 22 2023 - 08:47:03 EST


Dear PowerPC and RCU developers:
During an RCU torture test on mainline (run on a VM at the Open Source Lab
of Oregon State University), SRCU-P failed with __stack_chk_fail:
[ 264.381952][ T99] [c000000006c7bab0] [c0000000010c67c0] dump_stack_lvl+0x94/0xd8 (unreliable)
[ 264.383786][ T99] [c000000006c7bae0] [c00000000014fc94] panic+0x19c/0x468
[ 264.385128][ T99] [c000000006c7bb80] [c0000000010fca24] __stack_chk_fail+0x24/0x30
[ 264.386610][ T99] [c000000006c7bbe0] [c0000000002293b4] srcu_gp_start_if_needed+0x5c4/0x5d0
[ 264.388188][ T99] [c000000006c7bc70] [c00000000022f7f4] srcu_torture_call+0x34/0x50
[ 264.389611][ T99] [c000000006c7bc90] [c00000000022b5e8] rcu_torture_fwd_prog+0x8c8/0xa60
[ 264.391439][ T99] [c000000006c7be00] [c00000000018e37c] kthread+0x15c/0x170
[ 264.392792][ T99] [c000000006c7be50] [c00000000000df94] ret_from_kernel_thread+0x5c/0x64
The kernel config file can be found in [1].
I also wrote a bash script to make the bug reproduce faster [2].
After a week of debugging, I believe the cause is that r10, the register
used for the stack overflow check, holds a copy of r13 that may no longer
match r13 after a context switch.
The assembly code for srcu_gp_start_if_needed is located at [3]:
c000000000226eb4: 78 6b aa 7d mr r10,r13
c000000000226eb8: 14 42 29 7d add r9,r9,r8
c000000000226ebc: ac 04 00 7c hwsync
c000000000226ec0: 10 00 7b 3b addi r27,r27,16
c000000000226ec4: 14 da 29 7d add r9,r9,r27
c000000000226ec8: a8 48 00 7d ldarx r8,0,r9
c000000000226ecc: 01 00 08 31 addic r8,r8,1
c000000000226ed0: ad 49 00 7d stdcx. r8,0,r9
c000000000226ed4: f4 ff c2 40 bne- c000000000226ec8 <srcu_gp_start_if_needed+0x1c8>
c000000000226ed8: 28 00 21 e9 ld r9,40(r1)
c000000000226edc: 78 0c 4a e9 ld r10,3192(r10)
c000000000226ee0: 79 52 29 7d xor. r9,r9,r10
c000000000226ee4: 00 00 40 39 li r10,0
c000000000226ee8: b8 03 82 40 bne c0000000002272a0 <srcu_gp_start_if_needed+0x5a0>
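While debugging, I saw that r10 is assigned from r13 at c000000000226eb4.
If a context switch happens between that instruction and the load through
r10 at c000000000226edc, the value loaded may differ from the one saved on
the stack at 40(r1), and a false positive is reported.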
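
The suspected sequence can be modeled in user space. The C sketch below is
only an illustration under stated assumptions, not kernel code. It assumes
that r13 points to a per-CPU area whose canary slot holds the canary of the
task currently running on that CPU, and that the context switch moves the
task to another CPU. The names percpu_area, task, switch_in, read_r13 and
canary_check_passes are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical stand-in for the per-CPU area that r13 is assumed to point to. */
struct percpu_area {
    uint64_t canary;    /* canary of the task currently running on this CPU */
};

static struct percpu_area cpu_area[NR_CPUS];

/* Hypothetical task model: its own canary and the CPU it currently runs on. */
struct task {
    uint64_t canary;
    int cpu;
};

/* Model of a context switch that puts task t on the given CPU. */
static void switch_in(struct task *t, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    t->cpu = cpu;
    cpu_area[cpu].canary = t->canary;
}

/* Model of reading r13: the per-CPU area of the CPU the task runs on now. */
static struct percpu_area *read_r13(const struct task *t)
{
    return &cpu_area[t->cpu];
}

/*
 * Returns true if the canary check passes.
 * reload_r13 == false models the listing: the base is copied once
 * (mr r10,r13) and reused for the canary load after a context switch.
 * reload_r13 == true re-reads the base just before the load.
 */
bool canary_check_passes(bool reload_r13)
{
    struct task a = { .canary = 0x1111111111111111ULL, .cpu = 0 };
    struct task b = { .canary = 0x2222222222222222ULL, .cpu = 0 };

    switch_in(&a, 0);
    uint64_t saved_on_stack = read_r13(&a)->canary; /* prologue copy, like 40(r1) */
    struct percpu_area *r10 = read_r13(&a);         /* mr r10,r13 */

    /* Context switch between the copy and the load: A moves to CPU 1, B takes CPU 0. */
    switch_in(&a, 1);
    switch_in(&b, 0);

    if (reload_r13)
        r10 = read_r13(&a);                         /* fresh base for the load */

    uint64_t guard = r10->canary;                   /* ld r10,3192(r10) */
    return (saved_on_stack ^ guard) == 0;           /* xor. r9,r9,r10 */
}
```

In this model, canary_check_passes(false) returns false because the stale
base now refers to an area holding another task's canary, while
canary_check_passes(true) returns true. Whether the real failure follows
exactly this path is part of what I would like to have confirmed.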

[1] http://154.220.3.115/logs/0422/configformainline.txt
[2] http://154.220.3.115/logs/0422/whilebash.sh
[3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt

My analysis and debugging may not be correct, but the bug is easily
reproducible.

Thanks
Zhouyi