Re: BUG : PowerPC RCU: torture test failed with __stack_chk_fail

From: Joel Fernandes
Date: Sat Apr 22 2023 - 15:22:21 EST


Hi Zhouyi,

On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
>
> Dear PowerPC and RCU developers:
> During an RCU torture test on mainline (on a VM at the Oregon State
> University Open Source Lab), SRCU-P failed with __stack_chk_fail:
> [ 264.381952][ T99] [c000000006c7bab0] [c0000000010c67c0] dump_stack_lvl+0x94/0xd8 (unreliable)
> [ 264.383786][ T99] [c000000006c7bae0] [c00000000014fc94] panic+0x19c/0x468
> [ 264.385128][ T99] [c000000006c7bb80] [c0000000010fca24] __stack_chk_fail+0x24/0x30
> [ 264.386610][ T99] [c000000006c7bbe0] [c0000000002293b4] srcu_gp_start_if_needed+0x5c4/0x5d0
> [ 264.388188][ T99] [c000000006c7bc70] [c00000000022f7f4] srcu_torture_call+0x34/0x50
> [ 264.389611][ T99] [c000000006c7bc90] [c00000000022b5e8] rcu_torture_fwd_prog+0x8c8/0xa60
> [ 264.391439][ T99] [c000000006c7be00] [c00000000018e37c] kthread+0x15c/0x170
> [ 264.392792][ T99] [c000000006c7be50] [c00000000000df94] ret_from_kernel_thread+0x5c/0x64
> The kernel config file can be found in [1].
> I also wrote a bash script to reproduce the bug more quickly [2].
> After a week of debugging, I found that the cause of the bug is that
> the register r10, which is used in the stack overflow check, is not
> preserved as a valid copy of r13 across context switches.
> The assembly code for srcu_gp_start_if_needed is located at [3]:
> c000000000226eb4: 78 6b aa 7d mr r10,r13
> c000000000226eb8: 14 42 29 7d add r9,r9,r8
> c000000000226ebc: ac 04 00 7c hwsync
> c000000000226ec0: 10 00 7b 3b addi r27,r27,16
> c000000000226ec4: 14 da 29 7d add r9,r9,r27
> c000000000226ec8: a8 48 00 7d ldarx r8,0,r9
> c000000000226ecc: 01 00 08 31 addic r8,r8,1
> c000000000226ed0: ad 49 00 7d stdcx. r8,0,r9
> c000000000226ed4: f4 ff c2 40 bne- c000000000226ec8 <srcu_gp_start_if_needed+0x1c8>
> c000000000226ed8: 28 00 21 e9 ld r9,40(r1)
> c000000000226edc: 78 0c 4a e9 ld r10,3192(r10)
> c000000000226ee0: 79 52 29 7d xor. r9,r9,r10
> c000000000226ee4: 00 00 40 39 li r10,0
> c000000000226ee8: b8 03 82 40 bne c0000000002272a0 <srcu_gp_start_if_needed+0x5a0>
> While debugging, I saw that r10 is assigned from r13 at
> c000000000226eb4. If a context switch happens before the load at
> c000000000226edc, a false positive is reported.
>
> [1] http://154.220.3.115/logs/0422/configformainline.txt
> [2] 154.220.3.115/logs/0422/whilebash.sh
> [3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt
>
> My analysis and debugging may not be correct, but the bug is easily
> reproducible.

Could you provide the full kernel log? I cannot tell for certain from
your attachments, but the __stack_chk_fail in the trace suggests a
stack overflow. One thing to try is enabling any available
stack-debugging kernel config options, or checking the kernel log for
messages about running low on stack space.
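
To make the two possibilities concrete, here is a rough userspace model
of the canary check. It is only a sketch, not kernel code:
struct percpu_area, switch_in() and canary_ok() are made-up names, and
it assumes that the per-CPU area r13 points to holds the canary of
whichever task is currently running on that CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical stand-in for the per-CPU area that r13 points to. */
struct percpu_area {
	uint64_t canary;	/* canary of the task running on this CPU */
};

static struct percpu_area cpus[NR_CPUS];

/* Model of a context switch: the CPU's slot follows the incoming task. */
void switch_in(int cpu, uint64_t task_canary)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].canary = task_canary;
}

/* Epilogue check; a false result is where __stack_chk_fail would be called. */
bool canary_ok(uint64_t frame_copy, const struct percpu_area *area)
{
	return (frame_copy ^ area->canary) == 0;
}

/* Case 1: a real overrun clobbers the copy saved in the stack frame. */
bool overflow_case_passes(void)
{
	switch_in(0, 0x1111);
	uint64_t frame_copy = cpus[0].canary;	/* saved in the prologue */
	frame_copy = 0x4141414141414141ULL;	/* overwritten by the overrun */
	return canary_ok(frame_copy, &cpus[0]);
}

/* Case 2: the per-CPU pointer is cached, then the task changes CPU. */
bool stale_pointer_case_passes(bool reload_pointer)
{
	const uint64_t task_a = 0xaaaa, task_b = 0xbbbb;

	switch_in(0, task_a);
	const struct percpu_area *cached = &cpus[0];	/* like "mr r10,r13" */
	uint64_t frame_copy = cached->canary;

	switch_in(1, task_a);	/* task A now runs on CPU 1 */
	switch_in(0, task_b);	/* task B takes over CPU 0 */

	if (reload_pointer)
		cached = &cpus[1];	/* re-read the pointer after the switch */
	return canary_ok(frame_copy, cached);
}
```

In this model, overflow_case_passes() and
stale_pointer_case_passes(false) both return false, so both would end
in __stack_chk_fail, while stale_pointer_case_passes(true) returns
true. The backtrace alone does not distinguish the two cases, which is
why the full log would help.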

Additionally, you could find out where the kernel crash happened in
the C code by following the notes I wrote [1] (see the section
"Figuring out where kernel crashes happen in C code"). The notes are
x86-specific but should be generally applicable (if you happen to
improve the notes, feel free to share your changes ;-)).
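
As a starting point for that lookup, each backtrace entry has the form
symbol+offset/size. Below is a small sketch of pulling those fields
apart; struct trace_sym and parse_trace_sym() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One "symbol+0xoffset/0xsize" entry from a kernel backtrace. */
struct trace_sym {
	char name[64];
	unsigned long offset;	/* offset of the return address in the function */
	unsigned long size;	/* total size of the function */
};

/* Returns 0 on success, -1 if text is not in "name+0xoff/0xsize" form. */
int parse_trace_sym(const char *text, struct trace_sym *out)
{
	assert(text && out);
	/* %lx accepts the leading "0x"; the name stops at the first '+'. */
	if (sscanf(text, "%63[^+]+%lx/%lx",
		   out->name, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}
```

For "srcu_gp_start_if_needed+0x5c4/0x5d0" this yields offset 0x5c4 in
a function of size 0x5d0. The symbol and offset can then be given to a
tool such as gdb (list *(srcu_gp_start_if_needed+0x5c4) against a
vmlinux built with debug info) or the kernel's scripts/faddr2line.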

Lastly, is there a specific kernel release where you first started
seeing this issue? If it reproduces easily on a newer release but not
on an older one, git bisect is worth trying.

I will also join you in your debugging efforts soon, though I am
currently in between conferences.

[1] https://gist.github.com/joelagnel/ae15c404facee0eb3ebb8aff0e996a68

thanks,

- Joel

>
> Thanks
> Zhouyi