Re: selftests: ftrace: Internal error: Oops: sve_save_state

From: Daniel Díaz
Date: Tue Jan 30 2024 - 19:15:29 EST


Hello!

On 20/12/23 7:06 p. m., Mark Brown wrote:
On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:

We have been seeing this problem in other instances, specifically on
the following kernels:
* 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
* 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
* 6.3.10, 6.3.11
* 6.4.7
* 6.5.2, 6.5.10-rc2
[...]> Looking at the list of versions you've posted the earliest is from the
very end of June with others in July, was there something that changed
in your test environment in broadly that time?

Yes! Important to note here is that I started collecting that date around that time, second half of 2023. It's possible that the same problem was there before but I didn't register that.


I see that the
logs you and Naresh posted are both using a Debian 12/Bookworm based
root filesystem and that was released a couple of weeks before this
started appearing, Bookworm introduced glibc usage of SVE which makes
usage much more common. Is this perhaps tied to you upgrading your root
filesystems to Bookworm or were you tracking testing before then?

I don't have an easy way to swap the Debian version right now, I'll have to ask around, but that might be a good experiment.

This message appears so sporadically that it would be hard to determine with 100% certainty if the changes we make really fix the issue. We can try to figure out just how sporadic that error is, though, and go from there, but the numbers looks slim with this happening once every so many weeks.


Most recent case is for the current 5.15 RC. Decoded stack trace is here:
-----8<-----
<4>[ 29.297166] ------------[ cut here ]------------
<4>[ 29.298039] WARNING: CPU: 1 PID: 220 at
arch/arm64/kernel/fpsimd.c:950 do_sve_acc
(/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))

That's an assert that we shouldn't take a SVE trap when SVE is
alreadly enabled for the thread. The backtrace Naresh originally
supplied was a NULL pointer dereference attempting to save SVE state
(indicating that we think we're trying to save SVE state but don't have
any storage allocated for it) during thread switch. It's very plausible
that the two are the same underlying issue but it's also not 100% a
given. Can you double check exactly how similar the various issues you
are seeing are please?

I'm not sure of how similar the test cases are, but I'm inclined to think its occurrence is not related specifically to ftrace or one test case. It looks like these appear on FVP and Qemu-arm64, on several LTS branches, and while running different set of tests. On today's episode, we had it happening in the middle of ltp-controllers' cgroup_fj_stress:

-----8<-----
<4>[ 619.431962] ------------[ cut here ]------------
<4>[ 619.432530] WARNING: CPU: 0 PID: 10382 at arch/arm64/kernel/fpsimd.c:1403 do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.433889] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm dm_mod ip_tables x_tables
<4>[ 619.436649] CPU: 0 PID: 10382 Comm: cgroup_fj_stres Not tainted 6.1.76-rc2 #1
<4>[ 619.437435] Hardware name: linux,dummy-virt (DT)
<4>[ 619.438319] pstate: 42400009 (nZcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
<4>[ 619.439680] pc : do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.440003] lr : do_sve_acc (arch/arm64/kernel/fpsimd.c:227 arch/arm64/kernel/fpsimd.c:253 arch/arm64/kernel/fpsimd.c:1400)
<4>[ 619.440572] sp : ffff80000f3e3e40
<4>[ 619.441022] x29: ffff80000f3e3e40 x28: ffff0000c4e42100 x27: 0000000000000000
<4>[ 619.441860] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
<4>[ 619.442789] x23: 0000000080001000 x22: 0000ffffaa47a454 x21: 0000000000000001
<4>[ 619.443893] x20: ffffb6cd7e13c0b0 x19: ffff0000c4e42100 x18: 0000000000000000
<4>[ 619.445033] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
<4>[ 619.446017] x14: 0000000000000000 x13: ffff80000f3e0000 x12: ffff80000f3e4000
<4>[ 619.446991] x11: 25fdfcd46891cd00 x10: 0000000000000200 x9 : 0000000000000000
<4>[ 619.447614] x8 : 0000000000800000 x7 : 0000000000000000 x6 : 000000000000003f
<4>[ 619.448285] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000000
<4>[ 619.448963] x2 : 0000000000000008 x1 : ffffb6cd7c019eac x0 : ffffb6cd7c019eac
<4>[ 619.449852] Call trace:
<4>[ 619.450219] do_sve_acc (arch/arm64/kernel/fpsimd.c:1402)
<4>[ 619.450983] el0_sve_acc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:550)
<4>[ 619.451394] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:0)
<4>[ 619.451849] el0t_64_sync (arch/arm64/kernel/entry.S:585)
<4>[ 619.452333] ---[ end trace 0000000000000000 ]---
----->8-----

The full log, reproducers, artifacts, can be found here:

https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2bgeMqo59p7kgw5dKw0m5uNegzy

Greetings!

Daniel Díaz
daniel.diaz@xxxxxxxxxx