Re: selftests: ftrace: Internal error: Oops: sve_save_state

From: Mark Brown
Date: Wed Dec 20 2023 - 20:06:45 EST


On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:

> We have been seeing this problem in other instances, specifically on
> the following kernels:
> * 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
> * 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
> * 6.3.10, 6.3.11
> * 6.4.7
> * 6.5.2, 6.5.10-rc2

This is a huge range of kernels with some substantial reworkings of
the FP code, and I do note that v5.15 appears to have backported only
one change there (an incidental one related to ESR handling). This
makes me think this is likely to be something that's been sitting there
for a very long time and is unrelated to those specific versions or any
changes that went into them. I see you're still testing back to v4.19,
and since the oldest affected kernel in your list is v5.15 that
suggests an issue introduced between v5.10 and v5.15. My change
cccb78ce89c45a4 ("arm64/sve: Rework SVE access trap to convert state in
registers") does jump out there, though I don't immediately see what the
issue would be.

Looking at the list of versions you've posted, the earliest is from the
very end of June with others in July. Did something change in your test
environment at around that time? I see that the logs you and Naresh
posted both use a Debian 12/Bookworm based root filesystem, which was
released a couple of weeks before this started appearing. Bookworm
introduced glibc usage of SVE, which makes SVE usage much more common.
Is this perhaps tied to you upgrading your root filesystems to Bookworm,
or were you tracking testing before then?

> Most recent case is for the current 5.15 RC. Decoded stack trace is here:
> -----8<-----
> <4>[ 29.297166] ------------[ cut here ]------------
> <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at
> arch/arm64/kernel/fpsimd.c:950 do_sve_acc
> (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))

That's an assert that we shouldn't take an SVE trap when SVE is
already enabled for the thread. The backtrace Naresh originally
supplied was a NULL pointer dereference attempting to save SVE state
(indicating that we think we're trying to save SVE state but don't have
any storage allocated for it) during thread switch. It's very plausible
that the two are the same underlying issue but it's also not 100% a
given. Can you double check exactly how similar the various issues you
are seeing are please?

I have coincidentally been chasing some other stuff in the past week or
two which might potentially be different manifestations of the same
underlying issue with current code, broadly in the area of the register
state and task state getting out of sync.
