Re: [PATCH stable 5.4,5.10] x86/fpu: Correct pkru/xstate inconsistency

From: Brian Geffon
Date: Thu Feb 17 2022 - 08:31:58 EST


On Wed, Feb 16, 2022 at 10:16 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 2/16/22 02:05, Greg KH wrote:
> >>> How was this tested, and what do the maintainers of this subsystem
> >>> think? And will you be around to fix the bugs in this when they are
> >>> found?
> >> This has been trivial to reproduce, I've used a small repro which I've
> >> put here: https://gist.github.com/bgaff/9f8cbfc8dd22e60f9492e4f0aff8f04f
> >> , I also was able to reproduce this using the protection_keys self
> >> tests on a 11th Gen Core i5-1135G7. I'm happy to commit to addressing
> >> any bugs that may appear. I'll see what the maintainers say, but there
> >> is also a smaller fix that just involves using this_cpu_read() in
> >> switch_fpu_finish() for this specific issue, although that approach
> >> isn't as clean.
> > Can you add the test to the in-kernel tests so that we make sure it is
> > fixed and never comes back?
>
> It would be great if Brian could confirm this. But, I'm 99% sure that
> this can be reproduced in the vm/protection_keys.c selftest, if you run
> it for long enough.

Hi Dave,
Yes, this is reproduced by the protection keys selfs tests. If your
kernel was compiled in a way which caches current_task when read via
this_cpu_read_stable(), then when switching from a kernel thread to a
user thread you will observe this behavior, so the only situation
where it's timing related is waiting for that switch from a kernel to
user thread. If your kernel is compiled in a way which does not cache
the value of current_task then you will never observe this behavior.
My understanding is that this is perfectly valid for the compiler to
produce that code.

>
> The symptom here is corruption of the PKRU register. I created *lots*
> of bugs like this during protection keys development so the selftest
> keeps a shadow copy of the register to specifically watch for corruption.
>
> It's _plausible_ that no one ever ran the pkey selftests with a
> clang-compiled kernel for long enough to hit this issue.

For ChromeOS we use clang and when I tested a vanilla 5.10 kernel
_built with clang_ it also reproduced, so I suspect you're right that
the self tests just haven't been run against clang built kernels all
that often.

How would you and Greg KH like to proceed with this? I'm happy to help
however I can.

Brian