Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

From: Linus Torvalds
Date: Mon Oct 16 2023 - 19:02:44 EST


On Mon, 16 Oct 2023 at 13:35, Nadav Amit <namit@xxxxxxxxxx> wrote:
>
> I have encountered several such issues before [1], and while some have been fixed,
> some have not (I looked at switch_fpu_finish()), and might under the right/wrong
> circumstances use the wrongly-“cached” current. Moreover, perhaps new problems
> have been added since my old patch.

Yeah, that fpu switching is disgusting and borderline buggy. And yes,
it would trigger problems when caching the value of 'current'.

I don't particularly love the patch you pointed at, because it only
fixed the switch_fpu_finish() case (presumably the one that actually
triggered issues), and even that's not a very pretty fix.

switch_fpu_prepare() has the exact same problem, and in fact is likely
the *source* of the issue, because that's the original "load current"
that then ends up being cached incorrectly later in __switch_to().

The whole

struct fpu *prev_fpu = &prev->fpu;

thing in __switch_to() is pretty ugly. There's no reason why we should
look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, because it reloads 'current' when
__switch_to() already has the right task pointers.

So the attached patch is, I think, the right thing to do. It may not
be the *complete* fix, but at least for the config I tested, this
makes all loads of 'current' go away in the resulting generated
assembly, and the only access to '%gs:pcpu_hot(%rip)' is the write to
update it:

movq %rbx, %gs:pcpu_hot(%rip)

from that

raw_cpu_write(pcpu_hot.current_task, next_p);

code.

Thomas, I think you've touched this code last, but even that isn't
very recent. The attached patch not only cleans this up, it actually
generates better code too:

(a) it removes one push/pop pair at entry/exit because there's one
less register used (no 'current')

(b) it removes that pointless load of 'current' because it just uses
the right argument:

- #APP
- movq %gs:pcpu_hot(%rip), %r12
- #NO_APP
- testq $16384, (%r12)
+ testq $16384, (%rdi)

so I think this is the right thing to do. I checked that the 32-bit
code builds and looks sane too.

I do think the 'old/new' naming in the FPU code should probably be
'prev/next' to match the switch_to() naming, but I didn't do that.

Comments?

Linus

 arch/x86/include/asm/fpu/sched.h | 10 ++++++----
 arch/x86/kernel/process_32.c     |  7 +++----
 arch/x86/kernel/process_64.c     |  7 +++----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index ca6e5e5f16b2..c485f1944c5f 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -37,10 +37,12 @@ extern void fpu_flush_thread(void);
  * The FPU context is only stored/restored for a user task and
  * PF_KTHREAD is used to distinguish between kernel and user threads.
  */
-static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
+static inline void switch_fpu_prepare(struct task_struct *old, int cpu)
 {
 	if (cpu_feature_enabled(X86_FEATURE_FPU) &&
-	    !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
+	    !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) {
+		struct fpu *old_fpu = &old->thread.fpu;
+
 		save_fpregs_to_fpstate(old_fpu);
 		/*
 		 * The save operation preserved register state, so the
@@ -60,10 +62,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Delay loading of the complete FPU state until the return to userland.
  * PKRU is handled separately.
  */
-static inline void switch_fpu_finish(void)
+static inline void switch_fpu_finish(struct task_struct *new)
 {
 	if (cpu_feature_enabled(X86_FEATURE_FPU))
-		set_thread_flag(TIF_NEED_FPU_LOAD);
+		set_tsk_thread_flag(new, TIF_NEED_FPU_LOAD);
 }
 
 #endif /* _ASM_X86_FPU_SCHED_H */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 708c87b88cc1..0917c7f25720 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -156,13 +156,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 {
 	struct thread_struct *prev = &prev_p->thread,
 			     *next = &next_p->thread;
-	struct fpu *prev_fpu = &prev->fpu;
 	int cpu = smp_processor_id();
 
 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
 
-	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
-		switch_fpu_prepare(prev_fpu, cpu);
+	if (!test_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD))
+		switch_fpu_prepare(prev_p, cpu);
 
 	/*
 	 * Save away %gs. No need to save %fs, as it was saved on the
@@ -209,7 +208,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	raw_cpu_write(pcpu_hot.current_task, next_p);
 
-	switch_fpu_finish();
+	switch_fpu_finish(next_p);
 
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in(next_p);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 33b268747bb7..1553e19904e0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -562,14 +562,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 {
 	struct thread_struct *prev = &prev_p->thread;
 	struct thread_struct *next = &next_p->thread;
-	struct fpu *prev_fpu = &prev->fpu;
 	int cpu = smp_processor_id();
 
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
 		     this_cpu_read(pcpu_hot.hardirq_stack_inuse));
 
-	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
-		switch_fpu_prepare(prev_fpu, cpu);
+	if (!test_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD))
+		switch_fpu_prepare(prev_p, cpu);
 
 	/* We must save %fs and %gs before load_TLS() because
 	 * %fs and %gs may be cleared by load_TLS().
@@ -623,7 +622,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	raw_cpu_write(pcpu_hot.current_task, next_p);
 	raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p));
 
-	switch_fpu_finish();
+	switch_fpu_finish(next_p);
 
 	/* Reload sp0. */
 	update_task_stack(next_p);