Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

From: Herrenschmidt, Benjamin
Date: Sun Mar 22 2020 - 01:08:45 EST


On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> Balbir,
>
> "Singh, Balbir" <sblbir@xxxxxxxxxx> writes:
> > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> > > What's the point? The attack surface is the L1D content of the scheduled
> > > out task. If the malicious task schedules out, then why would you care?
> > >
> > > I might be missing something, but AFAICT this is beyond paranoia.
> > >
> >
> > I think there are two cases
> >
> > 1. Task with important data schedules out
> > 2. Malicious task schedules in
> >
> > These patches address 1, but call out case #2
>
> The point is if the victim task schedules out, then there is no reason
> to flush L1D immediately in context switch. If that just schedules a
> kernel thread and then goes back to the task, then there is no point
> unless you do not even trust the kernel thread.

A switch to a kernel thread will not call switch_mm, will it ? At least it used not to...

> > > > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > > > we don't prefetch the TLB, is that sufficient?
> > >
> > > If we go there, then the KVM L1D flush code wants to move into general
> > > x86 code.
> >
> > OK.. we can definitely consider reusing code, but I think the KVM bits require
> > tlb prefetching, IIUC before cache flush to negate any bad translations
> > associated with an L1TF fault, but the code/comments are not clear on the need
> > to do so.
>
> I forgot the gory details by now, but having two entry points or a
> conditional and share the rest (page allocation etc.) is definitely
> better than two slightly different implementation which basically do the same thing.
>
> > > > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > > > +{
> > > > + struct page *page;
> > > > +
> > > > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > > > + goto done;
> > > > +
> > > > + mutex_lock(&l1d_flush_mutex);
> > > > + if (l1d_flush_pages)
> > > > + goto done;
> > > > + /*
> > > > + * These pages are never freed, we use the same
> > > > + * set of pages across multiple processes/contexts
> > > > + */
> > > > + page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > > > + if (!page)
> > > > + return;
> > > > +
> > > > + l1d_flush_pages = page_address(page);
> > > > + /* I don't think we need to worry about KSM */
> > >
> > > Why not? Even if it wouldn't be necessary why would we care as this is a
> > > once per boot operation in fully preemptible code.
> >
> > Not sure I understand your question, I was stating that even if KSM was
> > running, it would not impact us (with dedup), as we'd still be writing out 0s
> > to the cache line in the fallback case.
>
> I probably confused myself vs. the comment in the VMX code, but that
> mentions nested virt. Needs at least some consideration when we reuse
> that code.
>
> > > > void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > > struct task_struct *tsk)
> > > > {
> > > > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > > > mm_struct *next,
> > > > trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > > > }
> > > >
> > > > + l1d_flush(next, tsk);
> > >
> > > This is really the wrong place. You want to do that:
> > >
> > > 1) Just before return to user space
> > > 2) When entering a guest
> > >
> > > and only when the previously running user space task was the one which
> > > requested this massive protection.
> > >
> >
> > Cases 1 and 2 are handled via
> >
> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
>
> How so? SWAPGS mitigation does not flush L1D. It merily serializes SWAPGS.

> > 2. L1TF fault handling
> >
> > This mechanism allows for flushing not restricted to 1 or 2, the idea is to
> > immediately flush L1D for paranoid processes on mm switch.
>
> Why? To protect the victim task against the malicious kernel?

Mostly malicious other tasks for us. As I said, I don't think switch_mm
is called on switching to a kernel thread and is definitely a colder
path than the return to userspace, so it felt like the right place to
put this, but I don't mind if you prefer it elsewhere as long as it
does the job which is to prevent task B to snoop task A data.

> The L1D content of the victim is endangered in the following case:
>
> victim out -> attacker in
>
> The attacker can either run in user space or in guest mode. So the flush
> is only interesting when the attacker actually goes back to user space
> or reenters the guest.
>
> The following is completely uninteresting:
>
> victim out -> kernel thread in/out -> victim in

Sure but will that call switch_mm to be called ?

> Even this is uninteresting:
>
> victim in -> attacker in (stays in kernel, e.g. waits for data) ->
> attacker out -> victim in

I don't get this ... how do you get attacker_in without victim_out
first ? In which case you have a victim_out -> attacker_in transition
which is what we are trying to protect.

I still think flushing the "high value" process L1D on switch_mm out is
the way to go here...

> So the point where you want to flush conditionally is when the attacker
> leaves kernel space either to user mode or guest mode.
>
> So if the victim schedules out it sets a per cpu request to flush L1D
> on the borders.
>
> And then you have on return to user:
>
> if (this_cpu_flush_l1d())
> flush_l1d()
>
> and in kvm:
>
> if (this_cpu_flush_l1d() || L1TF_flush_L1D)
> flush_l1d()
>
> The request does:
>
> if (!this_cpu_read(l1d_flush_for_task))
> this_cpu_write(l1d_flush_for_task, current)
>
> The check does:
>
> p = this_cpu_read(l1d_flush_for_task);
> if (p) {
> this_cpu_write(l1d_flush_for_task, NULL);
> return p != current;
> }
> return false;
>
> Hmm?
>
> Thanks,
>
> tglx
>
>