Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

From: Herrenschmidt, Benjamin
Date: Sun Mar 22 2020 - 01:02:18 EST


On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> Balbir,
>
> Balbir Singh <sblbir@xxxxxxxxxx> writes:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache. The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilities, to flush their L1D on being
> > switched out. This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
>
> What you mean is that L1D is flushed before another task which does not
> belong to the same trust zone returns to user space or enters guest
> mode.

To some extent, though we haven't really got a concept of "trust zone"
at this point, so we assume an mm will do. Typically it's going to be
tasks that manipulate sensitive data (customer data for example) that
want to have extra protection against other tasks (or containers) in
the system.

One use case is language runtimes running different customers' programs
in different tasks, which need to be protected from snooping on each
other.

> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), the second scenario is a malicious task coming in (not so well
> > trusted) for which we want to clean up the cache before it starts
> > execution. The latter was proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
>
> What's the point? The attack surface is the L1D content of the scheduled
> out task. If the malicious task schedules out, then why would you care?
>
> I might be missing something, but AFAICT this is beyond paranoia.

So I was just suggesting the symmetric variant. I.e. the original issue
we are trying to address is a "high value" task whose data we want to
prevent being snooped by a "low value" environment (i.e. another task).
Thus flush when switching out.

I was suggesting the other way around might be useful for other
reasons: a task running untrusted code (i.e. your browser's JavaScript
interpreter) which you want to prevent from having any chance of
snooping the L1D of other things in your system. In this case, you want
a flush on the way into the task to protect everything else.

Both are opt-in.
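
From the application side, I'd expect usage to look roughly like the
sketch below (the prctl option and flag names are made up here purely
for illustration, not what Balbir's patch implements):

  #include <sys/prctl.h>

  /* Hypothetical option/flag values, made up for illustration only */
  #define PR_SET_L1D_FLUSH  1000  /* placeholder, not a real prctl number */
  #define L1D_FLUSH_OUT     0x1   /* flush when this task is switched out  */
  #define L1D_FLUSH_IN      0x2   /* flush before this task starts running */

  int main(void)
  {
          /* "High value" task protecting its data from whatever runs next */
          prctl(PR_SET_L1D_FLUSH, L1D_FLUSH_OUT, 0, 0, 0);

          /* A sandbox running untrusted code would instead ask for a flush
           * on the way in, protecting everything else from it:
           *     prctl(PR_SET_L1D_FLUSH, L1D_FLUSH_IN, 0, 0, 0);
           */
          return 0;
  }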

> > The points of discussion/review are:
> >
> > 1. Discuss the use case and the right approach to address this
>
> It might be the quick fix for the next not yet known variant of L1D
> based horrors, so yes it's at least worth to discuss it.

It also addresses existing variants for which we have workarounds for VMs
but not necessarily for processes/containers. But yes, the idea here is
to have a somewhat "blanket" protection of data. It's never perfect
(branch caches etc...) but it's useful when we know we are running some
tasks with sensitive data and are willing to pay the price on context
switch for these.

> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> > arches care about something similar?
>
> No idea, but I assume in the light of L1D based issues that might be the
> case. Though that still is per architecture as the L1D flush mechanisms
> are architecture specific. Aside of that I don't think that more than a
> few will enable / support it.

We should strive to define the prctl itself in as arch-agnostic a way
as possible. I suspect we'll want the same for ARM64 soon :-) This is
in line with what Kees suggested in an earlier response. The
implementation remains arch specific.
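
Something like the sketch below is what I have in mind: a generic prctl
option handled in kernel/sys.c that just calls into an arch hook (the
option and hook names here are invented for the sake of discussion):

  /* Generic side (e.g. kernel/sys.c), sketch only: an arch that does not
   * implement the hook falls back to the weak default and returns -EINVAL.
   */
  int __weak arch_set_l1d_flush(struct task_struct *tsk, unsigned long enable)
  {
          return -EINVAL;
  }

  /* ... and in the prctl() option switch: */
  case PR_SET_L1D_FLUSH:          /* hypothetical generic option */
          if (arg3 || arg4 || arg5)
                  return -EINVAL;
          error = arch_set_l1d_flush(me, arg2);
          break;

On x86 the hook would set/clear the TIF flag and set up the flush pages;
arm64 could wire up its own flush primitive later without touching the
generic interface.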

> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > we don't prefetch the TLB, is that sufficient?
>
> If we go there, then the KVM L1D flush code wants to move into general
> x86 code.
>
> > 4. The atomics can be improved and we could use a static key like ibpb
> > does to optimize the code path
>
> Yes to static key.
>
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> > is bit 32), we could generalize it with some effort
>
> Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
> push that to 32?
>
> > 6. Should we consider cleaning up the L1D on arrival of tasks?
>
> That's the Ben idea, right? If so see above.
>
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + struct page *page;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > + goto done;
> > +
> > + mutex_lock(&l1d_flush_mutex);
> > + if (l1d_flush_pages)
> > + goto done;
> > + /*
> > + * These pages are never freed, we use the same
> > + * set of pages across multiple processes/contexts
> > + */
> > + page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > + if (!page)
> > + return;
> > +
> > + l1d_flush_pages = page_address(page);
> > + /* I don't think we need to worry about KSM */
>
> Why not? Even if it wouldn't be necessary why would we care as this is a
> once per boot operation in fully preemptible code.
>
> Aside of that why do we need another pointlessly different copy of what
> we have in the VMX code?
>
> > +done:
> > + set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + mutex_unlock(&l1d_flush_mutex);
> > +}
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + smp_mb__after_atomic();
>
> Lacks an explanation / comment what this smp_mb is for and where the
> counterpart lives.
>
> Aside of that, this is pointless AFAICT. Disable/enable is really not a
> concern of being perfect. If you want perfect protection, simply switch
> off your computer.
>
> > +/*
> > + * Flush the L1D cache for this CPU. We want to do this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > + struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > + int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > + if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > + goto check_next;
> > +
> > + if (real_prev == next)
> > + return;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > + goto done;
> > + }
> > +
> > + asm volatile(
> > + /* Fill the cache */
> > + "xorl %%eax, %%eax\n"
> > + ".Lfill_cache:\n"
> > + "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > + "addl $64, %%eax\n\t"
> > + "cmpl %%eax, %[size]\n\t"
> > + "jne .Lfill_cache\n\t"
> > + "lfence\n"
> > + :: [flush_pages] "r" (l1d_flush_pages),
> > + [size] "r" (size)
> > + : "eax", "ecx");
>
> Yet moar copied code slightly different for no reason.
>
> > +
> > +done:
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > + /* Make sure we clear the values before we set it again */
> > + barrier();
> > +check_next:
> > + if (tsk == NULL)
> > + return;
> > +
> > + /* Match the set/clear_bit barriers */
> > + smp_rmb();
>
> What for again?
>
> > + /* We don't need stringent checks as we opt-in/opt-out */
> > + if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> > +
> > void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > struct task_struct *tsk)
> > {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > }
> >
> > + l1d_flush(next, tsk);
>
> This is really the wrong place. You want to do that:
>
> 1) Just before return to user space
> 2) When entering a guest
>
> and only when the previously running user space task was the one which
> requested this massive protection.

Why? switch_mm will not do anything when switching to/from kernel
threads, right? Or did things change since I last looked at it
(granted, a while ago...)? It seems like a reasonable place and
definitely less hot than the return-to-userspace path.

> While it's worth to discuss, I'm not yet convinced that this is worth
> the trouble.

It's a pretty serious request for us; people really want this when
running different processes that belong to different customers and can
today demonstrably spy on each other's caches. Think of containers, for
example. VMs are a solution, since we have such workarounds in KVM I
believe, but they are much heavier weight.

Cheers,
Ben.


> Thanks,
>
> tglx