Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

From: Singh, Balbir
Date: Thu Mar 19 2020 - 21:41:00 EST


On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> Balbir,
>
> Balbir Singh <sblbir@xxxxxxxxxx> writes:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache. The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilities, to flush their L1D on being
> > switched out. This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
>
> What you mean is that L1D is flushed before another task which does not
> belong to the same trust zone returns to user space or enters guest
> mode.
>
> > There are two scenarios we might want to protect against, a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), the second scenario is a malicious task coming in (not so well
> > trusted) for which we want to clean up the cache before it starts
> > execution. The latter was proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
>
> What's the point? The attack surface is the L1D content of the scheduled
> out task. If the malicious task schedules out, then why would you care?
>
> I might be missing something, but AFAICT this is beyond paranoia.
>

I think there are two cases:

1. A task with important data schedules out
2. A malicious task schedules in

These patches address case #1; case #2 is called out but not yet addressed.


> > The points of discussion/review are:
> >
> > 1. Discuss the use case and the right approach to address this
>
> It might be the quick fix for the next not yet known variant of L1D
> based horrors, so yes it's at least worth to discuss it.
>

Yes, that is the broader thinking: there might be yet-to-be-discovered attack
vectors, and having something like this provides a good workaround.

> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> > arches care about something similar?
>
> No idea, but I assume in the light of L1D based issues that might be the
> case. Though that still is per architecture as the L1D flush mechanisms
> are architecture specific. Aside of that I don't think that more than a
> few will enable / support it.
>

Fair enough
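
For what it's worth, the user-space opt-in could look roughly like the sketch
below; the arch_prctl sub-command name and value are assumptions for
illustration, not taken from this patch:

/* Hypothetical opt-in from user space; ARCH_SET_L1D_FLUSH and its value
 * are assumed names for illustration only. */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_SET_L1D_FLUSH
#define ARCH_SET_L1D_FLUSH	0x1022	/* assumed value */
#endif

int main(void)
{
	/* ask the kernel to flush L1D whenever this task is switched out */
	syscall(SYS_arch_prctl, ARCH_SET_L1D_FLUSH, 1UL);

	/* ... process sensitive data ... */
	return 0;
}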

> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > we don't prefetch the TLB, is that sufficient?
>
> If we go there, then the KVM L1D flush code wants to move into general
> x86 code.

OK, we can definitely consider reusing that code, but IIUC the KVM bits require
TLB prefetching before the cache flush, to negate any bad translations
associated with an L1TF fault; the code/comments are not clear on the need
to do so.
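
For reference, the VMX fallback (vmx_l1d_flush()) does roughly the two passes
sketched below when the MSR is not available; this is a simplified C rendering
of its hand-written asm, not the actual code:

/* Simplified approximation of the VMX software fallback: populate the TLB
 * for the flush buffer first, then displace L1D by reading each cache line.
 * Illustration only, not the kernel implementation. */
static void l1d_flush_sw(const char *flush_pages, int size)
{
	const volatile char *p = flush_pages;
	int i;

	/* Pass 1: prefetch the TLB, one touch per 4K page */
	for (i = 0; i < size; i += PAGE_SIZE)
		(void)p[i];

	/* Pass 2: fill L1D, one read per 64-byte cache line */
	for (i = 0; i < size; i += 64)
		(void)p[i];
}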

>
> > 4. The atomics can be improved and we could use a static key like ibpb
> > does to optimize the code path
>
> Yes to static key.

Sure, will do
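
Roughly along the lines of the existing switch_mm_cond_ibpb key, I assume;
the names below are placeholders, not final (sketch only):

/* Sketch: a static key flipped once the first task opts in, so the common
 * switch_mm() path stays a NOP when nobody uses the feature. */
DEFINE_STATIC_KEY_FALSE(switch_mm_l1d_flush);

void enable_l1d_flush_for_task(struct task_struct *tsk)
{
	/* ... existing page allocation and TIF flag setup ... */
	static_branch_enable(&switch_mm_l1d_flush);
}

static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
{
	if (!static_branch_unlikely(&switch_mm_l1d_flush))
		return;	/* patched-out fast path */
	/* ... MSR or software flush as in the patch ... */
}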

>
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> > is bit 32), we could generalize it with some effort
>
> Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
> push that to 32?

I might have missed the holes; I can move to using one of them, thanks.
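
For example, something like the following in thread_info.h; the exact bit is
illustrative only:

#define TIF_L1D_FLUSH		18	/* Flush L1D on context switch (opt-in) */
#define _TIF_L1D_FLUSH		(1 << TIF_L1D_FLUSH)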

>
> > 6. Should we consider cleaning up the L1D on arrival of tasks?
>
> That's the Ben idea, right? If so see above.

Yes

>
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + struct page *page;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > + goto done;
> > +
> > + mutex_lock(&l1d_flush_mutex);
> > + if (l1d_flush_pages)
> > + goto done;
> > + /*
> > + * These pages are never freed, we use the same
> > + * set of pages across multiple processes/contexts
> > + */
> > + page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > + if (!page)
> > + return;
> > +
> > + l1d_flush_pages = page_address(page);
> > + /* I don't think we need to worry about KSM */
>
> Why not? Even if it wouldn't be necessary why would we care as this is a
> once per boot operation in fully preemptible code.

Not sure I understand your question. I was stating that even if KSM were
running, deduplication would not impact us, as in the fallback case we'd
still just be reading the zeroed pages into the cache lines.

>
> Aside of that why do we need another pointlessly different copy of what
> we have in the VMX code?
>
> > +done:
> > + set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + mutex_unlock(&l1d_flush_mutex);
> > +}
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + smp_mb__after_atomic();
>
> Lacks an explanation / comment what this smp_mb is for and where the
> counterpart lives.
>
> Aside of that, this is pointless AFAICT. Disable/enable is really not a
> concern of being perfect. If you want perfect protection, simply switch
> off your computer.

Yes, the counterpart is in l1d_flush(), which does an smp_rmb(), but the
comment does point out that it can and will be lazy, so we can safely remove
this bit if needed.

>
> > +/*
> > + * Flush the L1D cache for this CPU. We want to do this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > + struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > + int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > + if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > + goto check_next;
> > +
> > + if (real_prev == next)
> > + return;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > + goto done;
> > + }
> > +
> > + asm volatile(
> > + /* Fill the cache */
> > + "xorl %%eax, %%eax\n"
> > + ".Lfill_cache:\n"
> > + "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > + "addl $64, %%eax\n\t"
> > + "cmpl %%eax, %[size]\n\t"
> > + "jne .Lfill_cache\n\t"
> > + "lfence\n"
> > + :: [flush_pages] "r" (l1d_flush_pages),
> > + [size] "r" (size)
> > + : "eax", "ecx");
>
> Yet moar copied code slightly different for no reason.
>

It is different because of the need to avoid prefetching the TLB.

> > +
> > +done:
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > + /* Make sure we clear the values before we set it again */
> > + barrier();
> > +check_next:
> > + if (tsk == NULL)
> > + return;
> > +
> > + /* Match the set/clear_bit barriers */
> > + smp_rmb();
>
> What for again?

Yes, it is not needed (the comment below sort of says that); I can remove it.

>
> > + /* We don't need stringent checks as we opt-in/opt-out */
> > + if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> > +
> > void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > struct task_struct *tsk)
> > {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > mm_struct *next,
> > trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > }
> >
> > + l1d_flush(next, tsk);
>
> This is really the wrong place. You want to do that:
>
> 1) Just before return to user space
> 2) When entering a guest
>
> and only when the previously running user space task was the one which
> requested this massive protection.
>

Cases 1 and 2 are handled via

1. SWAPGS fixes/workarounds (unless I misunderstood your suggestion)
2. L1TF fault handling

This mechanism allows for flushing that is not restricted to 1 or 2; the idea
is to flush L1D immediately on mm switch for paranoid processes.

> While it's worth to discuss, I'm not yet convinced that this is worth
> the trouble.
>

Are you suggesting the mechanism is not worth building?

Balbir Singh.