Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

From: Singh, Balbir
Date: Sun Mar 22 2020 - 20:38:26 EST


Hi, Thomas,

On Sat, 2020-03-21 at 11:05 +0100, Thomas Gleixner wrote:
>
>
> Balbir,
>
> "Singh, Balbir" <sblbir@xxxxxxxxxx> writes:
> > On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> > > I forgot the gory details by now, but having two entry points or a
> > > conditional and share the rest (page allocation etc.) is definitely
> > > better than two slightly different implementations which basically do the
> > > same thing.
> >
> > OK, I can try and dedup them to the extent possible, but please do
> > remember that
> >
> > 1. KVM is usually loaded as a module
> > 2. KVM is optional
> >
> > We can share code, by putting the common bits in the core kernel.
>
> Obviously so.
>
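
Just to be concrete about the dedup: what I have in mind is one helper in
core x86 code that both KVM and the context switch path call into, with the
software-fallback buffer allocated once. A very rough sketch, all names
invented (not a proposed interface):

/* Sketch only -- names are placeholders. */

static void *l1d_flush_pages;   /* 64K buffer for the software fallback */

int l1d_flush_init_once(void)
{
        if (static_cpu_has(X86_FEATURE_FLUSH_L1D) || l1d_flush_pages)
                return 0;

        l1d_flush_pages = alloc_pages_exact(64 * 1024, GFP_KERNEL);
        return l1d_flush_pages ? 0 : -ENOMEM;
}

void l1d_flush(void)
{
        if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
                wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
                return;
        }
        /*
         * Software fallback: fill the L1D by reading through the buffer,
         * the same way vmx_l1d_flush() does today.
         */
        l1d_flush_sw(l1d_flush_pages);
}

KVM would then call l1d_flush() on its vmenter path and the context switch
code would call it from switch_mm(), instead of each carrying its own copy.
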
> > > > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> > >
> > > How so? SWAPGS mitigation does not flush L1D. It merely serializes
> > > SWAPGS.
> >
> > Sorry, my bad, I was thinking of MDS_CLEAR (via verw), which does flush
> > out things and which I suspect should be sufficient from a return to
> > user/signal handling, etc. perspective.
>
> MDS is affecting store buffers, fill buffers and load ports. Different
> story.
>

Yes, what gets me is that the Intel deep dive (
https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling
) says, "The VERW instruction and L1D_FLUSH command will overwrite the
store buffer value for the current logical processor on processors affected by
MSBDS". In my mind that makes VERW equivalent to L1D_FLUSH as far as store
buffers are concerned, hence my assumption. It could be that L1D_FLUSH is a
superset, but that is not clear, and I can't find any other documentation on
the MSRs and microcode.
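
For my own clarity, these are the two operations I am comparing (modelled
on the existing mds_clear_cpu_buffers() and the KVM L1TF flush; the function
names here are made up purely for illustration):

/* VERW based clearing, as done for MDS on the return-to-user path. */
static inline void clear_cpu_buffers_verw(void)
{
        static const u16 ds = __KERNEL_DS;

        /*
         * The memory-operand form of VERW triggers the microcode assisted
         * overwrite of store buffers, fill buffers and load ports.
         */
        asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}

/* MSR based L1D flush, as used by the KVM L1TF mitigation. */
static inline void flush_l1d_hw(void)
{
        /* Writes back and invalidates the L1 data cache. */
        wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
}

The question is only whether the second is a strict superset of the first
with respect to store buffers.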

> > Right now, reading through
> > https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
> > , it does seem like we need this during a context switch, specifically
> > since a dirty cache line can cause snooped reads for the attacker to
> > leak data. Am I missing anything?
>
> Yes. The way this goes is:
>
> CPU0                   CPU1
>
> victim1
>   store secret
> victim2
> attacker               read secret
>
> Now if L1D is flushed on CPU0 before attacker reaches user space,
> i.e. reaches the attack code, then there is nothing to see. From the
> link:
>
> Similar to the L1TF VMM mitigations, snoop-assisted L1D sampling can be
> mitigated by flushing the L1D cache between when secrets are accessed
> and when possibly malicious software runs on the same core.
>
> So the important point is to flush _before_ the attack code runs which
> involves going back to user space or guest mode.

I think there is a more generic case with HT, which you've highlighted below.

>
> > > Even this is uninteresting:
> > >
> > > victim in -> attacker in (stays in kernel, e.g. waits for data) ->
> > > attacker out -> victim in
> > >
> >
> > Not from what I understand from the link above, the attack is a function
> > of what can be snooped by another core/thread, and that is a function of
> > what modified secrets are in the cache line/store buffer.
>
> Forget HT. That's not fixable by any flushing simply because there is no
> scheduling involved.
>
> CPU0 HT0               CPU0 HT1                CPU1
>
> victim1                attacker
>   store secret
> victim2
>                        read secret
>
> > On return to user, we already use VERW (verw), but just return to user
> > protection is not sufficient IMHO. Based on the link above, we need to
> > clear the L1D cache before it can be snooped.
>
> Again. Flush is required between store and attacker running attack
> code. The attacker _cannot_ run attack code while it is in the kernel so
> flushing L1D on context switch is just voodoo.
>
> If you want to cure the HT case with core scheduling then the scenario
> looks like this:
>
> CPU0 HT0               CPU0 HT1                CPU1
>
> victim1                IDLE
>   store secret
>     -> IDLE
> attacker in            victim2
> read secret
>
> And yes, there the context switch flush on HT0 prevents it. So this can
> be part of a core scheduling based mitigation or handled via a per core
> flush request.
>
> But HT is attackable in so many ways ...

I think the reason you prefer flushing on exit to user, as opposed to in
switch_mm() (switching task groups/threads), is the lower overhead. The
reasons I prefer switch_mm() are (rough sketch below):

1. The overhead is not paid by all tasks, since the L1D flush is opt-in.
2. It is more generic and does not make assumptions about a specific attack
   scenario.
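
Something along these lines is what I have in mind for the switch_mm() /
context switch path; the TIF flag, the helper and the exact hook point are
placeholders, and the prctl() opt-in plumbing is left out:

/*
 * Called from the context switch path when switching away from @prev.
 * Sketch only: TIF_L1D_FLUSH is a placeholder name.
 */
static void maybe_flush_l1d(struct task_struct *prev)
{
        /* Only tasks that opted in (e.g. via a prctl) pay the cost. */
        if (!test_tsk_thread_flag(prev, TIF_L1D_FLUSH))
                return;

        l1d_flush();    /* the shared helper sketched above: MSR or sw fallback */
}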


>
> Thanks,
>
> tglx


Thanks for the review,
Balbir Singh.