Re: [PATCH] KVM: VMX: remove LFENCE in vmx_spec_ctrl_restore_host()

From: Andrew Cooper
Date: Mon Jun 05 2023 - 20:22:25 EST


On 05/06/2023 5:35 pm, Josh Poimboeuf wrote:
> On Mon, Jun 05, 2023 at 02:29:02PM +0000, Jon Kohler wrote:
>> Coming back to this, what if we hoisted call vmx_spec_ctrl_restore_host above
>> FILL_RETURN_BUFFER, and dropped this LFENCE as I did here?
>>
>> That way, we wouldn’t have to mess with the internal LFENCE in nospec-branch.h,
>> and that would act as the “final line of defense” LFENCE.
>>
>> Would that be acceptable? Or does FILL_RETURN_BUFFER *need* to occur
>> before any sort of calls no matter what?
> If we go by Intel's statement that only unbalanced RETs are a concern,
> that *might* be ok as long as there's a nice comment above the
> FILL_RETURN_BUFFER usage site describing the two purposes for the
> LFENCE.
>
> However, based on Andy's concerns, which I've discussed with him
> privately (but I'm not qualified to agree or disagree with), we may want
> to just convert vmx_spec_ctrl_restore_host() to asm. Better safe than
> sorry. My original implementation of that function was actually asm. I
> can try to dig up that code.

Lemme see if I can remember the whole safety position.  I've just gone
through a year's worth of notes and conversations, including the
following gems:

From Intel: "And then on top of that, the LFENCE in
vmx_spec_ctrl_restore_host would save you. Fingers crossed."
From me: "The how and why is important.  Not for right now, but for the
years to come when all of this logic inevitably gets
tweaked/refactored/moved".

Date-check says 11 months...


__vmx_vcpu_run() is a leaf function as far as the kernel stack is
concerned, so to a first approximation it behaves as follows:

    VMEXIT
    RET

The RET gets a prediction from the top of the RSB (a guest-controlled
value), but during execution the prediction is discovered to be wrong,
so a resteer occurs, causing it to restart from __vmx_vcpu_run()'s
caller.

Skipping the middle ground with a 32-entry RSB stuffing loop, we have
the following behaviour on eIBRS parts:

    VMEXIT (flush RSB if IBRS was 1 in guest)
    Restore Host MSR_SPEC_CTRL (flush RSB if IBRS was 0 in guest)
    RET

where the RET (in theory) encounters an RSB-empty condition and falls
back to an indirect prediction.

PBRSB is a missing corner case in the hardware RSB flush, which is only
corrected after one CALL instruction architecturally retires.

The problem with mitigating it, however, is that it is heavily entangled
with BCBS, so I refer you to
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/analysis-speculative-execution-side-channels.html#inpage-nav-undefined-1-undefined
which describes one example of how RETs can go wrong.

The critical observation is that, while for PBRSB it's the first
unbalanced RET which causes problems, as soon as *any* RET has executed
and got a bad resteer (even temporarily), you've lost all ability to
contain the damage.  So, to protect against PBRSB, one CALL instruction
must retire before *any* RET instruction executes.

Pawan's patch to turn the unconditional lfence into an else-lfence
should be safe seeing as Intel's guidance
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html
explicitly says you can use the WRMSR to restore the host MSR_SPEC_CTRL
value as the speculation barrier.

But, the safety of vmx_spec_ctrl_restore_host() in the first place
depends on the early return never ever becoming a conditional, and the
compiler never emitting a call to memcpy()/memset()/whatever behind your
back - something which is not prohibited by noinstr.

I hope this clears things up.

~Andrew