Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

From: Andy Lutomirski
Date: Thu Jan 25 2018 - 12:04:49 EST


On Thu, Jan 25, 2018 at 8:41 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
>> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
>> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>> > > >
>> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
>> > > > begin with.
>> > > >
>> > > > Please clarify how you think it does.
>> > > >
>> > >
>> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
>> > > for a tlb flush.
>> >
>> > The intel_idle code does, not the idle code. This is squirreled away in
>> > some driver :/
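
(Concretely, the hook in question is in drivers/idle/intel_idle.c and
looks roughly like the below; trimmed and paraphrased, not a verbatim
quote of the tree.)

static __cpuidle int intel_idle(struct cpuidle_device *dev,
				struct cpuidle_driver *drv, int index)
{
	struct cpuidle_state *state = &drv->states[index];
	int cpu = smp_processor_id();

	/*
	 * Drop the user mm before entering a deep C-state so this CPU
	 * doesn't have to be woken by TLB shootdown IPIs while idle.
	 */
	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
		leave_mm(cpu);

	/* ... mwait into the selected C-state ... */
}
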
>>
>> afaik (but haven't looked in a while) acpi drivers did too
>
> Only makes it worse.. drivers shouldn't be frobbing with things like
> this.
>
>> > > (trust me, you really want that: sequentially IPI'ing a pile of cores in
>> > > deep sleep states just to flush a TLB that's already empty is horrifically slow)
>> >
>> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
>> > piling special on special.
>> >
>> The problem was tricky, but of course if something better is possible, let's figure this out.
>
> How about something like the below? It boots with "nopcid" appended to
> the cmdline.
>
> Andy, could you pretty please have a look at this? This is fickle code
> at best and I'm sure I messed _something_ up.
>
> The idea is simple: do what we do for virt. Don't send IPIs to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark the
> CPU as having needed a TLB flush.

I haven't tried to fully decipher the patch, but I think the idea is
wrong. (I think it's the same wrong idea that Rik and I both had and
that I got into Linus' tree for a while...) The problem is that it's
not actually correct to run indefinitely in kernel mode using stale
cached page table data. The stale PTEs themselves are fine, but the
stale intermediate translations can cause the CPU to speculatively
load complete garbage into the TLB, and that's bad (and causes MCEs on
AMD CPUs).

I think we only really have two choices: tlb_defer_switch_to_init_mm()
== true and tlb_defer_switch_to_init_mm() == false. The current
heuristic is to not defer if we have PCID, because loading CR3 is
reasonably fast.
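
(For anyone following along: the heuristic in question lives in
arch/x86/include/asm/tlbflush.h and is roughly the below; paraphrased,
not a verbatim quote of the tree.)

static inline bool tlb_defer_switch_to_init_mm(void)
{
	/*
	 * With PCID, writing CR3 to switch to init_mm doesn't throw
	 * away the user ASID's TLB entries and is reasonably fast, so
	 * don't bother deferring.  Without PCID, the CR3 write flushes
	 * everything, so stay lazy and hope the switch can be avoided
	 * entirely.
	 */
	return !static_cpu_has(X86_FEATURE_PCID);
}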



> void native_flush_tlb_others(const struct cpumask *cpumask,
> 			     const struct flush_tlb_info *info)
> {
> +	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
> +
> 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> 	if (info->end == TLB_FLUSH_ALL)
> 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> @@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> 					       (void *)info, 1);
> 		return;
> 	}
> +
> +	if (tlb_defer_switch_to_init_mm() && flushmask) {
> +		int cpu;
> +
> +		cpumask_copy(flushmask, cpumask);
> +		for_each_cpu(cpu, flushmask) {
> +			if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
> +				__cpumask_clear_cpu(cpu, flushmask);

If this code path here executes when we're flushing because we just
removed a reference to a page table and we're about to free that page
table, then the CPU that we didn't notify by IPI can keep walking the
freed table and start consuming whatever gets written to the page
after it's freed, and that's bad :(

> +		}
> +
> +		cpumask = flushmask;
> +	}
> +
> 	smp_call_function_many(cpumask, flush_tlb_func_remote,
> 			       (void *)info, 1);
> }
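
(To make the above concrete: the hunk only covers the sending side.
The other half of such a scheme, which isn't shown, is what the lazy
CPU would have to do when it eventually switches back to a user mm.
Purely a sketch with a made-up helper name, assuming is_lazy == 2
means "a remote flush was deferred for this CPU":)

static void finish_lazy_tlb(void)
{
	/*
	 * If a remote flush was deferred while this CPU was lazy, do a
	 * full local flush before running user code again; otherwise
	 * just clear the lazy marker.
	 */
	if (this_cpu_cmpxchg(cpu_tlbstate.is_lazy, 2, 0) == 2)
		local_flush_tlb();
	else
		this_cpu_write(cpu_tlbstate.is_lazy, 0);
}

The objection above is about everything that happens *before* that
point: the lazy CPU keeps walking the old page tables for as long as
it stays in kernel mode, which is exactly where the pagetable-free
race bites.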