Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

From: Andy Lutomirski
Date: Mon Jun 19 2017 - 17:59:23 EST


On Sun, Jun 18, 2017 at 1:06 AM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB. This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely. When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask. This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
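(To make the mechanism concrete for reviewers: coming back out of lazy
mode conceptually does the check sketched below.  This is a sketch
only; names follow this series, and the authoritative logic is in
switch_mm_irqs_off() in the patch.)

	/*
	 * Sketch: "did this CPU miss a flush while it was lazy?"
	 * If our recorded (ctx_id, tlb_gen) still matches the mm, the
	 * TLB is fresh and no flush is needed.
	 */
	static void revalidate_lazy_tlb_sketch(struct mm_struct *next)
	{
		u64 next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		if (this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) ==
			    next->context.ctx_id &&
		    this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen) ==
			    next_tlb_gen) {
			/* No flushes were missed: the TLB is still fresh. */
			return;
		}

		/* A flush happened while we were lazy: catch up now. */
		local_flush_tlb();
		this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen);
	}
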
>> Note to reviewers: this patch, by itself, looks a bit odd. I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1. Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
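(Concretely, the bookkeeping being described is roughly the following;
this is an abridged sketch with field names as used in this series,
not the literal definitions.)

	struct tlb_context {
		u64 ctx_id;	/* which mm this entry describes       */
		u64 tlb_gen;	/* the mm's tlb_gen we last synced to  */
	};

	struct tlb_state {	/* abridged */
		struct mm_struct *loaded_mm;
		/*
		 * Length 1 for now; with PCID there will be one entry
		 * per ASID slot, which is why this is an array from
		 * the start.
		 */
		struct tlb_context ctxs[1];
	};
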
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works. This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>>
>> The UV tlbflush code is rather dated and should be changed.
>>
>> Cc: Andrew Banman <abanman@xxxxxxx>
>> Cc: Mike Travis <travis@xxxxxxx>
>> Cc: Dimitri Sivanich <sivanich@xxxxxxx>
>> Cc: Juergen Gross <jgross@xxxxxxxx>
>> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> ---
>> arch/x86/include/asm/mmu_context.h | 6 +-
>> arch/x86/include/asm/tlbflush.h | 4 -
>> arch/x86/mm/init.c | 1 -
>> arch/x86/mm/tlb.c | 242 +++++++++++++++++++------------------
>> 4 files changed, 131 insertions(+), 122 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index e5295d485899..69a4f1ee86ac 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>>
>> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> {
>> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> + int cpu = smp_processor_id();
>> +
>> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> The laziness indication that was in cpu_tlbstate.state may be a better
> indication of whether the cpu needs to be cleared from the previous
> mm_cpumask(). If you kept this indication, you could have used this
> per-cpu information in switch_mm_irqs_off() instead of
> "cpumask_test_cpu(cpu, mm_cpumask(next))", which might have been
> accessed by another core.
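
IIUC, the suggestion is something like the sketch below, where is_lazy
is a hypothetical per-cpu field that this patch does not add:

	/*
	 * Hypothetical alternative, not in this patch: keep a per-cpu
	 * laziness flag so that switch_mm_irqs_off() can test local
	 * state instead of a mm_cpumask() bit that other CPUs write.
	 */
	static inline void enter_lazy_tlb(struct mm_struct *mm,
					  struct task_struct *tsk)
	{
		this_cpu_write(cpu_tlbstate.is_lazy, true);
		cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
	}

	static inline bool tlbstate_is_lazy(void)
	{
		/* Purely local read; no cross-CPU cacheline traffic. */
		return this_cpu_read(cpu_tlbstate.is_lazy);
	}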

Hmm, fair enough. On the other hand, this is the least of our
problems in this particular case -- the scheduler's use of mmgrab()
and mmdrop() is probably at least as bad, if not worse. My preference
would be to get all this stuff merged and then see if we want to add
some scalability improvements on top.
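
(For reference, the mmgrab()/mmdrop() traffic I mean is roughly the
pattern below, paraphrased from the scheduler; it is illustrative, not
exact code.)

	/*
	 * Paraphrased sketch of kernel/sched/core.c: switching to a
	 * kernel thread grabs a reference to the old mm for lazy TLB,
	 * and finish_task_switch() drops it again later, possibly on
	 * another CPU.  Each pair is an atomic RMW on the mm's shared
	 * cacheline.
	 */
	static void context_switch_sketch(struct task_struct *prev,
					  struct task_struct *next)
	{
		struct mm_struct *oldmm = prev->active_mm;

		if (!next->mm) {
			next->active_mm = oldmm;
			mmgrab(oldmm);	/* atomic inc, shared line */
			enter_lazy_tlb(oldmm, next);
		} else {
			switch_mm_irqs_off(oldmm, next->mm, next);
		}
		/* ...and finish_task_switch() later does mmdrop(mm). */
	}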