Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

From: Andy Lutomirski
Date: Tue Jul 17 2018 - 18:28:24 EST




> On Jul 17, 2018, at 12:05 PM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
>
>
>> On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>>> Can I skip both the cr4 and let switches when the TLB contents
>>> are no longer valid and got reloaded?
>>>
>>> If the TLB contents are still valid, either because we never went
>>> into lazy TLB mode, or because no invalidates happened while
>>> we were lazy, we immediately return.
>>>
>>> The cr4 and ldt reloads only happen if the TLB was invalidated
>>> while we were in lazy TLB mode.
>>
>> Yes, since the only events that would change the LDT or the required
>> CR4 value will unconditionally broadcast to every CPU in mm_cpumask
>> regardless of whether they're lazy. The interesting case is that you
>> go lazy, you miss an invalidation IPI because you were lazy, then you
>> go unlazy, notice the tlb_gen change, and flush. If this happens, you
>> know that you only missed a page table update and not an LDT update or
>> a CR4 update, because the latter would have sent the IPI even though
>> you were lazy. So you should skip the CR4 and LDT updates.
>>
>> I suppose a different approach would be to fix the issue below and to
>> try to track when the LDT actually needs reloading. But that latter
>> part seems a bit complicated for minimal gain.
>>
>> (Do you believe me? If not, please argue back!)
>>
> I believe you :)
>
>>>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>>>> fix that part or should I?
>>>
>>> I would be happy to send in a patch for this, and one for
>>> the above optimization you pointed out.
>>>
>>
>> Yes please!
>>
> There is a third optimization left to do. Currently every time
> we switch into lazy tlb mode, we take a refcount on the mm,
> even when switching from one kernel thread to another, or
> when repeatedly switching between the same mm and kernel
> threads.
>
> We could keep that refcount (on a per cpu basis) from the time
> we first switch to that mm in lazy tlb mode, to when we switch
> the CPU to a different mm.
>
> That would allow us to not bounce the cache line with the
> mm_struct reference count on every lazy TLB context switch.
>
> Does that seem like a reasonable optimization?

Are you referring to the core sched code that deals with mm_count and active_mm? If so, last time I looked at it, I convinced myself that it was totally useless, at least on x86. I think the my reasoning was that, when mm_users went to zero, we already waited for RCU before tearing down page tables.

Things may have changed, but I strongly suspect that it should be possibly for at least x86 to opt out of mm_count and maybe even active_mm entirely. If nothing else, youâre shooting the mm out of CR3 on all CPUs whenever the pagetables get freed, and more or less the same logic should be sufficient so that, whenever mm_users hits zero, we can synchronously or via RCU callback kill the mm entirely.

Want to take a look at that?

>
> Am I overlooking anything?
>
> I'll try to get all three optimizations working, and will run them
> through some testing here before posting upstream.
>