Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

From: Nadav Amit
Date: Fri Nov 10 2023 - 17:20:10 EST




> On Nov 10, 2023, at 5:13 AM, Byungchul Park <byungchul@xxxxxx> wrote:
>
>
>> Here,
>>
>> mprotect()
>> do_mprotect_pkey()
>> tlb_finish_mmu()
>> tlb_flush_mmu()
>>
>> I thought TLB flush for mprotect() is performed by tlb_flush_mmu() so
>> any cached TLB entries on other CPUs can have chance to update. Could
>> you correct me if I get it wrong? Thanks.
>
> I guess you tried to inform me that x86 mmu automatically keeps the
> consistancy based on cached TLB entries. Right? If yes, I should do
> something on that path. If not, it's not problematic. Thoughts?

Perhaps I lost something in this whole scheme. Overall, I find it overly
complicated and somewhat hard to follow.

Ideally, a solution should reduce the number of TLB flushing mechanisms
and not introduce yet another one that would further increase the already
high complexity.

Anyhow, I a bit convoluted 2 scenarios, and I’m not sure whether they
are a potential problem.

(1) Assume there is a RO page migration and you skipped a TLB flush.
Then you set a RO PTE entry for that page. Afterwards, you have mprotect()
that updates the PTE for that page to be RW.

Now, tlb_finish_mmu() will do a TLB flush eventually in the mprotect()
flow, but until it is done, you might have one CPU have RO pointing to
the source page (no TLB flush, right?) and another having RW access
that were loaded from the updated PTE. Did I miss a TLB flush that
should take place beforehand?


(2) Indeed we encountered many problems when TLB flushing decisions
are based on PTEs that are read from the page-tables and those do
not reflect the values that are held in the TLBs.


Overall, I think that a solution is to consolidate the TLB batching
mechanisms and hold per a group of PTEs (e.g., 512 PTEs held in one
page-table) the deferred TLB flush generation. Track the “done” TLB
flush generation, if the deferred is greater than the “done”, perform
a TLB flush.

This scheme is a bit similar to what you were doing, but more general,
and easier to follow. I think that its value might be more in simplifying
the code and reasoning than just performance.

IIRC, it kind of reminds the FreeBSD deferred TLB flushing scheme.