Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits erratum

From: Vlastimil Babka
Date: Wed Jul 13 2016 - 07:37:56 EST


On 07/02/2016 12:28 AM, Benjamin Herrenschmidt wrote:
On Fri, 2016-07-01 at 10:46 -0700, Dave Hansen wrote:
The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
Landing) has an erratum where a processor thread setting the Accessed
or Dirty bits may not do so atomically against its checks for the
Present bit. This may cause a thread (which is about to page fault)
to set A and/or D, even though the Present bit had already been
atomically cleared.

Interesting.... I always wondered where in the Intel docs did it specify
that present was tested atomically with setting of A and D ... I couldn't
find it.

Isn't there a more fundamental issue however that you may actually lose
those bits ? For example if we do an munmap, in zap_pte_range()

We first exchange all the PTEs with 0 with ptep_get_and_clear_full()
and we then transfer D that we just read into the struct page.

We rely on the fact that D will never be set again, i.e. that what we
got is a "final" D bit. That is, we rely on the fact that a processor either:

- Has a cached PTE in its TLB with D set, in which case it can still
write to the page until we flush the TLB or

- Doesn't have a cached PTE in its TLB with D set and so will fail
to do so due to the atomic P check, thus never writing.

With the erratum, don't you have a situation where a processor in the second
category will write and set D despite P having been cleared (due to the
race) and thus causing us to miss the transfer of that D to the struct
page and essentially completely miss that the physical page is dirty ?
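
To make the interleaving in question concrete, here is a rough user-space
sketch in C11. It is not kernel code: the toy_* names and struct toy_page are
hypothetical, the atomic exchange only stands in for what
ptep_get_and_clear_full() is described as doing above, and the "erratum" path
is an assumed model in which the P check and the D update are two separate
steps. It only models the PTE bits; whether the data write itself goes
through is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 PTE bit positions: Present = bit 0, Accessed = bit 5, Dirty = bit 6. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/* Hypothetical stand-in for struct page; only a dirty flag is modelled. */
struct toy_page {
    bool dirty;
};

/* Unmap side: swap the PTE with 0, then transfer the D bit we read. */
static uint64_t toy_zap_pte(_Atomic uint64_t *pte, struct toy_page *page)
{
    uint64_t old = atomic_exchange(pte, 0);

    if (old & PTE_DIRTY)
        page->dirty = true;
    return old;
}

/*
 * Expected hardware behaviour: set D only if P is still set, as a single
 * atomic step. Returns false where the access would fault instead.
 */
static bool toy_hw_set_dirty_atomic(_Atomic uint64_t *pte)
{
    uint64_t old = atomic_load(pte);

    do {
        if (!(old & PTE_PRESENT))
            return false;
    } while (!atomic_compare_exchange_weak(pte, &old, old | PTE_DIRTY));
    return true;
}

/* Assumed erratum model, step 1: the P check on its own. */
static bool toy_hw_check_present(_Atomic uint64_t *pte)
{
    return (atomic_load(pte) & PTE_PRESENT) != 0;
}

/* Assumed erratum model, step 2: D is set without re-checking P. */
static void toy_hw_set_dirty_late(_Atomic uint64_t *pte)
{
    atomic_fetch_or(pte, PTE_DIRTY);
}

/*
 * Replays one fixed interleaving: writer checks P, unmapper zaps, writer
 * updates D. Returns true if a D bit ends up in the cleared PTE without
 * having been transferred to the page.
 */
static bool toy_replay_race(bool erratum)
{
    _Atomic uint64_t pte = PTE_PRESENT | PTE_ACCESSED;
    struct toy_page page = { .dirty = false };

    bool saw_present = toy_hw_check_present(&pte); /* writer: P looks set */
    assert(saw_present);

    toy_zap_pte(&pte, &page);            /* unmapper: PTE -> 0, D was clear */

    if (erratum)
        toy_hw_set_dirty_late(&pte);     /* stale P check, D set anyway */
    else
        toy_hw_set_dirty_atomic(&pte);   /* P is clear, so D is not set */

    return (atomic_load(&pte) & PTE_DIRTY) && !page.dirty;
}
```

In this model toy_replay_race(true) leaves D set in an otherwise cleared PTE
while the page is still marked clean, and toy_replay_race(false) does not.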

Seems to me like this is indeed possible, but...

(Leading to memory corruption).

... what memory corruption, exactly? If a process is writing to its memory
from one thread and unmapping it from another thread at the same time, there
are no guarantees anyway, are there? Would anything sensible rely on the
guarantee that if the write in such a racy scenario didn't end up as a
segfault (i.e. unmapping was faster), then it must hit the disk? Or are
there any other scenarios where zap_pte_range() is called? Hmm, but how
does this affect the page migration scenario? Can we lose the D bit there?

And a maybe related thing that just occurred to me: what if a page is made
non-writable during fork() to catch COW? Is there any race in that one, or
is it just the P bit? But maybe the argument would be the same as above...