Re: [parisc-linux] Re: [PATCH 3/9] mm: parisc pte atomicity

From: Hugh Dickins
Date: Sun Oct 23 2005 - 23:48:56 EST


On Sun, 23 Oct 2005, James Bottomley wrote:
> On Sun, 2005-10-23 at 10:02 +0100, Hugh Dickins wrote:
>
> The change does slightly worry me in that it alters the behaviour of
> flush_cache_page() because now it checks the pfn whereas previously it
> didn't. This means that previously we would flush the COW'd page of a
> shared mapping, now we won't.

Perhaps you're thinking of some use of flush_cache_page that no longer
exists in the tree? All the uses I see are passing in the right pfn,
and the question is why translation_exists even gets called by it.

Even those uses of flush_cache_page in asm-parisc/cacheflush.h itself
(which I'd missed before): copy_to_user_page and copy_from_user_page
are used by nothing but access_process_vm, after get_user_pages:
so any COW has already been done, and the right pfn is passed in.
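For concreteness, the access_process_vm pattern I mean is roughly this
(from memory and trimmed, so treat it as a sketch rather than the exact
mm/memory.c code):

	struct page *page;
	struct vm_area_struct *vma;

	/* get_user_pages breaks any COW first, so "page" (and hence
	 * the pfn that copy_to_user_page hands to flush_cache_page)
	 * is the page actually mapped at addr */
	ret = get_user_pages(tsk, mm, addr, 1, write, 1, &page, &vma);
	if (ret <= 0)
		break;

	maddr = kmap(page);
	if (write) {
		copy_to_user_page(vma, page, addr,
				  maddr + offset, buf, bytes);
		set_page_dirty_lock(page);
	} else {
		copy_from_user_page(vma, page, addr,
				    buf, maddr + offset, bytes);
	}
	kunmap(page);
	page_cache_release(page);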

> > I'm right, aren't I? that the previous pte_none test was actually letting
> > through swap entries and pte_file entries which might look as if they had
> > the right pfn, just by coincidence of their offsets? So flush_dcache_page
> > would stop, thinking it had done a good flush, when actually it hadn't.
>
> Actually, no: pte_none() on parisc is true when either the pte is zero
> or _PAGE_FLUSH (which is our private flag saying we're preserving the
> pte entry for the purposes of flushing even though it's gone) is set.

Sorry, "pte_none test" was my lazy shorthand for the actual test:
	if (pte_none(*pte) && ((pte_val(*pte) & _PAGE_FLUSH) == 0))
		return NULL;
from which point on it assumes the pte is valid. I'm contending that
a swap entry or pte_file entry does not trigger the return NULL there,
so is then treated as a valid pte: and pte_pfn on it might coincidentally
match the target pfn.

I believe my
	/* Filter out coincidental file entries and swap entries */
	if (!(pte_val(pte) & (_PAGE_FLUSH|_PAGE_PRESENT)))
		return 0;
is correct, and avoids that problem (plus avoiding the complication
of parisc's unintuitive pte_none).
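Put another way, the test I'm proposing amounts to a little helper like
this (just a sketch, and the helper name is my invention):

	/* does this pte still translate to the pfn we want to flush?
	 * swap and file entries have neither _PAGE_FLUSH nor
	 * _PAGE_PRESENT set, so they are rejected here instead of
	 * having pte_pfn applied to their offset bits */
	static inline int pte_translates_pfn(pte_t pte, unsigned long pfn)
	{
		/* Filter out coincidental file entries and swap entries */
		if (!(pte_val(pte) & (_PAGE_FLUSH|_PAGE_PRESENT)))
			return 0;
		return pte_pfn(pte) == pfn;
	}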

> However, now that I
> look at it, are you thinking our ptep_get_and_clear() should be doing a
> check for _PAGE_PRESENT before it sets _PAGE_FLUSH?

That's certainly not what I was thinking. No, I'm pretty sure that
ptep_get_and_clear only gets called when we've got the right lock and
know the pte is present: I don't think you need to make any change there.
(The "pretty sure" reflecting that there's a bit of a macro maze here now,
so actually following up each reference would take a bit longer.)

But you may be seeing something I don't see: I've only just met your
_PAGE_FLUSH, and that unusual pte_none might be letting something
through that I'm overlooking but you're aware of.

> > But races remain: the pte good at the moment translation_exists checks it,
> > may have been taken out by kswapd a moment later (flush_dcache_mmap_lock
> > only secures the vma_prio_tree structure, not ptes within the vmas);
> > or it may have been COWed in the interval; or zapped from the mm.
> >
> > Can you get a success code out of __flush_cache_page? Or perhaps you
> > should run translation_exists a second time after the __flush_cache_page,
> > to check nothing changed (the pte pointer would then be helpful, to save
> > descending a second time)? Or perhaps it all works out, that any such
> > change which might occur, itself does the __flush_cache_page you need?
>
> Yes, I know ... I never liked the volatility of this, but it's better
> than what was there before, believe me (previously we just flushed the
> first entry we found regardless).

I do believe you!

> Getting a return code out of __flush_dcache_page() is hard because it
> doesn't know if the tlb miss handler nullified the instructions it's
> trying to execute; and they're interruption handlers (meaning we don't
> push anything on the stack for them, they run in shadow registers), so
> getting a return code out of them is next to impossible.

Right, it was just a thought.

> For the flush to be effective in the VIPT cache, we have to have a
> virtual address with a valid translation that points to the correct
> physical address. I suppose we could flush through the tmpalias space
> instead. That's where we set up an artificial translation that's not the
> actual vaddr but instead is congruent to the vaddr so the mapping is
> effective in cache aliasing terms. Our congruence boundary is 4MB, so
> we set up a private (per cpu) 4MB space (tmpalias) where we can set up
> our pte's (or actually manufacture them in the tlb miss handler)
> securely.
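If I follow, the tmpalias flush amounts to something like this (my own
sketch of the idea from your description; the window base and helper
name are invented, only the 4MB congruence is yours):

	/* the cache only cares about vaddr modulo the 4MB congruence
	 * boundary, so flushing through a per-cpu window at an address
	 * congruent to the user vaddr hits the same cache lines */
	#define CONGRUENCE_MASK	(0x00400000UL - 1)	/* 4MB - 1 */

	static unsigned long tmpalias_vaddr(unsigned long percpu_window,
					    unsigned long user_vaddr)
	{
		return percpu_window + (user_vaddr & CONGRUENCE_MASK);
	}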

I have no appreciation at all of how that would compare. On the one hand
it sounds like a lot of overhead you'd understandably prefer to avoid;
on the other hand, this hit-and-miss search of the vma_prio_tree might
be a worse overhead better avoided. I simply can't judge.

But I made two suggestions above that might be less work. One,
requiring no work but research, that _perhaps_ in all the racy cases
you can rely on the __flush_cache_page having been done by the racer?
I can't tell, and it may be obvious to you that that's a non-starter.

The other: just check the pte again after the __flush_cache_page; if
it's no longer right, assume the flush didn't work and continue.
There is a small chance that the pte was right before, wrong during the
flush, then right after (something else faulted it back in, while we
were away... I was about to say handling an interrupt, but actually
even interrupts are disabled here by flush_dcache_mmap_lock, since
that's a reuse of mapping->tree_lock). But it would improve the
safety by several orders of magnitude.
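In code, that second suggestion would look roughly like this (a sketch
only; pte_translates_pfn is the hypothetical helper from above, and
ptep/addr/pfn are whatever the search loop already has in hand):

	if (pte_translates_pfn(*ptep, pfn)) {
		__flush_cache_page(vma, addr);
		/* re-check: if the pte changed under us, the flush may
		 * have gone through a stale or absent translation */
		if (pte_translates_pfn(*ptep, pfn))
			break;	/* flush very probably effective */
		/* otherwise keep scanning the vma_prio_tree */
	}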

Hugh