Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges

From: Christoph Lameter
Date: Tue Jan 29 2008 - 16:53:22 EST


On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > We invalidate the range *after* populating it? Isnt it okay to establish
> > references while populate_range() runs?
>
> It's not ok because that function can very well overwrite existing and
> present ptes (it's actually the nonlinear common case fast path for
> db). With your code the sptes created between invalidate_range and
> populate_range, will keep pointing forever to the old physical page
> instead of the newly populated one.

Seems though that the mmap_sem is taken for regular vmas writably and will
hold off new mappings.

> I'm also asking myself if it's a smp race not to call
> mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
> in install_file_pte. Probably not because the guest VM running in a
> different thread would need to serialize outside the install_file_pte
> code with the task running install_file_pte, if it wants to be sure to
> write either all its data to the old or the new page. Certainly doing
> the invalidate_page inside the PT lock was obviously safe but I hope
> this is safe and this can accommodate your needs too.

But that would be doing two invalidates on one pte. One range and one page
invalidate.

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1639,8 +1639,6 @@ gotten:
> > > /*
> > > * Re-check the pte - we dropped the lock
> > > */
> > > - mmu_notifier(invalidate_range, mm, address,
> > > - address + PAGE_SIZE - 1, 0);
> > > page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > if (likely(pte_same(*page_table, orig_pte))) {
> > > if (old_page) {
> >
> > What we did is to invalidate the page (?!) before taking the pte lock. In
> > the lock we replace the pte to point to another page. This means that we
> > need to clear stale information. So we zap it before. If another reference
> > is established after taking the spinlock then the pte contents have
> > changed at the cirtical section fails.
> >
> > Before the critical section starts we have gotten an extra refcount on the
> > original page so the page cannot vanish from under us.
>
> The problem is the missing invalidate_page/range _after_
> ptep_clear_flush. If a spte is built between invalidate_range and
> pte_offset_map_lock, it will remain pointing to the old page
> forever. Nothing will be called to invalidate that stale spte built
> between invalidate_page/range and ptep_clear_flush. This is why for
> the last few days I kept saying the mmu notifiers have to be invoked
> _after_ ptep_clear_flush and never before (remember the export
> notifier?). No idea how you can deal with this in your code, certainly
> for KVM sptes that's backwards and unworkable ordering of operation
> (exactly as backwards are doing the tlb flush before pte_clear in
> ptep_clear_flush, think spte as a tlb, you can't flush the tlb before
> clearing/updating the pte or it's smp unsafe).

Hmmm... So we could only do an invalidate_page here? Drop the strange
invalidate_range()?

>
> > > @@ -1676,6 +1674,8 @@ gotten:
> > > page_cache_release(old_page);
> > > unlock:
> > > pte_unmap_unlock(page_table, ptl);
> > > + mmu_notifier(invalidate_range, mm, address,
> > > + address + PAGE_SIZE - 1, 0);
> > > if (dirty_page) {
> > > if (vma->vm_file)
> > > file_update_time(vma->vm_file);
> >
> > Now we invalidate the page after the transaction is complete. This means
> > external pte can persist while we change the pte? Possibly even dirty the
> > page?
>
> Yes, and the only reason this can be safe is for the reason explained
> at the top of the email, if the other cpu wants to serialize to be
> sure to write in the "new" page, it has to serialize with the
> page-fault but to serialize it has to wait the page fault to return
> (example: we're not going to call futex code until the page fault
> returns).

Serialize how? mmap_sem?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/