Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case ofvma_merge succeeding in copy_vma

From: Andrea Arcangeli
Date: Thu Nov 17 2011 - 21:17:57 EST


On Fri, Nov 18, 2011 at 09:42:05AM +0800, Nai Xia wrote:
> First of all, I believe that at the POSIX level, it's ok for
> truncate_inode_page()
> not scanning COWed pages, since basically we does not provide any guarantee
> for privately mapped file pages for this behavior. But missing a file
> mapped pte after its
> cache page is already removed from the the page cache is a

I also exclude there is a case that would break, but it's safer to
keep things as is, in case somebody depends on segfault trapping.

> fundermental malfuntion for
> a shared mapping when some threads see the file cache page is gone
> while some thread
> is still r/w from/to it! No matter how short the gap between
> truncate_inode_page() and
> the second loop, this is wrong.

Truncate will destroy the info on disk too... so if somebody is
writing to a mapping which points beyond the end of the i_size
concurrently with truncate, the result is undefined. The write may
well reach the page but then the page is discared. Or you may get
SIGBUS before the write.

> Second, even if the we don't care about this POSIX flaw that may
> introduce, a pte can still
> missed by the second loop. mremap can happen serveral times during
> these non-atomic
> firstpass-trunc-secondpass operations, a proper events can happily
> make the wrong order
> for every scan, and miss them all -- That's just what in Hugh's mind
> in the post you just
> replied. Without lock and proper ordering( which patial mremap cannot provide),
> this *will* happen.

There won't be more than one mremap running concurrently from the same
process (we must enforce it by making sure anon_vma lock and
i_mmap_lock are both taken at least once in copy_vma, they're already
both taken in fork, they should already be taken in all common cases
in copy_vma so for all cases it's going to be a L1 exclusive cacheline
already). I don't exclude there may be some case that won't take the
locks in vma_adjust though, we should check it, if we decide to relay
on the double loop, but it'd be a simple addition if needed.

I'm more concerned about the pte pointing to the orphaned pagecache
that would materialize for a little while because of
unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter
order is needed for the COWs).

> You may disagree with me and have that locking removed, and I am
> already have that
> one line patch prepared waiting fora bug bumpping up again, what a
> cheap patch submission!

Well I'm not yet sure it's good idea to remove the i_mmap_mutex, or if
we should just add the anon_vma lock in mremap and add the i_mmap_lock
in fork (to avoid the orphaned pagecache left mapped in the child
which already may happen unless there's some i_mmap_lock belonging to
the same inode taken after copy_page_range returns until we return to
userland and child can run, and I don't think we can relay on the
order of the prio tree in fork. Fork is safe for anon pages because
there we can relay on the order of the same_anon_vma list.

I think clearing up if this orphaned pagecache is dangerous would be a
good start. If too complex we just add the i_mmap_lock around
copy_page_range in fork if vma->vm_file is set. If you instead think
we can deal with the orphaned pagecache we can add a dummy lock/unlock
of i_mmap_mutex in copy_vma vma_merge succeeding case (short critical
section and not common common case) and remove the i_mmap_mutex around
move_page_tables (common case) overall speeding up mremap and not
degrading fork.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/