Re: Linux 1.3.36: 'bforget' is flawed

Linus Torvalds (Linus.Torvalds@cs.helsinki.fi)
Mon, 6 Nov 1995 17:58:55 +0200


Michael Elizabeth Chastain: "Re: Linux 1.3.36: 'bforget' is flawed" (Nov 6, 8:34):
> Linus Torvalds writes:
>
> > _However_, the mmap exit code actually does things the wrong way. It
> > _first_ releases the inode, and only _then_ it unmaps the (potentially
> > shared) pages.
>
> I was wondering about this! I will try re-ordering this today and see
> if it helps. If it does, I will send a patch. Of course you may beat
> me to it and your patches have more authority than mine. :)

I do have patches for this now, but I'll keep them for a while pending a
few cleanups. I'm appending the current working set in this mail, BUT
BE VERY CAREFUL. I haven't actually tested this (and my final patches
should actually do some things slightly differently).

> > Actually, "bforget()" assumes that the buffers _can't_ be shared with
> > memory maps, and that is enforced by "truncate()" etc with the
> > "vmtruncate()" code. Similarly, when removing the file with "unlink()",
> > the file is actually removed only after it is no longer used, so there
> > cannot in theory be any shared pages.
>
> Check me on this, please:
>
> In ext2, I see the following call stack:
>
> ext2_unlink
> iput
> ext2_put_inode
> ext2_truncate
> trunc_direct and trunc_idirect
> bforget
>
> 'vmtruncate' is not called in this path.

No, vmtruncate() shouldn't be called. It's called only for the actual
"truncate()" system calls, where we're actually changing the size of a
file that is currently active.

But the point with "ext2_unlink()" is that it should never call
"ext2_put_inode()" until the inode is no longer dereferenced anywhere,
so there can't be any mappings active when we get the trunc_direct etc..

The problem is that we do a few "iput()" calls in the memory management
code before we are actually completely done with the inode, so currently
the theory is only theory and practise is something else.. My patches
try to fix this..

> But in this call stack, 'iput' does not call 'ext2_put_inode' until
> 'inode->i_count' goes to zero. So it's ok for 'ext2_put_inode' to call
> 'ext2_truncate' without calling 'vmtruncate'. And it's really bad for
> 'unmap_fixup' and 'exit_mmap' to be calling 'iput' while they still have
> pages which could be shared with that inode in the buffer cache.

Exactly. The patch below tries to fix some of this, but as I already
warned: this is a work in progress and I won't guarantee that it
compiles (and if it compiles I won't guarantee that it works..)

Linus

----------
diff -u --recursive --new-file v1.3.37/linux/mm/filemap.c linux/mm/filemap.c
--- v1.3.37/linux/mm/filemap.c Mon Oct 23 18:02:20 1995
+++ linux/mm/filemap.c Mon Nov 6 16:05:07 1995
@@ -279,7 +279,7 @@
}

/*
- * This handles partial area unmaps..
+ * This handles (potentially partial) area unmaps..
*/
static void filemap_unmap(struct vm_area_struct *vma, unsigned long start, size_t len)
{
@@ -287,23 +287,15 @@
}

/*
- * This handles complete area closes..
- */
-static void filemap_close(struct vm_area_struct * vma)
-{
- filemap_sync(vma, vma->vm_start, vma->vm_end - vma->vm_start, MS_ASYNC);
-}
-
-/*
* Shared mappings need to be able to do the right thing at
* close/unmap/sync. They will also use the private file as
* backing-store for swapping..
*/
static struct vm_operations_struct file_shared_mmap = {
- NULL, /* open */
- filemap_close, /* close */
- filemap_unmap, /* unmap */
- NULL, /* protect */
+ NULL, /* no special open */
+ NULL, /* no special close */
+ filemap_unmap, /* unmap - we need to sync the pages */
+ NULL, /* no special protect */
filemap_sync, /* sync */
NULL, /* advise */
filemap_nopage, /* nopage */
diff -u --recursive --new-file v1.3.37/linux/mm/mmap.c linux/mm/mmap.c
--- v1.3.37/linux/mm/mmap.c Wed Sep 13 12:45:34 1995
+++ linux/mm/mmap.c Mon Nov 6 16:02:38 1995
@@ -768,7 +768,7 @@

if (mpnt->vm_ops && mpnt->vm_ops->unmap)
mpnt->vm_ops->unmap(mpnt, st, end-st);
-
+ zap_page_range(current->mm, addr, end-addr);
unmap_fixup(mpnt, st, end-st);
kfree(mpnt);
}
@@ -797,12 +797,16 @@
mm->mmap_avl = NULL;
while (mpnt) {
struct vm_area_struct * next = mpnt->vm_next;
- if (mpnt->vm_ops && mpnt->vm_ops->close)
- mpnt->vm_ops->close(mpnt);
+ if (mpnt->vm_ops) {
+ if (mpnt->vm_ops->unmap)
+ mpnt->vm_ops->unmap(mpnt, mpnt->vm_start, mpnt->vm_end-mpnt->vm_start);
+ if (mpnt->vm_ops->close)
+ mpnt->vm_ops->close(mpnt);
+ }
remove_shared_vm_struct(mpnt);
+ zap_page_range(mm, mpnt->vm_start, mpnt->vm_end-mpnt->vm_start);
if (mpnt->vm_inode)
iput(mpnt->vm_inode);
- zap_page_range(mm, mpnt->vm_start, mpnt->vm_end-mpnt->vm_start);
kfree(mpnt);
mpnt = next;
}
----------