Re: [PATCH] __free_pages_ok oops

From: Andrea Arcangeli (andrea@suse.de)
Date: Thu Feb 14 2002 - 08:10:28 EST


On Thu, Feb 14, 2002 at 12:10:37PM +0100, Gerd Knorr wrote:
> > However: that is the only unambiguous example I've seen, and you
> > may argue that his bttv 0.8 driver is not in the current 2.4 tree,
> > is experimental, and even wrong in that area (we now know it also
> > vfrees there).
>
> I've recently changed the code to make it *not* call unmap_kiobuf/vfree
> from irq context. Instead bttv 0.8.x doesn't allow you to close the
> device with DMA xfers in flight. If you try this the release() fops
> handler will block until the transfer is done, then unmap_kiobuf from
> process context, then return.

perfect, that's the right fix for 2.4 (waiting DMA to complete at
->release looks also much saner). unmap_kiobuf wasn't supposed to be run
from irq handlers. Everything dealing with userspace mappings cannot run
from irq handlers, tlb flushes, VM, swapping etc... everything must run
from normal kernel context. If you obey this rule, my previous email to
this thread will still apply. I wasn't aware of bttv running
unmap_kiobuf from irq.

With aio in 2.5 we may want to change this property for the unpinning
stage that would be better run asynchronously from irq handlers, but I
wouldn't change that for 2.4 (at least until we're forced to ship aio in
production on top 2.4, that cannot happen until a final user<->kernel API is
registered somewhere).

I think the foundamental design mistake that leads to __free_pages to
fail from irq, is that we allow an anonymous page to reach count 0 and to be
still in the LRU (the count == 0 check in shrink_cache is the other side
of the hack too). That's the real BUG, that breaks subtly the freelist
semantics, and then we need to make horrible hacks like last Hugh's
patch to work around such magic case (or even worse Rik's proposal for a
spin_lock_irqed list that would hurt in all the vm fast paths). As far
as clean design and orthogonality of subsystem is concerned, the right
fix for 2.5 is to bump the page->count by the time the anonymous page is
added to the lru (think and guess why we're doing that for the
pagecache, and why the pagecache is obviously safe even for aio and
unmap_kiobuf from irq). Then we need to keep it into account during
COWs (page count == 2 for an anonymous page will mean "exclusive")
etc... the MM will need to be changed a little more heavily than with
the hack approch, but I think that's the clean design in the long run.
No special cases for those magic anonymous pages, everything goes in
pagecache, with the difference the anonymous pages aren't hashed (until
they becomes swapcache at least). The semantics of __free_pages will
remain that if you own a page you can __free_pages it anytime you want
without running into BUGS(). If the page also owned by some other
subsystem (the VM), such subsystem will need to take care of bumping the
refernece count and to free the page later lazily. No collisions.

As said this should be a matter only for 2.5, now that Gerd recalls
unmap_kiobuf from normal kernel context.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Fri Feb 15 2002 - 21:01:01 EST