pre12 VM doubts and patch

From: Hugh Dickins (hugh@veritas.com)
Date: Wed Sep 19 2001 - 12:57:20 EST


Andrea,

I was unable to do much testing with -pre11 because of the spurious
allocation failures, but -pre12 fixes those for my particular testing:
thank you. A few doubts: in the first case I've appended a patch; for
the rest you may prefer to make your own patch, or ask me for patch,
or reject my doubts.

1. Your changes don't quite fit with my -pre4 swapoff changes. In
   free_pages_ok you want to treat PageDirty as a BUG, so you removed
   my SetPageDirty from try_to_unuse, and inserted a ClearPageDirty
   there. Not good: that gives me hangs, or else data loss, after
   swapoff. It is a cruel and unusual use of PageDirty (we could use a
   new flag for this purpose if it deserved it), but it fixes the race
   where try_to_swapout might clear a pte in between the page being
   removed from swap cache, and the pte being marked dirty (as it used
   to be) in unuse_pte; and allowed me to speed up swapoff, by not
   bothering to mark present ptes dirty at all. See brief comment on
   PageDirty which remains in try_to_swap_out. So patch below undoes
   your changes there (as before, clears PG_referenced too: I didn't
   look very hard, but wondered if that bit became undefined when a
   page was allocated in -pre11 and -pre12). I also had more safety
   where vmscan.c goes from PageDirty to writepage, could a swapoff'ed
   page get there with NULL mapping? I've not seen it, just a doubt.

2. swap_out_mm now takes mmlist_lock while holding a page_table_lock,
   whereas try_to_unuse takes page_table_lock while holding mmlist_lock.
   Yes, try_to_unuse holds mmlist_lock for a horrible length of time (I
   just changed it over from tasklist to mmlist), and now I believe the
   SWAP_MAP_MAX case to be imaginary, that could probably be changed.
   But it does look rather odd what swap_out_mm is doing, and I wonder
   if it wouldn't be better if you changed that? You've clearly thought
   about the issues at that end, which I have not, so I'm not proposing
   a patch, but we do need to resolve the deadlock. As I understand it,
   you don't like the way the old code darted away from a large and
   fruitful mm, now it scans the whole mm before going on to the next:
   sounds plausible, though I wonder, aren't other swapping out cpus
   liable to spin on its page_table_lock when they could be attacking
   other mms? At a guess, your way works better when a few large mms
   (probably the common case?), worse if many.

3. If exclusive_swap_page, do_swap_page unconditionally makes the pte
   writable. I think that's wrong, I think that's the same error Linus
   corrected in Rik's version there: if it's come in from swap, we know
   that the page has been dirtied in the past, but that does not imply
   that writing to it is now permitted. I think you need something like
                if (write_access)
                        pte = pte_mkwrite(pte);
                pte = pte_mkdirty(pte);
                delete_from_swap_cache_nolock(page);
   (and please remove that hesitant "#if 1" and "#if 0" from memory.c).

4. In 2.4.10-pre10, Linus removed the SetPageReferenced(page) from
   __find_page_nolock, and was adamant on lkml that it's inappropriate
   at that level. Later in the day, Linus produced 2.4.10-pre11 from
   your patches, and that SetPageReferenced(page) reappeared: oversight
   or intentional? Linus? more a question for you than Andrea.

5. With -pre12 I'm not getting the 0-order allocation failures which
   interfered with my -pre11 testing, but I did spend a while yesterday
   looking into that, and the patch I found successful was to move the
   "int nr_pages = SWAP_CLUSTER_MAX;" in try_to_free_pages from within
   the loop to the outer level: try_to_free_pages had freed 114 pages
   of the zone, but never as many as 32 in any one go round the loop.
   You'll have your own ideas of what's right and wrong here, and I'd
   rather stay clear of this area, but thought the observation worth
   passing on; and note that try_to_free_pages does look like work in
   progress - unused order argument, shrink_caches(,,, nr_pages)
   instead of explicit shrink_caches(,,, SWAP_CLUSTER_MAX).

Hugh

--- 2.4.10-pre12/mm/page_alloc.c Wed Sep 19 14:08:14 2001
+++ linux/mm/page_alloc.c Wed Sep 19 16:21:46 2001
@@ -86,8 +86,7 @@
                 BUG();
         if (PageInactive(page))
                 BUG();
- if (PageDirty(page))
- BUG();
+ page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
 
         if (current->flags & PF_FREE_PAGES)
                 goto local_freelist;
--- 2.4.10-pre12/mm/swapfile.c Wed Sep 19 14:08:14 2001
+++ linux/mm/swapfile.c Wed Sep 19 16:08:08 2001
@@ -452,6 +452,7 @@
                 lock_page(page);
                 if (PageSwapCache(page))
                         delete_from_swap_cache_nolock(page);
+ SetPageDirty(page);
                 UnlockPage(page);
                 flush_page_to_ram(page);
 
@@ -492,7 +493,6 @@
                         mmput(start_mm);
                         start_mm = new_start_mm;
                 }
- ClearPageDirty(page);
                 page_cache_release(page);
 
                 /*

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Sep 23 2001 - 21:00:31 EST