Re: [PATCH -v2] mm: do batched scans for mem_cgroup

From: Wu Fengguang
Date: Thu Aug 20 2009 - 07:50:09 EST


On Thu, Aug 20, 2009 at 07:01:21PM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> > On Thu, Aug 20, 2009 at 1:05 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> > On Thu, Aug 20, 2009 at 11:13:47AM +0800, KAMEZAWA Hiroyuki wrote:
> >> On Thu, 20 Aug 2009 10:49:29 +0800
> >> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> >>
> >> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> >> > in which case shrink_list() _still_ calls isolate_pages() with the much
> >> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> >> > scan rate by up to 32 times.
> >> >
> >> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> >> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> >> > list, the list will in effect be scanned by SWAP_CLUSTER_MAX=32 pages.
> >> >
> >> > The accesses to nr_saved_scan are not lock protected and so not 100%
> >> > accurate; however, we can tolerate small errors and the resulting small
> >> > imbalances in scan rates between zones.
> >> >
> >> > This batching won't blur the cgroup limits, since it is driven by
> >> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> >> > decides to cancel (and save) one smallish scan, it may well be called
> >> > again to accumulate nr_saved_scan.
> >> >
> >> > It could possibly be a problem for some tiny mem_cgroup (which may be
> >> > _fully_ scanned too many times in order to accumulate nr_saved_scan).
> >> >
> >> > CC: Rik van Riel <riel@xxxxxxxxxx>
> >> > CC: Minchan Kim <minchan.kim@xxxxxxxxx>
> >> > CC: Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx>
> >> > CC: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> >> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> >> > ---
> >>
> >> Hmm, how about this?
> >> ==
> >> Right now, nr_saved_scan is tied to the zone's LRU.
> >> But considering how vmscan works, it should be tied to reclaim_stat.
> >>
> >> With this, memcg can make use of the nr_saved_scan information seamlessly.
> >
> > Good idea, full patch updated with your signed-off-by :)
> >
> > Thanks,
> > Fengguang
> > ---
> > mm: do batched scans for mem_cgroup
> >
> > For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1,
> > in which case shrink_list() _still_ calls isolate_pages() with the much
> > larger SWAP_CLUSTER_MAX.  It effectively scales up the inactive list
> > scan rate by up to 32 times.
>
> Yes. It can scan 32 times more pages, but only on the inactive list, not
> the active list.

Yes and no ;)

inactive anon list over-scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over-scanned

So the end result may be

- anon inactive => over-scanned
- anon active => over-scanned (maybe not as much)
- file inactive => over-scanned
- file active => under-scanned (relatively)
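
The link between the two is the anon rebalance check at the end of
shrink_zone(). Roughly, from this era's mm/vmscan.c (paraphrased from
memory, so treat it as a sketch rather than verbatim kernel code):

	/* the inactive anon list looks "low" once over-scanning deflates it */
	static int inactive_anon_is_low_global(struct zone *zone)
	{
		unsigned long active, inactive;

		active = zone_page_state(zone, NR_ACTIVE_ANON);
		inactive = zone_page_state(zone, NR_INACTIVE_ANON);

		return inactive * zone->inactive_ratio < active;
	}

	/* ...and at the end of shrink_zone(): */
	if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);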

> > For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
> > So when shrink_zone() expects to scan 4 pages in the active/inactive
> > list, the list will in effect be scanned by SWAP_CLUSTER_MAX=32 pages.
>
> The active list would be scanned with 4 pages, the inactive list with 32.

Exactly.
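
For reference, the nr_scan_try_batch() helper (already in mm/vmscan.c
and reused as-is by this patch; quoted roughly from memory, so take it
as a sketch) is what converts small scan requests into deferred batches:

	static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
					       unsigned long *nr_saved_scan,
					       unsigned long swap_cluster_max)
	{
		unsigned long nr;

		*nr_saved_scan += nr_to_scan;	/* accumulate the request */
		nr = *nr_saved_scan;

		if (nr >= swap_cluster_max)
			*nr_saved_scan = 0;	/* batch is due: scan it all */
		else
			nr = 0;			/* too small: defer for now */

		return nr;
	}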

> >
> > The accesses to nr_saved_scan are not lock protected and so not 100%
> > accurate; however, we can tolerate small errors and the resulting small
> > imbalances in scan rates between zones.
>
> Yes.
>
> > This batching won't blur the cgroup limits, since it is driven by
> > "pages reclaimed" rather than "pages scanned". When shrink_zone()
> > decides to cancel (and save) one smallish scan, it may well be called
> > again to accumulate nr_saved_scan.
>
> You mean the nr_scan_try_batch() logic?
> But doesn't that logic apply only to global reclaim?
> Or am I missing something?
>
> Could you elaborate? :)

Sorry for the confusion. The above paragraph originates from Balbir's
concern:

This might be a concern (although not a big one ATM), since we can't
afford to miss limits by much. If a cgroup is near its limit and we
drop scanning of it, we'll have to work out what this means for the
end user. Maybe a more fundamental look at the priority-based logic
of deciding how much to scan is required, I don't know.
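
To make that concrete, here is the hand-computed behaviour of the
nr_scan_try_batch() helper quoted above, for a cgroup whose
shrink_zone() asks for 4 pages per call with swap_cluster_max=32:

	call             1    2    3    4    5    6    7    8
	nr_saved_scan    4    8   12   16   20   24   28    0
	pages scanned    0    0    0    0    0    0    0   32

Nothing is dropped: over any 8 calls the full 32 pages are still
scanned. The batching only changes _when_ the scan happens, not
_whether_ it happens, so the limits should not be missed by much.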

Thanks,
Fengguang

> > It could possibly be a problem for some tiny mem_cgroup (which may be
> > _fully_ scanned too many times in order to accumulate nr_saved_scan).
> >
> > CC: Rik van Riel <riel@xxxxxxxxxx>
> > CC: Minchan Kim <minchan.kim@xxxxxxxxx>
> > CC: Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx>
> > CC: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> > ---
> >  include/linux/mmzone.h |    6 +++++-
> >  mm/page_alloc.c        |    2 +-
> >  mm/vmscan.c            |   20 +++++++++++---------
> >  3 files changed, 17 insertions(+), 11 deletions(-)
> >
> > --- linux.orig/include/linux/mmzone.h	2009-07-30 10:45:15.000000000 +0800
> > +++ linux/include/linux/mmzone.h	2009-08-20 11:51:08.000000000 +0800
> > @@ -269,6 +269,11 @@ struct zone_reclaim_stat {
> >  	 */
> >  	unsigned long		recent_rotated[2];
> >  	unsigned long		recent_scanned[2];
> > +
> > +	/*
> > +	 * accumulated for batching
> > +	 */
> > +	unsigned long		nr_saved_scan[NR_LRU_LISTS];
> >  };
> >
> >  struct zone {
> > @@ -323,7 +328,6 @@ struct zone {
> >  	spinlock_t		lru_lock;
> >  	struct zone_lru {
> >  		struct list_head list;
> > -		unsigned long nr_saved_scan;	/* accumulated for batching */
> >  	} lru[NR_LRU_LISTS];
> >
> >  	struct zone_reclaim_stat reclaim_stat;
> > --- linux.orig/mm/vmscan.c	2009-08-20 11:48:46.000000000 +0800
> > +++ linux/mm/vmscan.c	2009-08-20 12:00:55.000000000 +0800
> > @@ -1521,6 +1521,7 @@ static void shrink_zone(int priority, st
> >  	enum lru_list l;
> >  	unsigned long nr_reclaimed = sc->nr_reclaimed;
> >  	unsigned long swap_cluster_max = sc->swap_cluster_max;
> > +	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >  	int noswap = 0;
> >
> >  	/* If we have no swap space, do not bother scanning anon pages. */
> > @@ -1540,12 +1541,9 @@ static void shrink_zone(int priority, st
> >  			scan >>= priority;
> >  			scan = (scan * percent[file]) / 100;
> >  		}
> > -		if (scanning_global_lru(sc))
> > -			nr[l] = nr_scan_try_batch(scan,
> > -						  &zone->lru[l].nr_saved_scan,
> > -						  swap_cluster_max);
> > -		else
> > -			nr[l] = scan;
> > +		nr[l] = nr_scan_try_batch(scan,
> > +					  &reclaim_stat->nr_saved_scan[l],
> > +					  swap_cluster_max);
> >  	}
> >
> >  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > @@ -2128,6 +2126,7 @@ static void shrink_all_zones(unsigned lo
> >  {
> >  	struct zone *zone;
> >  	unsigned long nr_reclaimed = 0;
> > +	struct zone_reclaim_stat *reclaim_stat;
> >
> >  	for_each_populated_zone(zone) {
> >  		enum lru_list l;
> > @@ -2144,11 +2143,14 @@ static void shrink_all_zones(unsigned lo
> >  						l == LRU_ACTIVE_FILE))
> >  				continue;
> >
> > -			zone->lru[l].nr_saved_scan += (lru_pages >> prio) + 1;
> > -			if (zone->lru[l].nr_saved_scan >= nr_pages || pass > 3) {
> > +			reclaim_stat = get_reclaim_stat(zone, sc);
> > +			reclaim_stat->nr_saved_scan[l] +=
> > +						(lru_pages >> prio) + 1;
> > +			if (reclaim_stat->nr_saved_scan[l]
> > +						>= nr_pages || pass > 3) {
> >  				unsigned long nr_to_scan;
> >
> > -				zone->lru[l].nr_saved_scan = 0;
> > +				reclaim_stat->nr_saved_scan[l] = 0;
> >  				nr_to_scan = min(nr_pages, lru_pages);
> >  				nr_reclaimed += shrink_list(l, nr_to_scan, zone,
> >  								sc, prio);
> > --- linux.orig/mm/page_alloc.c	2009-08-20 11:57:54.000000000 +0800
> > +++ linux/mm/page_alloc.c	2009-08-20 11:58:39.000000000 +0800
> > @@ -3716,7 +3716,7 @@ static void __paginginit free_area_init_
> >  		zone_pcp_init(zone);
> >  		for_each_lru(l) {
> >  			INIT_LIST_HEAD(&zone->lru[l].list);
> > -			zone->lru[l].nr_saved_scan = 0;
> > +			zone->reclaim_stat.nr_saved_scan[l] = 0;
> >  		}
> >  		zone->reclaim_stat.recent_rotated[0] = 0;
> >  		zone->reclaim_stat.recent_rotated[1] = 0;
> >
> >
>
>
>
> --
> Kind regards,
> Minchan Kim