Re: [patch 8/9] mm: thrash detection-based file cache sizing

From: Andrew Morton
Date: Tue Aug 20 2013 - 16:59:28 EST


On Sat, 17 Aug 2013 15:31:22 -0400 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past. We call the recently used
> list "inactive list" and the frequently used list "active list".
>
> The tricky part of this model is finding the right balance between
> them. A big inactive list may not leave enough room for the active
> list to protect all the frequently used pages. A big active list may
> not leave enough room for the inactive list for a new set of
> frequently used pages, "working set", to establish itself because the
> young pages get pushed out of memory before having a chance to get
> promoted.
>
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list. This model gave established
> working sets more gracetime in the face of temporary use once streams,
> but was not satisfactory when use once streaming persisted over longer
> periods of time and the established working set was temporarily
> suspended, like a nightly backup evicting all the interactive user
> program data.
>
> Subsequently, the rules were changed to only age active pages when
> they exceeded the amount of inactive pages, i.e. leave the working set
> alone as long as the other half of memory is easy to reclaim use once
> pages. This works well until working set transitions exceed the size
> of half of memory and the average access distance between the pages of
> the new working set is bigger than the inactive list. The VM will
> mistake the thrashing new working set for use once streaming, while
> the unused old working set pages are stuck on the active list.
>
> This patch solves this problem by maintaining a history of recently
> evicted file pages, which in turn allows the VM to tell used-once page
> streams from thrashing file cache.
>
> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot. Upon refault of that
> page, the difference between the current value of that counter and the
> shadow entry value is called the refault distance. It tells how many
> pages have been evicted from the zone since that page's eviction,
> which is how many page slots at most are missing from the zone's
> inactive list for this page to get accessed twice while in memory. If
> the number of missing slots is less than or equal to the number of
> active pages, increasing the inactive list at the cost of the active
> list would give this thrashing set a chance to establish itself:
>
> eviction counter = 4
> evicted inactive active
> Page cache data: [ a b c d ] [ e f g h i j k ] [ l m n ]
> Shadow entries: 0 1 2 3
> Refault distance: 4 3 2 1
>
> When c is faulted back into memory, it is noted that at most two more
> page slots on the inactive list could have prevented the refault (it
> could be less if c is used out of order). Thus, the active list needs
> to be challenged as it is possible that c is used more frequently than
> l, m, n. However, there is no access frequency information available
> on active pages so the pages have to be put in direct competition with
> each other before deciding which one to keep. Thus, 1) pages can not
> be directly reclaimed from the tail of the active list and b)
> refaulting pages can not be directly activated. Instead, active pages
> are moved from the tail of the active list to the head of the inactive
> list and placed directly next to the refaulting pages. This way, they
> both have the same time on the inactive list to prove which page is
> actually used more frequently without incurring unnecessary major
> faults or diluting the active page set in case the previously active
> page is in fact the more frequently used one.
>
> Also, since the refault of c could have been due to a spurious access,
> only one active page per qualifying refault is challenged. This will
> keep the impact of outliers low but still detect if bigger groups of
> pages are refaulting.
>
> ...
>
> +/*
> + * Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list. Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail. Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * inactive list requires more space to detect repeatedly accessed
> + * pages in the current workload and prevent them from thrashing:
> + *
> + * fault -----------------------+
> + * |
> + * +-------------+ | +-------------+
> + * reclaim <- | inactive | <-+-- demotion | active | <--+
> + * +-------------+ +-------------+ |
> + * | |
> + * +----------- promotion ------------------+
> + *
> + *
> + * Access frequency and refault distance
> + *
> + * A workload is thrashing when the distances between the first and
> + * second access of pages that are frequently used is bigger than the
> + * current inactive clock list size, as the pages get reclaimed before
> + * the second access would have promoted them instead:
> + *
> + * Access #: 1 2 3 4 5 6 7 8 9
> + * Page ID: x y b c d e f x y
> + * | inactive |
> + *
> + * To prevent this workload from thrashing, a bigger inactive list is
> + * required. And the only way the inactive list can grow on a full
> + * zone is by taking away space from the corresponding active list.
> + *
> + * +-inactive--+-active------+
> + * x y | b c d e f | G H I J K L |
> + * +-----------+-------------+
> + *
> + * Not every refault should lead to growing the inactive list at the
> + * cost of the active list, however: if the access distances are
> + * bigger than available memory overall, there is little point in
> + * challenging the protected pages on the active list, as those
> + * refaulting pages will not fit completely into memory.
> + *
> + * It is prohibitively expensive to track the access frequency of
> + * in-core pages, but it is possible to track their refault distance,
> + * which is the number of page slots shrunk from the inactive list
> + * between a page's eviction and subsequent refault. This indicates
> + * how many page slots are missing on the inactive list in order to
> + * prevent future thrashing of that page. Thus, instead of comparing
> + * access frequency to total available memory, one can compare the
> + * refault distance to the inactive list's potential for growth: the
> + * size of the active list.
> + *
> + *
> + * Rebalancing the lists
> + *
> + * Shrinking the active list has to be done carefully because the
> + * pages on it may have vastly different access frequencies compared
> + * to the pages on the inactive list. Thus, pages are not reclaimed
> + * directly from the tail of the active list, but instead moved to the
> + * head of the inactive list. This way, they are competing directly
> + * with the pages that challenged their protected status. If they are
> + * unused, they will eventually be reclaimed, but if they are indeed
> + * used more frequently than the challenging inactive pages, they will
> + * be reactivated. This allows the existing protected set to be
> + * challenged without incurring major faults in case of a mistake.
> + */

The consequences of a 32-bit wraparound of the refault distance still
concern me. It's a rare occurrence and it is difficult to determine
what will happen. An explicit design-level description here would be
useful.

> +static void *pack_shadow(unsigned long time, struct zone *zone)
> +{
> + time = (time << NODES_SHIFT) | zone_to_nid(zone);
> + time = (time << ZONES_SHIFT) | zone_idx(zone);
> + time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> + return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}

"time" is normally in jiffies ;)

Some description of the underlying units of workingset_time would be
helpful.

>
> ...
>
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected. Or
> + * %NULL when the eviction should not be remembered.
> + */

Describe the locking requirements here. It's part of the interface.
And it's the part the compiler can't check, so extra care is needed.

> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> + unsigned long time;
> +
> + time = atomic_long_inc_return(&zone->workingset_time);
> +
> + /*
> + * Don't store shadows in an inode that is being reclaimed.
> + * This is not just an optizimation, inode reclaim needs to
> + * empty out the radix tree or the nodes are lost, so don't
> + * plant shadows behind its back.
> + */
> + if (mapping_exiting(mapping))
> + return NULL;
> +
> + return pack_shadow(time, zone);
> +}
> +
> +/**
> + * workingset_refault - note the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * This primes page reclaim to rebalance the zone's file lists if
> + * necessary, so it must be called before a page frame for the
> + * refaulting page is allocated.
> + */
> +void workingset_refault(void *shadow)
> +{
> + unsigned long refault_distance;

Is the "refault distance" described somewhere? What relationship (if
any) does it have with workingset_time?

> + struct zone *zone;
> +
> + unpack_shadow(shadow, &zone, &refault_distance);
> +
> + inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> + /*
> + * Protected pages should be challenged when the refault
> + * distance indicates that thrashing could be stopped by
> + * increasing the inactive list at the cost of the active
> + * list.
> + */
> + if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> + inc_zone_state(zone, WORKINGSET_STALE);
> + zone->shrink_active++;
> + }
> +}
> +EXPORT_SYMBOL(workingset_refault);
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> + struct zone *zone = page_zone(page);
> +
> + /*
> + * The lists are rebalanced when the inactive list is observed
> + * to be too small for activations. An activation means that
> + * the inactive list is now big enough again for at least one
> + * page, so back off further deactivation.
> + */
> + atomic_long_inc(&zone->workingset_time);
> + if (zone->shrink_active > 0)
> + zone->shrink_active--;
> +}

Strange mixture of exports and non-exports. I assume you went with
"enough to make it build".

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/