Re: [PATCH v34 00/13] Introduce Data Access MONitor (DAMON)

From: SeongJae Park
Date: Mon Aug 09 2021 - 10:07:25 EST


From: SeongJae Park <sjpark@xxxxxxxxx>

On Fri, 6 Aug 2021 11:48:01 +0000 SeongJae Park <sj38.park@xxxxxxxxx> wrote:

> From: SeongJae Park <sjpark@xxxxxxxxx>
>
> On Thu, 5 Aug 2021 17:03:44 -0700 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
[...]
> >
> > I would like to see more thought/design go into how DAMON could be
> > modified to address Shakeel's other three requirements. At least to
> > the point where we can confidently say "yes, we will be able to do
> > this". Are you able to drive this discussion along please?
>
> Sure. I will describe my plan for convincing Shakeel's usages in detail as a
> reply to this mail.

Shakeel, below I explain how DAMON will be extended and how it can be used for
your use cases. If you have any doubts or questions, please feel free to let
me know.

What information DAMON provides (and will provide): contiguity, frequency, and recency
---------------------------------------------------------------------------------------

DAMON in this patchset informs users how frequently each memory region is
accessed. A memory region here is a set of contiguous pages having a similar
access frequency. In addition to this, a following patch[1] will make DAMON
track for how long each region has maintained its size and access frequency;
we call this the 'age' of the region. That is, DAMON will be extended to
provide three attributes of data access patterns: contiguity (the size of each
region), frequency, and recency.
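
For illustration, the monitoring result for each region can roughly be thought
of as the struct below. This is only a simplified sketch with illustrative
field names, not the exact struct of the patchset:

    /*
     * Simplified sketch of the per-region monitoring result described
     * above.  Field names are illustrative; the real struct in the
     * patchset differs in details.
     */
    struct region_sketch {
            unsigned long start_address;   /* contiguity: [start, end) */
            unsigned long end_address;
            unsigned int access_frequency; /* sampled access frequency */
            unsigned int age;              /* how long size/frequency stayed stable */
    };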

Physical Address Space support
------------------------------

This version of DAMON supports only the virtual address spaces of processes,
but it will be extended to support the physical address space[2]. The
extension will be quite simple because DAMON's monitoring primitives layer is
separated from its core logic.
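
In other words, reusing the same core for the physical address space will
mostly be a matter of plugging a different primitives implementation into a
context. A rough sketch (damon_pa_set_primitives() here is the planned
primitives setter of [2], not part of this patchset):

    /*
     * Rough sketch: the core context stays the same; only the
     * primitives implementation changes per target address space.
     */
    struct damon_ctx *ctx = damon_new_ctx();

    damon_va_set_primitives(ctx);    /* virtual address spaces (this patchset) */
    /* damon_pa_set_primitives(ctx);    physical address space (planned[2]) */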

How DAMON can be used for Shakeel's usages
------------------------------------------

The usages described in Shakeel's prior mail[1] are:

1) Working set estimation: This is used for cluster level scheduling
and controlling the knobs of memory overcommit.

2) Proactive reclaim

3) Balancing between memory tiers: Moving hot pages to fast tiers and
cold pages to slow tiers

4) Hugepage optimization: Hot memory backed by hugepages

In addition, these uses are not happening in isolation. We want a
combination of these running concurrently on a system. So, it is clear
that the first version or step of DAMON which only targets virtual
address space monitoring is not sufficient for these use-cases.

DAMON can satisfy all of these usages, as described below.

- Working set estimation: This can be done by iterating over the regions and
checking whether the access frequency of each is higher than a threshold. Our
user space tool provides an implementation[3] of this. Below is pseudo-code
for it:

    def measure_working_set(regions, threshold):
        workingsets = []
        working_set_size = 0
        for region in regions:
            if region.access_frequency > threshold:
                workingsets.append(region)
                working_set_size += region.end_address - region.start_address
        return workingsets, working_set_size

- Proactive reclaim: This can be done by iterating over the regions, checking
whether each has zero access frequency and an age higher than a time
threshold, and reclaiming those that do. We implemented this as a kernel
module in only 354 lines of code[4]. Below is pseudo-code for it:

    for region in regions:
        if region.access_frequency == 0 and region.age > threshold:
            reclaim(region)

- Balancing between memory tiers: Because DAMON provides the access frequency,
we can identify not only idle memory regions but also cold/cool/warm/hot
regions. Once the functions for migrating pages from one tier to another have
matured, applying DAMON to this usage will be quite straightforward. That is,
for each region, if its access frequency and age are higher than thresholds,
migrate the pages in the region to a faster tier. If its access frequency is
lower than a threshold and its age is higher than a threshold, migrate the
pages in the region to a slower tier. Below is pseudo-code for it:

    for region in regions:
        if region.age > age_threshold:
            if region.access_frequency > hot_threshold:
                migrate_to_fast_tier(region)
            elif region.access_frequency < cold_threshold:
                migrate_to_slow_tier(region)

- Hugepage optimization: This will be quite similar to the tier balancing, but
we can additionally use the size of the regions. That is, we first monitor the
virtual address spaces. Then, for each region, if its access frequency, age,
and size are all higher than thresholds (the size threshold would be 2MB), we
make the region backed by huge pages. If the age and size are higher than the
thresholds but the access frequency is lower than a threshold, we make the
huge pages of the region backed by regular pages again. We evaluated this idea
with a prototype[5]. It removed 76.15% of the THP memory overhead while
preserving 51.25% of the THP speedup. Below is pseudo-code for it:

    for region in regions:
        if region.age > age_threshold and region.size >= 2 * MB:
            if region.access_frequency > hot_threshold:
                use_thps_for(region)
            elif region.access_frequency < cold_threshold:
                use_regular_pages_for(region)

- Combination of these running concurrently: DAMON will be extended to be able
to monitor both the physical address space and virtual address spaces
simultaneously, like below.

    struct damon_ctx *ctx_for_virt = damon_new_ctx();
    struct damon_ctx *ctx_for_phys = damon_new_ctx();
    struct damon_ctx *ctxs[] = {ctx_for_virt, ctx_for_phys};
    [...]
    /* first context for virtual address spaces monitoring */
    damon_va_set_primitives(ctx_for_virt);
    /* second context for physical address space monitoring */
    damon_pa_set_primitives(ctx_for_phys);
    damon_start(ctxs, 2);

Extending for page-granularity monitoring
-----------------------------------------

To my understanding, Shakeel wants to do the above with page-granularity
monitoring. It will inevitably incur high overhead, but for those who can
afford the cost, I will make DAMON support it, as described below.

Even with DAMON of this patchset, users can do page-granularity monitoring by
simply setting DAMON's 'min_nr_regions' and 'max_nr_regions' to the number of
pages in the target address space (nr_pages). Nevertheless, this results in
the creation of 'nr_pages' region structs. Assuming 4K pages and about 44
bytes per region struct, that means roughly 1% memory overhead (44 / 4096 ~=
1.07%). Our plan for removing even this overhead is as below.
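
For reference, below is a rough sketch of such a configuration. It assumes the
patchset's damon_set_attrs() helper takes the sampling/aggregation/primitives
update intervals followed by the minimum and maximum numbers of regions;
please refer to the patchset for the exact interface:

    /*
     * Rough sketch: pin the number of regions to the number of pages so
     * that each region covers exactly one page.  The interval values and
     * the damon_set_attrs() parameter order are assumptions for
     * illustration only.
     */
    struct damon_ctx *ctx = damon_new_ctx();
    unsigned long nr_pages = target_size / PAGE_SIZE;

    damon_set_attrs(ctx, sample_interval, aggr_interval, update_interval,
                    nr_pages, nr_pages);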

In the future, the regions abstraction will be able to be entirely opted
out[6]. In that case, no region structs will be allocated, so the memory
overhead will be zero. However, the user will then be required to configure
DAMON to use a special monitoring primitive that saves the monitoring results,
such as the access frequency and age, somewhere else, for example in its own
data structure or in page flags, as the multi-gen LRU patchset does. If such a
data structure turns out to be commonly usable, we can extend the DAMON core
to support it. To show how this will work, we implemented a page-granularity
idleness monitoring primitive in only 69 lines of code[6].

Also, if someone has ideas for reducing the page granularity monitoring
overhead, we can put the optimization in the monitoring primitives layer, which
is separated from the core logic.

[1] https://lore.kernel.org/linux-mm/20201216084404.23183-2-sjpark@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@xxxxxxxxxx/
[3] https://github.com/awslabs/damo/blob/master/wss.py
[4] https://lore.kernel.org/linux-mm/20210720131309.22073-15-sj38.park@xxxxxxxxx/
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html#efficient-thp
[6] https://github.com/sjp38/linux/commit/9e0cb168d30e
[7] https://lore.kernel.org/linux-mm/20201216094221.11898-14-sjpark@xxxxxxxxxx/


Thanks,
SeongJae Park