Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

From: Michal Hocko
Date: Tue Jul 26 2016 - 03:08:04 EST


On Mon 25-07-16 21:23:44, Michal Hocko wrote:
> [CC Marcelo who might remember other details for the loads which made
> him to add this code - see the patch changelog for more context]
>
> On Mon 25-07-16 10:32:47, Michal Hocko wrote:
[...]
> From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@xxxxxxxx>
> Date: Mon, 25 Jul 2016 14:18:54 +0200
> Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout
>
> throttle_vm_writeout has been introduced back in 2005 to fix OOMs caused
> by excessive pageout activity during the reclaim. Too many pages could
> be put under writeback therefore LRUs would be full of unreclaimable pages
> until the IO completes and in turn the OOM killer could be invoked.
>
> There have been some important changes introduced since then in the
> reclaim path though. Writers are throttled by balance_dirty_pages
> when initiating the buffered IO and later during the memory pressure,
> the direct reclaim is throttled by wait_iff_congested if the node is
> considered congested by dirty pages on LRUs and the underlying bdi
> is congested by the queued IO. The kswapd is throttled as well if it
> encounters pages marked for immediate reclaim or under writeback which
> signals that that there are too many pages under writeback already.
> Another important aspect is that we do not issue any IO from the direct
> reclaim context anymore. In a heavy parallel load this could queue a lot
> of IO which would be very scattered and thus unefficient which would
> just make the problem worse.

And I forgot another throttling point. should_reclaim_retry which is the
main logic to decide whether we go OOM or not has a congestion_wait if
there are too many dirty/writeback pages. That should give the IO
subsystem some time to finish the IO.

> This three mechanisms should throttle and keep the amount of IO in a
> steady state even under heavy IO and memory pressure so yet another
> throttling point doesn't really seem helpful. Quite contrary, Mikulas
> Patocka has reported that swap backed by dm-crypt doesn't work properly
> because the swapout IO cannot make sufficient progress as the writeout
> path depends on dm_crypt worker which has to allocate memory to perform
> the encryption. In order to guarantee a forward progress it relies
> on the mempool allocator. mempool_alloc(), however, prefers to use
> the underlying (usually page) allocator before it grabs objects from
> the pool. Such an allocation can dive into the memory reclaim and
> consequently to throttle_vm_writeout. If there are too many dirty or
> pages under writeback it will get throttled even though it is in fact a
> flusher to clear pending pages.
>
> [ 345.352536] kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
> [ 345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [ 345.352536] ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [ 345.352536] ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [ 345.352536] ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [ 345.352536] Call Trace:
> [ 345.352536] [<ffffffff818d466c>] schedule+0x3c/0x90
> [ 345.352536] [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [ 345.352536] [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [ 345.352536] [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [ 345.352536] [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [ 345.352536] [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [ 345.352536] [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [ 345.352536] [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [ 345.352536] [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [ 345.352536] [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [ 345.352536] [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [ 345.352536] [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [ 345.352536] [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [ 345.352536] [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [ 345.352536] [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [ 345.352536] [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [ 345.352536] [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [ 345.352536] [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [ 345.352536] [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [ 345.352536] [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [ 345.352536] [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536] [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [ 345.352536] [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536] [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [ 345.352536] [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [ 345.352536] [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [ 345.352536] [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [ 345.352536] [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [ 345.352536] [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [ 345.352536] [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
>
> Let's just drop throttle_vm_writeout altogether. It is not very much
> helpful anymore.
>
> I have tried to test a potential writeback IO runaway similar to the one
> described in the original patch which has introduced that [1]. Small
> virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
> rather slow NFS in a sync mode on the host) with 8 parallel writers each
> writing 1G worth of data. As soon as the pagecache fills up and the
> direct reclaim hits then I start anon memory consumer in a loop
> (allocating 300M and exiting after populating it) in the background
> to make the memory pressure even stronger as well as to disrupt the
> steady state for the IO. The direct reclaim is throttled because of the
> congestion as well as kswapd hitting congestion_wait due to nr_immediate
> but throttle_vm_writeout doesn't ever trigger the sleep throughout
> the test. Dirty+writeback are close to nr_dirty_threshold with some
> fluctuations caused by the anon consumer.
>
> [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
> Cc: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
> Reported-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
> include/linux/writeback.h | 1 -
> mm/page-writeback.c | 30 ------------------------------
> mm/vmscan.c | 2 --
> 3 files changed, 33 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 44b4422ae57f..f67a992cdf89 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -319,7 +319,6 @@ void laptop_mode_timer_fn(unsigned long data);
> #else
> static inline void laptop_sync_completion(void) { }
> #endif
> -void throttle_vm_writeout(gfp_t gfp_mask);
> bool node_dirty_ok(struct pglist_data *pgdat);
> int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b82303a9e67d..2828d6ca1451 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,36 +1962,6 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
> return false;
> }
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> -{
> - unsigned long background_thresh;
> - unsigned long dirty_thresh;
> -
> - for ( ; ; ) {
> - global_dirty_limits(&background_thresh, &dirty_thresh);
> - dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -
> - /*
> - * Boost the allowable dirty threshold a bit for page
> - * allocators so they don't get DoS'ed by heavy writers
> - */
> - dirty_thresh += dirty_thresh / 10; /* wheeee... */
> -
> - if (global_node_page_state(NR_UNSTABLE_NFS) +
> - global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
> - break;
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> - /*
> - * The caller might hold locks which can prevent IO completion
> - * or progress in the filesystem. So we cannot just sit here
> - * waiting for IO to complete.
> - */
> - if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
> - break;
> - }
> -}
> -
> /*
> * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
> */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0294ab34f475..0f35ed30e35b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2410,8 +2410,6 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> if (inactive_list_is_low(lruvec, false, sc))
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
> -
> - throttle_vm_writeout(sc->gfp_mask);
> }
>
> /* Use reclaim/compaction for costly allocs or under memory pressure */
> --
> 2.8.1
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs