Re: [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion_wait length

From: Lin Feng
Date: Thu Sep 19 2019 - 03:47:06 EST




On 9/19/19 11:49, Matthew Wilcox wrote:
On Thu, Sep 19, 2019 at 10:33:10AM +0800, Lin Feng wrote:
On 9/18/19 20:33, Michal Hocko wrote:
I absolutely agree here. From you changelog it is also not clear what is
the underlying problem. Both congestion_wait and wait_iff_congested
should wake up early if the congestion is handled. Is this not the case?

For now I don't know why, codes seem should work as you said, maybe I need to
trace more of the internals.
But weird thing is that once I set the people-disliked-tunable iowait
drop down instantly, this is contradictory to the code design.

Yes, this is quite strange. If setting a smaller timeout makes a
difference, that indicates we're not waking up soon enough. I see
two possibilities; one is that a wakeup is missing somewhere -- ie the
conditions under which we call clear_wb_congested() are wrong. Or we
need to wake up sooner.

Umm. We have clear_wb_congested() called from exactly one spot --
clear_bdi_congested(). That is only called from:

drivers/block/pktcdvd.c
fs/ceph/addr.c
fs/fuse/control.c
fs/fuse/dev.c
fs/nfs/write.c

Jens, is something supposed to be calling clear_bdi_congested() in the
block layer? blk_clear_congested() used to exist until October 29th
last year. Or is something else supposed to be waking up tasks that
are sleeping on congestion?


IIUC it looks like after commit a1ce35fa49852db60fc6e268038530be533c5b15,
besides those *.c places as you mentioned above, vmscan codes will always
wait as long as 100ms and nobody wakes them up.

here:
1964 while (unlikely(too_many_isolated(pgdat, file, sc))) {
1965 if (stalled)
1966 return 0;
1967
1968 /* wait a bit for the reclaimer. */
>1969 msleep(100);
1970 stalled = true;
1971
1972 /* We are about to die and free our memory. Return now. */
1973 if (fatal_signal_pending(current))
1974 return SWAP_CLUSTER_MAX;
1975 }

and here:
2784 /*
2785 * If kswapd scans pages marked marked for immediate
2786 * reclaim and under writeback (nr_immediate), it
2787 * implies that pages are cycling through the LRU
2788 * faster than they are written so also forcibly stall.
2789 */
2790 if (sc->nr.immediate)
>2791 congestion_wait(BLK_RW_ASYNC, HZ/10);
2792 }

except here, codes where set_bdi_congested will clear_bdi_congested at proper time,
exactly the source files you mentioned above, so it's OK.
2808 if (!sc->hibernation_mode && !current_is_kswapd() &&
2809 current_may_throttle() && pgdat_memcg_congested(pgdat, root))
2810 wait_iff_congested(BLK_RW_ASYNC, HZ/10);