[PATCH] mm: call cond_resched right before failing compaction

From: Minchan Kim
Date: Thu Jun 12 2014 - 21:59:26 EST


David reported that in many cases direct compaction for a THP page
fault fails because the async compaction is aborted by need_resched.
That in itself is okay, since THP can fall back to 4K pages, but if
need_resched is true we should give the next process a chance to
schedule in, for the sake of latency, so that we are not greedy any more.

Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
mm/page_alloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..1ac5133 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,8 +2617,16 @@ rebalance:
* system then fail the allocation instead of entering direct reclaim.
*/
if ((deferred_compaction || contended_compaction) &&
- (gfp_mask & __GFP_NO_KSWAPD))
+ (gfp_mask & __GFP_NO_KSWAPD)) {
+ /*
+ * When a THP page fault occurs on a large memory system,
+ * contended_compaction is likely true due to the need_resched
+ * check, so schedule right before returning the NULL page.
+ * That way we are not greedy!
+ */
+ cond_resched();
goto nopage;
+ }

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
--
2.0.0

With your change (i.e., direct compaction is only aware of lock contention,
not need_resched), when a THP page fault occurs and async direct compaction
hits need_resched, the allocation takes the *direct reclaim path* and then
async direct compaction again, rather than going straight to "nopage", and
only then finally fails. I think you are changing the behavior heavily in a
way that increases latency, which is not what the page fault path wants,
even though I have no data.
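
To make the two orderings concrete, here is a throwaway userspace
sketch, not kernel code: it only mirrors the __alloc_pages_slowpath
decision I am describing, with the compaction abort reason hardcoded.

#include <stdbool.h>
#include <stdio.h>

/* Model of a __GFP_NO_KSWAPD (THP fault) allocation whose async
 * compaction was aborted by need_resched(). */
static void thp_fault_slowpath(bool resched_counts_as_contended)
{
	bool contended_compaction = resched_counts_as_contended;

	if (contended_compaction) {
		printf("-> nopage: fall back to 4K page quickly\n");
		return;
	}
	printf("-> direct reclaim\n");
	printf("-> async direct compaction again\n");
	printf("-> nopage: fall back to 4K page, after extra latency\n");
}

int main(void)
{
	puts("need_resched() treated as contention (current):");
	thp_fault_slowpath(true);
	puts("need_resched() ignored (your change):");
	thp_fault_slowpath(false);
	return 0;
}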

So, what I want is as follows.
It is based on the previous inline patch.

---
mm/page_alloc.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ac5133..8a4480e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2624,8 +2624,17 @@ rebalance:
* check, so schedule right before returning the NULL page.
* That way we are not greedy!
*/
- cond_resched();
- goto nopage;
+ int ret = cond_resched();
+
+ /* On a THP page fault we want to bail out, for latency's sake */
+ if (!(current->flags & PF_KTHREAD) || !ret)
+ goto nopage;
+
+ /*
+ * I'm khugepaged and I took a rest, so I want to retry with
+ * synchronous compaction rather than giving up easily.
+ */
+ WARN_ON(migration_mode == MIGRATE_ASYNC);
}

/* Try direct reclaim and then allocating */
--
2.0.0
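
To spell out the decision the hunk above encodes, here is a tiny
userspace sketch; PF_KTHREAD and cond_resched()'s return value are
faked as plain booleans, so this is illustration only.

#include <stdbool.h>
#include <stdio.h>

/* Mirrors: if (!(current->flags & PF_KTHREAD) || !ret) goto nopage; */
static const char *decide(bool is_kthread, bool rescheduled)
{
	if (!is_kthread || !rescheduled)
		return "nopage: fall back to 4K page";
	return "fall through: direct reclaim, then non-async compaction";
}

int main(void)
{
	printf("THP page fault          : %s\n", decide(false, true));
	printf("khugepaged, no resched  : %s\n", decide(true, false));
	printf("khugepaged, rescheduled : %s\n", decide(true, true));
	return 0;
}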

I'm off from now on. :)

>
> >I don't mean we should abort but the process could sleep and retry.
> >The point is that we should give the latency pain to the process requesting
> >the high-order allocation, not to another random process.
>
> So basically you are saying that there should be cond_resched() also
> for async compaction when need_resched() is true? Now need_resched()
> is a trigger to back off rather quickly all the way back to
> __alloc_pages_direct_compact() which does contain a cond_resched().
> So there should be a yield before retry. Or are you worried that the
> back off is not quick enough and it should cond_resched()
> immediately?
>
> >IMHO, if we want to increase the high-order alloc ratio at page fault,
> >kswapd should be more aggressive than it is now, via a feedback loop
> >from the failure rate of direct compaction.
>
> Recently I think we have been rather decreasing the high-order alloc
> ratio in page faults :) But (at least for THP) page fault
> allocation attempts contain __GFP_NO_KSWAPD, so there's no feedback
> loop. I guess changing that would be rather disruptive.
>
> >>
> >>>We have taken care of it in the direct reclaim path, so why is direct
> >>>compaction so special?
> >>
> >>I admit I'm not that familiar with reclaim but I didn't quickly find
> >>any need_resched() there? There's plenty of cond_resched() but that
> >>doesn't mean it will abort? Could you explain it for me?
> >
> >I meant cond_resched.
> >
> >>
> >>>Why does khugepaged give up easily if lock contention/need_resched happens?
> >>>khugepaged is important for the success ratio, as I read in your
> >>>description, so IMO khugepaged should compact synchronously, without
> >>>considering an early bail-out due to lock contention/rescheduling.
> >>
> >>Well a stupid answer is that's how __alloc_pages_slowpath() works :)
> >>I don't think it's bad to first try a more lightweight
> >>approach before the heavyweight one. As long as the
> >>heavyweight one is not skipped for khugepaged.
> >
> >I'm not saying the current two-stage trying is bad. My stance is that we
> >should take care of need_resched and shouldn't be greedy, but khugepaged
> >would be okay.
> >
> >>
> >>>If it causes problems, the user should increase
> >>>scan_sleep_millisecs/alloc_sleep_millisecs, which are exactly the knobs for such cases.
> >>>
> >>>So, my point is: how about making khugepaged always do dumb synchronous
> >>>compaction through PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
> >>>
> >>>>
> >>>>Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
> >>>>Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> >>>>Cc: Minchan Kim <minchan@xxxxxxxxxx>
> >>>>Cc: Mel Gorman <mgorman@xxxxxxx>
> >>>>Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >>>>Cc: Michal Nazarewicz <mina86@xxxxxxxxxx>
> >>>>Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> >>>>Cc: Christoph Lameter <cl@xxxxxxxxx>
> >>>>Cc: Rik van Riel <riel@xxxxxxxxxx>
> >>>>---
> >>>> mm/compaction.c | 20 ++++++++++++++------
> >>>> mm/internal.h | 15 +++++++++++----
> >>>> 2 files changed, 25 insertions(+), 10 deletions(-)
> >>>>
> >>>>diff --git a/mm/compaction.c b/mm/compaction.c
> >>>>index b73b182..d37f4a8 100644
> >>>>--- a/mm/compaction.c
> >>>>+++ b/mm/compaction.c
> >>>>@@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> >>>> }
> >>>> #endif /* CONFIG_COMPACTION */
> >>>>
> >>>>-static inline bool should_release_lock(spinlock_t *lock)
> >>>>+enum compact_contended should_release_lock(spinlock_t *lock)
> >>>> {
> >>>>- return need_resched() || spin_is_contended(lock);
> >>>>+ if (need_resched())
> >>>>+ return COMPACT_CONTENDED_SCHED;
> >>>>+ else if (spin_is_contended(lock))
> >>>>+ return COMPACT_CONTENDED_LOCK;
> >>>>+ else
> >>>>+ return COMPACT_CONTENDED_NONE;
> >>>> }
> >>>>
> >>>> /*
> >>>>@@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> >>>> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>> bool locked, struct compact_control *cc)
> >>>> {
> >>>>- if (should_release_lock(lock)) {
> >>>>+ enum compact_contended contended = should_release_lock(lock);
> >>>>+
> >>>>+ if (contended) {
> >>>> if (locked) {
> >>>> spin_unlock_irqrestore(lock, *flags);
> >>>> locked = false;
> >>>>@@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>>
> >>>> /* async aborts if taking too long or contended */
> >>>> if (cc->mode == MIGRATE_ASYNC) {
> >>>>- cc->contended = true;
> >>>>+ cc->contended = contended;
> >>>> return false;
> >>>> }
> >>>>
> >>>>@@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> >>>> /* async compaction aborts if contended */
> >>>> if (need_resched()) {
> >>>> if (cc->mode == MIGRATE_ASYNC) {
> >>>>- cc->contended = true;
> >>>>+ cc->contended = COMPACT_CONTENDED_SCHED;
> >>>> return true;
> >>>> }
> >>>>
> >>>>@@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> >>>> VM_BUG_ON(!list_empty(&cc.freepages));
> >>>> VM_BUG_ON(!list_empty(&cc.migratepages));
> >>>>
> >>>>- *contended = cc.contended;
> >>>>+ /* We only signal lock contention back to the allocator */
> >>>>+ *contended = cc.contended == COMPACT_CONTENDED_LOCK;
> >>>> return ret;
> >>>> }
> >>>>
> >>>>diff --git a/mm/internal.h b/mm/internal.h
> >>>>index 7f22a11f..4659e8e 100644
> >>>>--- a/mm/internal.h
> >>>>+++ b/mm/internal.h
> >>>>@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
> >>>>
> >>>> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> >>>>
> >>>>+/* Used to signal whether compaction detected need_resched() or lock contention */
> >>>>+enum compact_contended {
> >>>>+ COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> >>>>+ COMPACT_CONTENDED_SCHED, /* need_resched() was true */
> >>>>+ COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
> >>>>+};
> >>>>+
> >>>> /*
> >>>> * in mm/compaction.c
> >>>> */
> >>>>@@ -144,10 +151,10 @@ struct compact_control {
> >>>> int order; /* order a direct compactor needs */
> >>>> int migratetype; /* MOVABLE, RECLAIMABLE etc */
> >>>> struct zone *zone;
> >>>>- bool contended; /* True if a lock was contended, or
> >>>>- * need_resched() true during async
> >>>>- * compaction
> >>>>- */
> >>>>+ enum compact_contended contended; /* Signal need_resched() or lock
> >>>>+ * contention detected during
> >>>>+ * compaction
> >>>>+ */
> >>>> };
> >>>>
> >>>> unsigned long
> >>>>--
> >>>>1.8.4.5

--
Kind regards,
Minchan Kim