Re: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily

From: Minchan Kim
Date: Wed Feb 17 2021 - 15:52:37 EST


On Wed, Feb 17, 2021 at 10:50:55AM +0100, Michal Hocko wrote:
> On Wed 17-02-21 09:59:54, Michal Hocko wrote:
> > On Tue 16-02-21 09:03:47, Minchan Kim wrote:
> [...]
> > > /*
> > > * migrate_prep() needs to be called before we start compiling a list of pages
> > > * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
> > > @@ -64,11 +80,27 @@
> > > */
> > > void migrate_prep(void)
> > > {
> > > + unsigned int cpu;
> > > +
> > > + spin_lock(&migrate_pending_lock);
> > > + migrate_pending_count++;
> > > + spin_unlock(&migrate_pending_lock);
> >
> > I suspect you do not want to add atomic_read inside hot paths, right? Is
> > this really something that we have to microoptimize for? atomic_read is
> > a simple READ_ONCE on many archs.
>
> Or you rather wanted to prevent from read memory barrier to enfore the
> ordering.

Yub.

>
> > > +
> > > + for_each_online_cpu(cpu) {
> > > + struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
> > > +
> > > + INIT_WORK(work, read_migrate_pending);
> > > + queue_work_on(cpu, mm_percpu_wq, work);
> > > + }
> > > +
> > > + for_each_online_cpu(cpu)
> > > + flush_work(&per_cpu(migrate_pending_work, cpu));
> >
> > I also do not follow this scheme. Where is the IPI you are mentioning
> > above?
>
> Thinking about it some more I think you mean the rescheduling IPI here?

True.

>
> > > + /*
> > > + * From now on, every online cpu will see uptodate
> > > + * migarte_pending_work.
> > > + */
> > > /*
> > > * Clear the LRU lists so pages can be isolated.
> > > - * Note that pages may be moved off the LRU after we have
> > > - * drained them. Those pages will fail to migrate like other
> > > - * pages that may be busy.
> > > */
> > > lru_add_drain_all();
> >
> > Overall, this looks rather heavy weight to my taste. Have you tried to
> > play with a simple atomic counter approach? atomic_read when adding to
> > the cache and atomic_inc inside migrate_prep followed by lrdu_add_drain.

I'd like to avoid atomic operation if we could.

>
> If you really want a strong ordering then it should be sufficient to
> simply alter lru_add_drain_all to force draining all CPUs. This will
> make sure no new pages are added to the pcp lists and you will also sync
> up anything that has accumulated because of a race between atomic_read
> and inc:
> diff --git a/mm/swap.c b/mm/swap.c
> index 2cca7141470c..91600d7bb7a8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -745,7 +745,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> * Calling this function with cpu hotplug locks held can actually lead
> * to obscure indirect dependencies via WQ context.
> */
> -void lru_add_drain_all(void)
> +void lru_add_drain_all(bool force_all_cpus)
> {
> /*
> * lru_drain_gen - Global pages generation number
> @@ -820,7 +820,8 @@ void lru_add_drain_all(void)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (force_all_cpus ||
> + pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

Yub, that's a idea.
How about this?

diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..2531642dd9ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,14 @@

#include "internal.h"

+static DEFINE_SPINLOCK(migrate_pending_lock);
+static unsigned long migrate_pending_count;
+
+bool migrate_pending(void)
+{
+ return migrate_pending_count;
+}
+
/*
* migrate_prep() needs to be called before we start compiling a list of pages
* to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,13 +72,20 @@
*/
void migrate_prep(void)
{
+ unsigned int cpu;
+
+ /*
+ * lru_add_drain_all's IPI will make sure no new pages are added
+ * to the pcp lists and drain them all.
+ */
+ spin_lock(&migrate_pending_lock);
+ migrate_pending_count++;
+ spin_unlock(&migrate_pending_lock);
+
/*
* Clear the LRU lists so pages can be isolated.
- * Note that pages may be moved off the LRU after we have
- * drained them. Those pages will fail to migrate like other
- * pages that may be busy.
*/
- lru_add_drain_all();
+ lru_add_drain_all(true);
}

/* Do the necessary work of migrate_prep but not if it involves other CPUs */
@@ -79,6 +94,15 @@ void migrate_prep_local(void)
lru_add_drain();
}

+void migrate_finish(void)
+{
+ int cpu;
+
+ spin_lock(&migrate_pending_lock);
+ migrate_pending_count--;
+ spin_unlock(&migrate_pending_lock);
+}

A odd here is there are no barrier for migrate_finish for
updating migrate_pending_count so other CPUs will see
stale value until they encounters the barrier by chance.
However, it wouldn't be a big deal, IMHO.

What do you think?