Re: [PATCH -tip V3 0/8] workqueue: break affinity initiatively

From: Paul E. McKenney
Date: Mon Jan 11 2021 - 10:01:02 EST


On Mon, Jan 11, 2021 at 12:01:03PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 11, 2021 at 11:07:34AM +0100, Thomas Gleixner wrote:
> > On Fri, Jan 08 2021 at 12:46, Peter Zijlstra wrote:
> > > On Sat, Dec 26, 2020 at 10:51:08AM +0800, Lai Jiangshan wrote:
> > >> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> > >>
> > >> 06249738a41a ("workqueue: Manually break affinity on hotplug")
> > >> said that the scheduler will not force-break affinity for us.
> > >
> > > So I've been looking at this the past day or so, and the more I look,
> > > the more I think commit:
> > >
> > > 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
> > >
> > > is a real problem and we need to revert it (at least for now).
> > >
> > > Let me attempt a brain dump:
> > >
> > > - the assumption that per-cpu kernel threads are 'well behaved' on
> > >   hot-plug has, I think, been proven incorrect; it is far worse than
> > >   just the bound workqueues. Therefore, it makes sense to provide the
> > >   old semantics.
> >
> > I disagree. Per-cpu kernel threads which are magically stopped during
> > hotplug and then migrated to a random other CPU are just wrong.
> >
> > We really need to fix that and not proliferate the sloppy and
> > ill-defined behaviour.
>
> Well yes, but afaict the workqueue stuff hasn't been settled yet, the
> rcutorture patch Paul did was just plain racy, and who knows what
> other daft kthread users are out there. That, and we're at -rc3.

Here is what I currently have, which passed 40 hours of each flavor of
rcutorture last night (with Lai Jiangshan's patches).

Except that I cannot explain why the added "onoff_task &&
num_online_cpus() <= 2" check is needed, just that rcutorture does not
pass without it. So either I am blind or something else is also messed
up. (Why did I think to add it? Because experimenting with suppressing
individual rcutorture options pointed in that direction.)

Also, adding that condition to that "if" statement removes CPU
shuffling entirely from two-CPU runs, which is at best suboptimal.
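
In case it helps, here is the gating logic of that "if" pulled out into
a standalone helper (an illustrative sketch only; the real check is
open-coded in torture_shuffle_tasks() in the patch below):

	/* Sketch: may torture_shuffle_tasks() shuffle right now? */
	static bool shuffle_permitted(int nr_online, bool onoff_running)
	{
		if (nr_online == 1)
			return false;	/* No other CPU to shuffle away to. */
		if (onoff_running && nr_online <= 2)
			return false;	/* The unexplained extra condition. */
		return true;
	}

In other words, whenever the onoff task is running, shuffling now
requires at least three online CPUs, which is what removes shuffling
from two-CPU runs.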

> So I'm really tempted to revert for now and try again later.

I am not feeling at all good about this being in mainline quite yet. I do
agree that the semantics might be in some sense cleaner, but let's face it,
rcutorture is a very focused test. There are large parts of the kernel
that it does not stress at all, so there could be any number of other bugs
involving other parts of the kernel. It would be good to have a rationale
for why this is safe beforehand, rather than random experiment-driven
changes like the added "if" condition in the patch below.

Thanx, Paul

------------------------------------------------------------------------
commit dd0d40aecf506ed6982d7c98f48b4327d7d59485
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date: Wed Dec 23 11:23:48 2020 -0800

torture: Break affinity of kthreads last running on outgoing CPU

The advent of commit 06249738a41a ("workqueue: Manually break affinity
on hotplug") means that the scheduler no longer silently breaks affinity
for kthreads pinned to the outgoing CPU. Such pinning can happen to many
of rcutorture's kthreads due to shuffling, which periodically affinities
these kthreads away from a randomly chosen CPU. This usually works fine
because these kthreads are allowed to run on any other CPU and because
shuffling is a no-op any time there is but one online CPU.

However, consider the following sequence of events:

1.	CPUs 0 and 1 are initially online.

2.	The torture_shuffle_tasks() function affinities all the tasks
	away from CPU 0.

3.	CPU 1 goes offline.

4.	All the tasks are now affinitied to an offline CPU, triggering
	the warning added by the commit noted above.

This can trigger the following BUG_ON() in sched_cpu_dying() in
kernel/sched/core.c, which expects the outgoing CPU's runqueue to have
been fully vacated by that point:

BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq))
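
To make the failure mode concrete, here is a trivial userspace model of
the four-step sequence above (illustrative only, not kernel code; the
final check mimics the BUG_ON() condition):

	#include <stdbool.h>
	#include <stdio.h>

	#define NCPUS 2

	int main(void)
	{
		bool online[NCPUS] = { true, true };	/* Step 1. */
		unsigned int task_mask = 0x2;	/* Step 2: tasks allowed on CPU 1 only. */
		unsigned int online_mask = 0;
		int cpu;

		online[1] = false;	/* Step 3: CPU 1 goes offline. */

		for (cpu = 0; cpu < NCPUS; cpu++)
			if (online[cpu])
				online_mask |= 1U << cpu;

		/* Step 4: no online CPU remains in the tasks' affinity mask. */
		if (!(task_mask & online_mask))
			printf("tasks pinned to an offline CPU: BUG_ON() territory\n");
		return 0;
	}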

This commit therefore adds a new torture_shuffle_tasks_offline() function
that is invoked from torture_offline() prior to offlining a CPU. This new
function scans the list of shuffled kthreads, and for any kthread that
last ran (or is set to run) on the outgoing CPU, resets its affinity to
the full set of present CPUs (cpu_present_mask). Thus there will never
be a kthread that is affinitied only to the outgoing CPU.

Of course, if the sysadm manually applies affinity to any of these
kthreads, all bets are off. However, such a sysadm must be fast because
the torture_shuffle_tasks_offline() function is invoked immediately before
offlining the outgoing CPU. Therefore, let it be known that with great
speed and great power comes great responsibility.

Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
[ paulmck: Applied synchronization feedback from Peter Zijlstra. ]
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/torture.c b/kernel/torture.c
index 01e336f..9208a03 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -65,6 +65,7 @@ static int fullstop = FULLSTOP_RMMOD;
 static DEFINE_MUTEX(fullstop_mutex);
 
 static atomic_t verbose_sleep_counter;
+static DEFINE_MUTEX(shuffle_task_mutex);
 
 /*
  * Sleep if needed from VERBOSE_TOROUT*().
@@ -153,14 +154,17 @@ int torture_hrtimeout_s(u32 baset_s, u32 fuzzt_ms, struct torture_random_state *
 }
 EXPORT_SYMBOL_GPL(torture_hrtimeout_s);
 
+static struct task_struct *onoff_task;
+
 #ifdef CONFIG_HOTPLUG_CPU
 
+static void torture_shuffle_tasks_offline(int cpu);
+
 /*
  * Variables for online-offline handling.  Only present if CPU hotplug
  * is enabled, otherwise does nothing.
  */
 
-static struct task_struct *onoff_task;
 static long onoff_holdoff;
 static long onoff_interval;
 static torture_ofl_func *onoff_f;
@@ -212,6 +216,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
			 torture_type, cpu);
	starttime = jiffies;
	(*n_offl_attempts)++;
+	mutex_lock(&shuffle_task_mutex);
+	torture_shuffle_tasks_offline(cpu);
	ret = remove_cpu(cpu);
	if (ret) {
		s = "";
@@ -245,6 +251,7 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
		WRITE_ONCE(torture_online_cpus, torture_online_cpus - 1);
		WARN_ON_ONCE(torture_online_cpus <= 0);
	}
+	mutex_unlock(&shuffle_task_mutex);
 
	return true;
 }
@@ -474,7 +481,6 @@ static struct task_struct *shuffler_task;
 static cpumask_var_t shuffle_tmp_mask;
 static int shuffle_idle_cpu;	/* Force all torture tasks off this CPU */
 static struct list_head shuffle_task_list = LIST_HEAD_INIT(shuffle_task_list);
-static DEFINE_MUTEX(shuffle_task_mutex);
 
 /*
  * Register a task to be shuffled.  If there is no memory, just splat
@@ -512,6 +518,19 @@ static void torture_shuffle_task_unregister_all(void)
	mutex_unlock(&shuffle_task_mutex);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+// Unbind all tasks from a CPU that is to be taken offline.
+static void torture_shuffle_tasks_offline(int cpu)
+{
+	struct shuffle_task *stp;
+
+	lockdep_assert_held(&shuffle_task_mutex);
+	list_for_each_entry(stp, &shuffle_task_list, st_l)
+		if (task_cpu(stp->st_t) == cpu)
+			set_cpus_allowed_ptr(stp->st_t, cpu_present_mask);
+}
+#endif // #ifdef CONFIG_HOTPLUG_CPU
+
 /* Shuffle tasks such that we allow shuffle_idle_cpu to become idle.
  * A special case is when shuffle_idle_cpu = -1, in which case we allow
  * the tasks to run on all CPUs.
@@ -521,11 +540,13 @@ static void torture_shuffle_tasks(void)
	struct shuffle_task *stp;
 
	cpumask_setall(shuffle_tmp_mask);
-	get_online_cpus();
+	mutex_lock(&shuffle_task_mutex);
+	cpus_read_lock();
 
	/* No point in shuffling if there is only one online CPU (ex: UP) */
-	if (num_online_cpus() == 1) {
-		put_online_cpus();
+	if (num_online_cpus() == 1 || (onoff_task && num_online_cpus() <= 2)) {
+		cpus_read_unlock();
+		mutex_unlock(&shuffle_task_mutex);
		return;
	}
 
@@ -536,12 +557,11 @@ static void torture_shuffle_tasks(void)
	else
		cpumask_clear_cpu(shuffle_idle_cpu, shuffle_tmp_mask);
 
-	mutex_lock(&shuffle_task_mutex);
	list_for_each_entry(stp, &shuffle_task_list, st_l)
		set_cpus_allowed_ptr(stp->st_t, shuffle_tmp_mask);
-	mutex_unlock(&shuffle_task_mutex);
 
-	put_online_cpus();
+	cpus_read_unlock();
+	mutex_unlock(&shuffle_task_mutex);
 }
 
 /* Shuffle tasks across CPUs, with the intent of allowing each CPU in the