Re: [PATCH 04/34] sched: add cpumask_find_and_set() and use it in __mm_cid_get()

From: Mathieu Desnoyers
Date: Mon Nov 20 2023 - 11:17:18 EST


On 2023-11-20 06:31, Peter Zijlstra wrote:
On Sat, Nov 18, 2023 at 07:50:35AM -0800, Yury Norov wrote:
__mm_cid_get() uses a __mm_cid_try_get() helper to atomically acquire a
bit in mm cid mask. Now that we have atomic find_and_set_bit(), we can
easily extend it to cpumasks and use in the scheduler code.

__mm_cid_try_get() has an infinite loop, which may delay forward
progress of __mm_cid_get() when the mask is dense. The
cpumask_find_and_set() doesn't poll the mask infinitely, and returns as
soon as nothing has found after the first iteration, allowing to acquire
the lock, and set use_cid_lock faster, if needed.

Methieu, I forgot again, but the comment delete seems to suggest you did
this on purpose...

See comments below.


cpumask_find_and_set() considers cid mask as a volatile region of memory,
as it actually is in this case. So, if it's changed while search is in
progress, KCSAN wouldn't fire warning on it.

Signed-off-by: Yury Norov <yury.norov@xxxxxxxxx>
---
include/linux/cpumask.h | 12 ++++++++++
kernel/sched/sched.h | 52 ++++++++++++-----------------------------
2 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..c2acced8be4e 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -271,6 +271,18 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
small_cpumask_bits, n + 1);
}
+/**
+ * cpumask_find_and_set - find the first unset cpu in a cpumask and
+ * set it atomically
+ * @srcp: the cpumask pointer
+ *
+ * Return: >= nr_cpu_ids if nothing is found.
+ */
+static inline unsigned int cpumask_find_and_set(volatile struct cpumask *srcp)
+{
+ return find_and_set_bit(cpumask_bits(srcp), small_cpumask_bits);
+}
+
/**
* for_each_cpu - iterate over every cpu in a mask
* @cpu: the (optionally unsigned) integer iterator
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..b2f095a9fc40 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3345,28 +3345,6 @@ static inline void mm_cid_put(struct mm_struct *mm)
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
-static inline int __mm_cid_try_get(struct mm_struct *mm)
-{
- struct cpumask *cpumask;
- int cid;
-
- cpumask = mm_cidmask(mm);
- /*
- * Retry finding first zero bit if the mask is temporarily
- * filled. This only happens during concurrent remote-clear
- * which owns a cid without holding a rq lock.
- */
- for (;;) {
- cid = cpumask_first_zero(cpumask);
- if (cid < nr_cpu_ids)
- break;
- cpu_relax();
- }
- if (cpumask_test_and_set_cpu(cid, cpumask))
- return -1;

This was split in find / test_and_set on purpose because following
patches I have (implementing numa-aware mm_cid) have a scan which
needs to scan sets of two cpumasks in parallel (with "and" and
and_not" operators).

Moreover, the "mask full" scenario only happens while a concurrent
remote-clear temporarily owns a cid without rq lock. See
sched_mm_cid_remote_clear():

/*
* The cid is unused, so it can be unset.
* Disable interrupts to keep the window of cid ownership without rq
* lock small.
*/
local_irq_save(flags);
if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
__mm_cid_put(mm, cid);
local_irq_restore(flags);

The proposed patch here turns this scenario into something heavier
(setting the use_cid_lock) rather than just retrying. I guess the
question to ask here is whether it is theoretically possible to cause
__mm_cid_try_get() to fail to have forward progress if we have a high
rate of sched_mm_cid_remote_clear. If we decide that this is indeed
a possible progress-failure scenario, then it makes sense to fallback
to use_cid_lock as soon as a full mask is encountered.

However, removing the __mm_cid_try_get() helper will make it harder to
integrate the following numa-awareness patches I have on top.

I am not against using cpumask_find_and_set, but can we keep the
__mm_cid_try_get() helper to facilitate integration of future work ?
We just have to make it use cpumask_find_and_set, which should be
easy.

- return cid;
-}
-
/*
* Save a snapshot of the current runqueue time of this cpu
* with the per-cpu cid value, allowing to estimate how recently it was used.
@@ -3381,25 +3359,25 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
{
+ struct cpumask *cpumask = mm_cidmask(mm);
int cid;
- /*
- * All allocations (even those using the cid_lock) are lock-free. If
- * use_cid_lock is set, hold the cid_lock to perform cid allocation to
- * guarantee forward progress.
- */
+ /* All allocations (even those using the cid_lock) are lock-free. */
if (!READ_ONCE(use_cid_lock)) {
- cid = __mm_cid_try_get(mm);
- if (cid >= 0)
+ cid = cpumask_find_and_set(cpumask);
+ if (cid < nr_cpu_ids)
goto end;
- raw_spin_lock(&cid_lock);
- } else {
- raw_spin_lock(&cid_lock);
- cid = __mm_cid_try_get(mm);
- if (cid >= 0)
- goto unlock;
}
+ /*
+ * If use_cid_lock is set, hold the cid_lock to perform cid
+ * allocation to guarantee forward progress.
+ */
+ raw_spin_lock(&cid_lock);
+ cid = cpumask_find_and_set(cpumask);
+ if (cid < nr_cpu_ids)
+ goto unlock;

In the !use_cid_lock case where we already failed a lookup above, this change
ends up doing another attempt at lookup before setting the use_cid_lock and
attempting again until success. I am not sure what is the motivation for changing
the code flow here ?

General comment about the rest of the series: please review code comments for
typos.

Thanks,

Mathieu

+
/*
* cid concurrently allocated. Retry while forcing following
* allocations to use the cid_lock to ensure forward progress.
@@ -3415,9 +3393,9 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
* all newcoming allocations observe the use_cid_lock flag set.
*/
do {
- cid = __mm_cid_try_get(mm);
+ cid = cpumask_find_and_set(cpumask);
cpu_relax();
- } while (cid < 0);
+ } while (cid >= nr_cpu_ids);
/*
* Allocate before clearing use_cid_lock. Only care about
* program order because this is for forward progress.
--
2.39.2


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com