Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock

From: Peter Zijlstra
Date: Tue May 20 2014 - 09:15:39 EST


On Mon, May 19, 2014 at 10:30:05AM -0700, bsegall@xxxxxxxxxx wrote:
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>
> > On Fri, May 16, 2014 at 12:38:21PM +0400, Roman Gushchin wrote:
> >
> >> I still think, there is a deadlock. I'll try to explain.
> >> Three CPUs must be involved:
> >> CPU0                        CPU1                        CPU2
> >> take rq->lock               period timer fired
> >> ...                         take cfs_b lock
> >> ...                         ...                         tg_set_cfs_bandwidth()
> >> throttle_cfs_rq()           release cfs_b lock          take cfs_b lock
> >> ...                         distribute_cfs_runtime()    timer_active = 0
> >> take cfs_b->lock            wait for rq->lock           ...
> >> __start_cfs_bandwidth()
> >> {wait for timer callback
> >>  break if timer_active == 1}
> >>
> >> So, CPU0 and CPU1 are deadlocked.
> >
> > OK, so can someone explain this ->timer_active thing? esp. what's the
> > 'obvious' difference with hrtimer_active()?
>
> timer_active is only changed or checked under cfs_b lock, so that we can
> be certain that the timer is active whenever runtime isn't full. We
> can't use hrtimer_active because the timer might have unlocked the cfs_b
> on another cpu and be about to return HRTIMER_NORESTART.

The more I look at that code the more I'm convinced it's crack; that
entire __start_cfs_bandwidth() thing is brain melting. We don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if it's still enqueued.

Removing that removes a big part of the problem: no more ugly cancel
loop to get stuck in.
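
To be concrete: on a still-queued timer, __hrtimer_start_range_ns() does
the remove_hrtimer() for us before enqueueing it again, so a restart is
never more than a forward plus a start. Rough sketch of that (the helper
name is made up, this is not part of the patch below):

#include <linux/hrtimer.h>

/* Sketch only: re-arm a periodic hrtimer without any cancel loop. */
static void restart_period_timer(struct hrtimer *timer, ktime_t period)
{
	/* push the expiry past 'now' by however many periods elapsed */
	hrtimer_forward_now(timer, period);
	/* (re)arm; a still-queued timer is simply dequeued and re-added */
	hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
}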

So now, if I understood you right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee the same by
unconditionally (re)starting the timer when it is !queued, because
regardless of what hrtimer::function returns, if we beat it to
(re)enqueueing the timer, it doesn't matter.
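
Something along these lines, under cfs_b->lock (sketch only, the helper
name is made up; the real version is what start_bandwidth_timer()
becomes in the patch below):

/* Sketch: caller holds cfs_b->lock. */
static int ensure_period_timer(struct cfs_bandwidth *cfs_b)
{
	int overruns = 0;

	if (!hrtimer_is_queued(&cfs_b->period_timer)) {
		/* not queued: either never started or already expired */
		overruns = hrtimer_forward_now(&cfs_b->period_timer,
					       cfs_b->period);
		hrtimer_start_expires(&cfs_b->period_timer,
				      HRTIMER_MODE_ABS_PINNED);
	}

	/* non-zero means at least one whole period passed us by */
	return overruns;
}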

Which leads us to what I think is a BUG in the current hrtimer code (and
one wonders why we never hit it): because we drop the cpu_base->lock
over calling hrtimer::function, hrtimer_start_range_ns() can in fact
come in and (re)enqueue the timer; if hrtimer::function then returns
HRTIMER_RESTART, we'll hit that BUG_ON() before trying to enqueue the
timer once more.
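
Spelled out in the same two-column style as above (hypothetical
interleaving):

CPU0 (__run_hrtimer)                    CPU1
drop cpu_base->lock
restart = fn(timer)                     hrtimer_start_range_ns(timer, ...)
  (still running)                         take cpu_base->lock
                                          enqueue_hrtimer()  <- queued again
                                          release cpu_base->lock
take cpu_base->lock
fn returned HRTIMER_RESTART
BUG_ON(timer->state != HRTIMER_STATE_CALLBACK)  <- state is CALLBACK|ENQUEUED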

So I _think_ something like the below would actually solve all the
problems, no?

Completely untested... does anybody have a 'useful' test case for this
bw stuff?

---
 kernel/hrtimer.c     |  9 ++++++---
 kernel/sched/core.c  | 10 ++++++----
 kernel/sched/fair.c  | 42 +++---------------------------------------
 kernel/sched/sched.h |  2 +-
 4 files changed, 16 insertions(+), 47 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3ab28993f6e0..28942c65635e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1273,11 +1273,14 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ea7b3f1a087..829a158948a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,17 +106,17 @@ void __smp_mb__after_atomic(void)
 EXPORT_SYMBOL(__smp_mb__after_atomic);
 #endif
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
+int start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
 	unsigned long delta;
 	ktime_t soft, hard, now;
+	int overruns = 0;
 
 	for (;;) {
-		if (hrtimer_active(period_timer))
+		if (hrtimer_is_queued(period_timer))
 			break;
 
-		now = hrtimer_cb_get_time(period_timer);
-		hrtimer_forward(period_timer, now, period);
+		overruns += hrtimer_forward_now(period_timer, period);
 
 		soft = hrtimer_get_softexpires(period_timer);
 		hard = hrtimer_get_expires(period_timer);
@@ -124,6 +124,8 @@ void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 		__hrtimer_start_range_ns(period_timer, soft, delta,
 					 HRTIMER_MODE_ABS_PINNED, 0);
 	}
+
+	return overruns;
 }
 
 DEFINE_MUTEX(sched_domains_mutex);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28ccf502c63c..27dfb59f6aaf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3146,15 +3146,13 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		amount = min_amount;
 	else {
 		/*
-		 * If the bandwidth pool has become inactive, then at least one
+		 * If we had overruns starting the timer, then at least one
 		 * period must have elapsed since the last consumption.
 		 * Refresh the global state and ensure bandwidth timer becomes
 		 * active.
 		 */
-		if (!cfs_b->timer_active) {
+		if (start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period))
 			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b);
-		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -3331,8 +3329,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
 	raw_spin_unlock(&cfs_b->lock);
 

@@ -3446,13 +3443,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	if (idle)
 		goto out_unlock;
 
-	/*
-	 * if we have relooped after returning idle once, we need to update our
-	 * status as actually running, so that other cpus doing
-	 * __start_cfs_bandwidth will stop trying to cancel us.
-	 */
-	cfs_b->timer_active = 1;
-
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
@@ -3499,8 +3489,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	 */
 	cfs_b->idle = 0;
 out_unlock:
-	if (idle)
-		cfs_b->timer_active = 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle;
@@ -3713,30 +3701,6 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-/* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{
-	/*
-	 * The timer may be active because we're trying to set a new bandwidth
-	 * period or because we're racing with the tear-down path
-	 * (timer_active==0 becomes visible before the hrtimer call-back
-	 * terminates). In either case we ensure that it's re-programmed
-	 */
-	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
-	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
-		/* bounce the lock to allow do_sched_cfs_period_timer to run */
-		raw_spin_unlock(&cfs_b->lock);
-		cpu_relax();
-		raw_spin_lock(&cfs_b->lock);
-		/* if someone else restarted the timer then we're done */
-		if (cfs_b->timer_active)
-			return;
-	}
-
-	cfs_b->timer_active = 1;
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
-}
-
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2cbe81308af..3214eb78fd91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,7 +187,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
-	int idle, timer_active;
+	int idle;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 