Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs

From: Nick Piggin
Date: Tue Nov 16 2004 - 18:08:49 EST


Nick Piggin wrote:
Ingo Molnar wrote:

PREEMPT_RT on SMP systems triggered weird (very high) load average
values rather easily, which turned out to be a mainline kernel
->nr_uninterruptible handling bug in try_to_wake_up().

the following code:

	if (old_state == TASK_UNINTERRUPTIBLE) {
		old_rq->nr_uninterruptible--;

potentially executes with old_rq being != rq, and hence updates
->nr_uninterruptible without the lock held. Given a sufficiently
concurrent preemption workload, the count can get out of whack and
updates might get lost, permanently skewing the global count. Nothing
except the load average uses nr_uninterruptible(), so this condition
can go unnoticed quite easily.


Hi Ingo,
Yes, you're right.
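
For reference, the global count that the load average samples is just
the sum of the per-runqueue counters. Roughly, paraphrasing from
memory of 2.6-era kernel/sched.c and kernel/timer.c (exact names and
iterator macros may differ):

unsigned long nr_uninterruptible(void)
{
	unsigned long i, sum = 0;

	/* no global counter; the per-cpu fields are summed on demand */
	for_each_online_cpu(i)
		sum += cpu_rq(i)->nr_uninterruptible;

	return sum;
}

static unsigned long count_active_tasks(void)
{
	/* sampled by calc_load() for the avenrun[] estimates */
	return (nr_running() + nr_uninterruptible()) * FIXED_1;
}

So a single lost decrement (or increment) on any one runqueue skews
that sum, and hence the load average, permanently.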

I have another idea. Revert to the old code, then just transfer the
nr_uninterruptible count when migrating a task. That way, each rq's
nr_uninterruptible field is always a measure of the number of
uninterruptible tasks accounted to it. What do you think?
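
For reference, the increment side that this invariant relies on is
deactivate_task(), which bumps the counter on the runqueue the task
goes to sleep on. Roughly, from memory of 2.6 kernel/sched.c (the
exact signature may differ):

static void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
	rq->nr_running--;
	/* account uninterruptible sleepers on the rq they slept on */
	if (p->state == TASK_UNINTERRUPTIBLE)
		rq->nr_uninterruptible++;
	dequeue_task(p, p->array);
	p->array = NULL;
}

try_to_wake_up() is the matching decrement when the task wakes.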

Something like this:


---

linux-2.6-npiggin/kernel/sched.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff -puN kernel/sched.c~sched-nr_unint-fix kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-nr_unint-fix 2004-11-17 09:54:36.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c 2004-11-17 10:01:49.000000000 +1100
@@ -981,14 +981,14 @@ static int try_to_wake_up(task_t * p, un
 	int cpu, this_cpu, success = 0;
 	unsigned long flags;
 	long old_state;
-	runqueue_t *rq, *old_rq;
+	runqueue_t *rq;
 #ifdef CONFIG_SMP
 	unsigned long load, this_load;
 	struct sched_domain *sd;
 	int new_cpu;
 #endif
 
-	old_rq = rq = task_rq_lock(p, &flags);
+	rq = task_rq_lock(p, &flags);
 	schedstat_inc(rq, ttwu_cnt);
 	old_state = p->state;
 	if (!(old_state & state))
@@ -1083,7 +1083,7 @@ out_set_cpu:
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
-		old_rq->nr_uninterruptible--;
+		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
 		 * sleep_avg beyond just interactive state.
@@ -1608,8 +1608,12 @@ void pull_task(runqueue_t *src_rq, prio_
 {
 	dequeue_task(p, src_array);
 	src_rq->nr_running--;
-	set_task_cpu(p, this_cpu);
 	this_rq->nr_running++;
+	if (p->state == TASK_UNINTERRUPTIBLE) {
+		src_rq->nr_uninterruptible--;
+		this_rq->nr_uninterruptible++;
+	}
+	set_task_cpu(p, this_cpu);
 	enqueue_task(p, this_array);
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;

_