Re: 2.6 vs 2.4, ssh terminal slowdown

From: MIke Galbraith
Date: Sun Feb 12 2006 - 08:39:51 EST


On Fri, 2006-02-10 at 07:35 +0100, MIke Galbraith wrote:
> On Thu, 2006-02-09 at 15:06 -0500, Lee Revell wrote:
> > On Thu, 2006-02-09 at 18:06 +0100, Jan Engelhardt wrote:
> > > >> grant@deltree:~$ time grep -v 192\.168\. /var/log/apache/access_log| cut
> > > >> -c-95 ...
> > > >
> > > >What happens if you add "| cat" on the end of your command?
> > > >
> > > Do you think it's the new pipe buffering thing? (Introduced 2.6.10-.12,
> > > don't remember exactly)
> >
> > If it's the same problem I've been seeing it goes back much farther than
> > 2.6.10.
> >
> > Lately I suspect the scheduler.
>
> Hmm. I ran into an oddity while testing a modified kernel, and see
> something in schedule() that I don't think is right...
>
> Down where it does requeue_task(next, array) if a freshly awakened task
> is to possibly receive a priority boost for the time it sat on the
> runqueue, I see a potential problem. If the task didn't sit on the
> queue long enough to be promoted, and isn't at the very top, it is going
> to the back of the bus as soon it gets preempted by say xmms. For a
> task that possibly just sat through the full rotation of a busy queue
> waiting for a shot at the cpu, that has got to hurt. Speculating, that
> requeue looks like it's there to increase the queue rotation rate, ie to
> reduce latency, but it looks to me like it can also accomplish the
> opposite if the context switch rate for your queue isn't very high.
>
> ... I ended up sharing a queue with a few rampaging irman2 threads, and
> each keystroke took ages. [btw, i wonder how the heck next->array could
> not be rq->active there]

I guess you didn't try my pseudo suggestion. Since I happen to be
actively tinkering in this very area, I'll be a bit more direct :)

If you think it's the scheduler, how about try the patch below. It's
against 2.6.16-rc2-mm1, and should tell you if it is the interactivity
logic in the scheduler or not. I don't see other candidates in there,
not that that means there aren't any of course.

With this patch in place, running an irman2 in one window, a make -j4
over nfs in another, and multimedia_sim as a test application in
another, I get these results...

[mikeg@Homer]:> ./multimedia_sim 0 60
nice_level = 0
duration = 60 seconds
[frames] received: 1784 dropped: 0
[latency] mean: 0.000627 max: 0.019647 stddev: 0.000786
score: 0.004240

.... test proggy attached for inspection.

To be extra sure, set both /proc/sys/kernel/sched_g1 and g2 to 0. That
will (I mean should of course;) more or less restore the original O(1)
scheduler behavior.

-Mike

Maybe not pretty, but effective counts too...

--- linux-2.6.16-rc2-mm1x/include/linux/sched.h.org 2006-02-09 13:15:50.000000000 +0100
+++ linux-2.6.16-rc2-mm1x/include/linux/sched.h 2006-02-09 13:16:30.000000000 +0100
@@ -721,14 +721,14 @@

unsigned short ioprio;

- unsigned long sleep_avg;
+ unsigned long sleep_avg, last_slice, throttle_stamp;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
enum sleep_type sleep_type;

unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int time_slice, slice_info;

#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
--- linux-2.6.16-rc2-mm1x/include/linux/sysctl.h.org 2006-02-09 13:16:02.000000000 +0100
+++ linux-2.6.16-rc2-mm1x/include/linux/sysctl.h 2006-02-09 13:16:30.000000000 +0100
@@ -147,6 +147,8 @@
KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
+ KERN_SCHED_THROTTLE1=72, /* int: throttling grace period 1 in secs */
+ KERN_SCHED_THROTTLE2=73, /* int: throttling grace period 2 in secs */
};


--- linux-2.6.16-rc2-mm1x/kernel/sched.c.org 2006-02-09 13:15:04.000000000 +0100
+++ linux-2.6.16-rc2-mm1x/kernel/sched.c 2006-02-10 16:55:09.000000000 +0100
@@ -158,9 +158,195 @@
#define TASK_INTERACTIVE(p) \
((p)->prio <= (p)->static_prio - DELTA(p))

-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
+/*
+ * Interactivity boost can lead to serious starvation problems if the
+ * task being boosted turns out to be a cpu hog. To combat this, we
+ * compute a running slice_avg, which is the sane upper limit for the
+ * task's sleep_avg. If an 'interactive' task begins burning cpu, it's
+ * slice_avg will decay, making it visible as a problem so corrective
+ * measures can be applied.
+ *
+ * /proc/sys/kernel tunables.
+ *
+ * sched_g1: Grace period in seconds that a task is allowed to run unchecked.
+ * sched_g2: seconds thereafter, to force a priority adjustment.
+ */
+
+int sched_g1 = 20;
+int sched_g2 = 10;
+
+/*
+ * Offset from the time we noticed a potential problem until we disable the
+ * interactive bonus multiplier, and adjust sleep_avg consumption rate.
+ */
+#define G1 (sched_g1 * HZ)
+
+/*
+ * Offset thereafter that we disable the interactive bonus divisor, and adjust
+ * a runaway task's priority.
+ */
+#define G2 (sched_g2 * HZ + G1)
+
+/*
+ * Grace period has expired.
+ */
+#define grace_expired(p, grace) ((p)->throttle_stamp && \
+ time_after_eq(jiffies, (p)->throttle_stamp + (grace)))
+
+#define NEXT_PRIO (NS_MAX_SLEEP_AVG / MAX_BONUS)
+
+/*
+ * Warning: do not reduce threshold below NS_MAX_SLEEP_AVG / MAX_BONUS
+ * else you may break the case where one of a pair of communicating tasks
+ * only sleeps a miniscule amount of time, but must to be able to preempt
+ * it's partner in order to get any cpu time to speak of. If you push that
+ * task to the same level or below it's partner, it will not be able to
+ * preempt and will starve. This scenario was fixed for bonus calculation
+ * by converting sleep_avg to ns.
+ */
+#define THROTTLE_THRESHOLD (NEXT_PRIO)
+
+#define NS_MAX_SLEEP_AVG_PCNT (NS_MAX_SLEEP_AVG / 100)
+
+/*
+ * Masks for p->slice_info, formerly p->first_time_slice.
+ * SLICE_FTS: 0x80000000 Task is in it's first ever timeslice.
+ * SLICE_NEW: 0x40000000 Slice refreshed.
+ * SLICE_SPA: 0x3FFF8000 Spare bits.
+ * SLICE_LTS: 0x00007F80 Last time slice
+ * SLICE_AVG: 0x0000007F Task slice_avg stored as percentage.
+ */
+#define SLICE_AVG_BITS 7
+#define SLICE_LTS_BITS 10
+#define SLICE_SPA_BITS 13
+#define SLICE_NEW_BITS 1
+#define SLICE_FTS_BITS 1
+
+#define SLICE_AVG_SHIFT 0
+#define SLICE_LTS_SHIFT (SLICE_AVG_SHIFT + SLICE_AVG_BITS)
+#define SLICE_SPA_SHIFT (SLICE_LTS_SHIFT + SLICE_LTS_BITS)
+#define SLICE_NEW_SHIFT (SLICE_SPA_SHIFT + SLICE_SPA_BITS)
+#define SLICE_FTS_SHIFT (SLICE_NEW_SHIFT + SLICE_NEW_BITS)
+
+#define INFO_MASK(x) ((1U << (x))-1)
+#define SLICE_AVG_MASK (INFO_MASK(SLICE_AVG_BITS) << SLICE_AVG_SHIFT)
+#define SLICE_LTS_MASK (INFO_MASK(SLICE_LTS_BITS) << SLICE_LTS_SHIFT)
+#define SLICE_SPA_MASK (INFO_MASK(SLICE_SPA_BITS) << SLICE_SPA_SHIFT)
+#define SLICE_NEW_MASK (INFO_MASK(SLICE_NEW_BITS) << SLICE_NEW_SHIFT)
+#define SLICE_FTS_MASK (INFO_MASK(SLICE_FTS_BITS) << SLICE_FTS_SHIFT)
+
+#define first_time_slice(p) ((p)->slice_info & SLICE_FTS_MASK)
+#define set_first_time_slice(p) ((p)->slice_info |= SLICE_FTS_MASK)
+#define clr_first_time_slice(p) ((p)->slice_info &= ~SLICE_FTS_MASK)
+
+#define slice_is_new(p) ((p)->slice_info & SLICE_NEW_MASK)
+#define set_slice_is_new(p) ((p)->slice_info |= SLICE_NEW_MASK)
+#define clr_slice_is_new(p) ((p)->slice_info &= ~SLICE_NEW_MASK)
+
+#define last_slice(p) \
+ ((((p)->slice_info & SLICE_LTS_MASK) >> SLICE_LTS_SHIFT) ? : \
+ DEF_TIMESLICE)
+#define set_last_slice(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_LTS_MASK) | (((n) << SLICE_LTS_SHIFT) & SLICE_LTS_MASK)))
+
+#define slice_avg(p) \
+ ((((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT) * \
+ NS_MAX_SLEEP_AVG_PCNT)
+#define set_slice_avg(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | ((((n) / NS_MAX_SLEEP_AVG_PCNT) \
+ << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+#define slice_avg_raw(p) \
+ (((p)->slice_info & SLICE_AVG_MASK) >> SLICE_AVG_SHIFT)
+#define set_slice_avg_raw(p, n) ((p)->slice_info = (((p)->slice_info & \
+ ~SLICE_AVG_MASK) | (((n) << SLICE_AVG_SHIFT) & SLICE_AVG_MASK)))
+
+#define cpu_avg(p) \
+ (100 - slice_avg_raw(p))
+
+#define slice_time_avg(p) \
+ (100 * last_slice(p) / max((unsigned)cpu_avg(p), 1U))
+
+#define time_this_slice(p) \
+ (jiffies - (p)->last_slice)
+
+#define cpu_this_slice(p) \
+ (100 * last_slice(p) / max((unsigned)time_this_slice(p), \
+ (unsigned)last_slice(p)))
+
+#define this_slice_avg(p) \
+ ((100 - cpu_this_slice(p)) * NS_MAX_SLEEP_AVG_PCNT)
+
+/*
+ * In order to prevent tasks from thrashing between domesticated livestock
+ * and irate rhino, once a throttle is hung on a task, the only way to get
+ * rid of it is to change behavior. We push the throttle stamp forward in
+ * time as things improve until the stamp is in the future. Only then may
+ * we safely pull our 'tranquilizer dart'.
+ */
+#define conditional_tag(p) ((!(p)->throttle_stamp && \
+ (p)->sleep_avg > slice_avg(p) + THROTTLE_THRESHOLD) ? \
+({ \
+ ((p)->throttle_stamp = jiffies) ? : 1; \
+}) : 0)
+
+/*
+ * Those who use the least cpu receive the most encouragement.
+ */
+#define SLICE_AVG_MULTIPLIER(p) \
+ (1 + NS_TO_JIFFIES(this_slice_avg(p)) * MAX_BONUS / MAX_SLEEP_AVG)
+
+#define conditional_release(p) (((p)->throttle_stamp && \
+ (p)->sched_time >= (G2 ? JIFFIES_TO_NS(HZ) : ~0ULL) && \
+ ((20 + cpu_this_slice(p) < cpu_avg(p) && (p)->sleep_avg < \
+ slice_avg(p) + THROTTLE_THRESHOLD) || cpu_avg(p) <= 5)) ? \
+({ \
+ int __ret = 0; \
+ int delay = slice_time_avg(p) - last_slice(p); \
+ if (delay > 0) { \
+ delay *= SLICE_AVG_MULTIPLIER(p); \
+ (p)->throttle_stamp += delay; \
+ } \
+ if (time_before(jiffies, (p)->throttle_stamp)) { \
+ (p)->throttle_stamp = 0; \
+ __ret++; \
+ if (!((p)->state & TASK_NONINTERACTIVE)) \
+ (p)->sleep_type = SLEEP_NORMAL; \
+ } \
+ __ret; \
+}) : 0)
+
+/*
+ * CURRENT_BONUS(p) adjusted to match slice_avg after grace expiration.
+ */
+#define ADJUSTED_BONUS(p, grace) \
+({ \
+ unsigned long sleep_avg = (p)->sleep_avg; \
+ if (grace_expired(p, (grace))) \
+ sleep_avg = min((unsigned long)(p)->sleep_avg, \
+ (unsigned long)slice_avg(p)); \
+ NS_TO_JIFFIES(sleep_avg) * MAX_BONUS / MAX_SLEEP_AVG; \
+})
+
+#define BONUS_MULTIPLIER(p) \
+ (grace_expired(p, G1) ? : SLICE_AVG_MULTIPLIER(p))
+
+#define BONUS_DIVISOR(p) \
+ (grace_expired(p, G2) ? : (1 + ADJUSTED_BONUS(p, G1)))
+
+#define INTERACTIVE_SLEEP_AVG(p) \
+ (min(JIFFIES_TO_NS(MAX_SLEEP_AVG * (MAX_BONUS / 2 + DELTA(p)) / MAX_BONUS), \
+ NS_MAX_SLEEP_AVG))
+
+/*
+ * The quantity of sleep quaranteed to elevate a task to interactive status,
+ * or if already there, to elevate it to the next priority or beyond.
+ */
+#define INTERACTIVE_SLEEP_NS(p, ns) \
+ (BONUS_MULTIPLIER(p) * (ns) >= INTERACTIVE_SLEEP_AVG(p) || \
+ ((p)->sleep_avg < INTERACTIVE_SLEEP_AVG(p) && BONUS_MULTIPLIER(p) * \
+ (ns) + (p)->sleep_avg >= INTERACTIVE_SLEEP_AVG(p)) || \
+ ((p)->sleep_avg >= INTERACTIVE_SLEEP_AVG(p) && BONUS_MULTIPLIER(p) * \
+ (ns) + ((p)->sleep_avg % NEXT_PRIO) >= NEXT_PRIO))

#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -668,7 +854,7 @@
if (rt_task(p))
return p->prio;

- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ bonus = ADJUSTED_BONUS(p, G2) - MAX_BONUS / 2;

prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
@@ -794,19 +980,39 @@

if (likely(sleep_time > 0)) {
/*
- * User tasks that sleep a long time are categorised as
- * idle. They will only have their sleep_avg increased to a
+ * Tasks that sleep a long time are categorised as idle.
+ * They will only have their sleep_avg increased to a
* level that makes them just interactive priority to stay
* active yet prevent them suddenly becoming cpu hogs and
- * starving other processes.
+ * starving other processes. All tasks must stop at each
+ * TASK_INTERACTIVE boundry before moving on so that no
+ * single sleep slams it straight into NS_MAX_SLEEP_AVG.
*/
- if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) {
- unsigned long ceiling;
+ if (INTERACTIVE_SLEEP_NS(p, sleep_time)) {
+ int ticks = last_slice(p) / BONUS_DIVISOR(p);
+ unsigned long ceiling = INTERACTIVE_SLEEP_AVG(p);
+
+ ticks = JIFFIES_TO_NS(ticks);
+
+ if (grace_expired(p, G2) && slice_avg(p) < ceiling)
+ ceiling = slice_avg(p);
+ /* Promote previously interactive task. */
+ else if (p->sleep_avg >= INTERACTIVE_SLEEP_AVG(p) &&
+ !grace_expired(p, G2)) {
+
+ ceiling = p->sleep_avg / NEXT_PRIO;
+ if (ceiling < MAX_BONUS)
+ ceiling++;
+ ceiling *= NEXT_PRIO;
+ }

- ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- DEF_TIMESLICE);
- if (p->sleep_avg < ceiling)
- p->sleep_avg = ceiling;
+ ceiling += ticks;
+
+ if (ceiling > NS_MAX_SLEEP_AVG)
+ ceiling = NS_MAX_SLEEP_AVG;
+
+ if (p->sleep_avg < ceiling)
+ p->sleep_avg = ceiling;
} else {

/*
@@ -816,9 +1022,8 @@
* If a task was sleeping with the noninteractive
* label do not apply this non-linear boost
*/
- if (p->sleep_type != SLEEP_NONINTERACTIVE || !p->mm)
- sleep_time *=
- (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+ if (p->sleep_type != SLEEP_NONINTERACTIVE)
+ sleep_time *= BONUS_MULTIPLIER(p);

/*
* This code gives a bonus to interactive tasks.
@@ -1367,7 +1572,10 @@

out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+
+ conditional_release(p);
+
+ if (old_state & TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
* Tasks waking from uninterruptible sleep are likely
@@ -1468,9 +1676,27 @@
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
- p->first_time_slice = 1;
+ set_first_time_slice(p);
current->time_slice >>= 1;
p->timestamp = sched_clock();
+
+ /*
+ * Set up slice_info for the child.
+ *
+ * Note: The child inherits the parent's throttle,
+ * and must shake it loose. It does not inherit
+ * the parent's slice_avg.
+ */
+ set_slice_avg(p, NS_MAX_SLEEP_AVG);
+ set_last_slice(p, p->time_slice);
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+ /*
+ * Limit the difficulty to what the parent faced.
+ */
+ if (p->throttle_stamp && grace_expired(p, G2))
+ p->throttle_stamp = jiffies - G2;
+
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1584,7 +1810,7 @@
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
- if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) {
+ if (first_time_slice(p) && task_cpu(p) == task_cpu(p->parent)) {
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
@@ -2665,6 +2891,51 @@
}

/*
+ * Calculate a task's average cpu usage rate in terms of sleep_avg, and
+ * check whether the task may soon need throttling. Must be called after
+ * refreshing the task's time slice.
+ * @p: task for which slice_avg should be computed.
+ */
+static void recalc_task_slice_avg(task_t *p)
+{
+ unsigned int slice_avg = slice_avg_raw(p);
+ unsigned int time_slice = last_slice(p);
+ int w = MAX_BONUS, idle;
+
+ if (unlikely(!time_slice))
+ set_last_slice(p, p->time_slice);
+
+ idle = 100 - cpu_this_slice(p);
+
+ /*
+ * If the task is lowering it's cpu usage, speed up the
+ * effect on slice_avg so we don't over-throttle.
+ */
+ if (idle > slice_avg) {
+ w -= idle / w;
+ if (!w)
+ w = 1;
+ }
+
+ slice_avg = (w * (slice_avg ? : 1) + idle) / (w + 1);
+
+ /* Check to see if we should start/stop throttling. */
+ if(!rt_task(p) && !conditional_release(p))
+ conditional_tag(p);
+
+ /* Update slice_avg. */
+ set_slice_avg_raw(p, slice_avg);
+
+ /* Update cached slice length. */
+ if (time_slice != p->time_slice)
+ set_last_slice(p, p->time_slice);
+
+ /* And finally, stamp and tag the new slice. */
+ set_slice_is_new(p);
+ p->last_slice = jiffies;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -2709,20 +2980,24 @@
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ recalc_task_slice_avg(p);
+ clr_first_time_slice(p);
set_tsk_need_resched(p);

/* put it at the end of the queue: */
requeue_task(p, rq->active);
}
+ if (unlikely(p->throttle_stamp))
+ p->throttle_stamp = 0;
goto out_unlock;
}
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
+ recalc_task_slice_avg(p);
+ p->prio = effective_prio(p);
+ clr_first_time_slice(p);

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -3033,7 +3308,7 @@
* Tasks charged proportionately less run_time at high sleep_avg to
* delay them losing their interactive status
*/
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ run_time /= BONUS_DIVISOR(prev);

spin_lock_irq(&rq->lock);

@@ -3047,7 +3322,7 @@
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else {
- if (prev->state == TASK_UNINTERRUPTIBLE)
+ if (prev->state & TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
deactivate_task(prev, rq);
}
@@ -3096,6 +3371,7 @@
rq->best_expired_prio = MAX_PRIO;
}

+repeat_selection:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
@@ -3115,8 +3391,14 @@
dequeue_task(next, array);
next->prio = new_prio;
enqueue_task(next, array);
- } else
- requeue_task(next, array);
+
+ /*
+ * We may have just been demoted below other
+ * runnable tasks in our previous queue.
+ */
+ next->sleep_type = SLEEP_NORMAL;
+ goto repeat_selection;
+ }
}
next->sleep_type = SLEEP_NORMAL;
switch_tasks:
@@ -3134,6 +3416,14 @@
prev->sleep_avg = 0;
prev->timestamp = prev->last_ran = now;

+ /*
+ * Tag start of execution of a new timeslice.
+ */
+ if (unlikely(slice_is_new(next))) {
+ next->last_slice = jiffies;
+ clr_slice_is_new(next);
+ }
+
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
--- linux-2.6.16-rc2-mm1x/kernel/sysctl.c.org 2006-02-09 13:15:17.000000000 +0100
+++ linux-2.6.16-rc2-mm1x/kernel/sysctl.c 2006-02-09 13:16:30.000000000 +0100
@@ -69,6 +69,8 @@
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
extern int percpu_pagelist_fraction;
+extern int sched_g1;
+extern int sched_g2;

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -224,6 +226,11 @@
{ .ctl_name = 0 }
};

+/* Constants for minimum and maximum testing in vm_table and
+ * kern_table. We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -666,15 +673,29 @@
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_SCHED_THROTTLE1,
+ .procname = "sched_g1",
+ .data = &sched_g1,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = KERN_SCHED_THROTTLE2,
+ .procname = "sched_g2",
+ .data = &sched_g2,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};

-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
--- linux-2.6.16-rc2-mm1x/fs/pipe.c.org 2006-02-09 13:15:35.000000000 +0100
+++ linux-2.6.16-rc2-mm1x/fs/pipe.c 2006-02-09 13:16:30.000000000 +0100
@@ -39,11 +39,7 @@
{
DEFINE_WAIT(wait);

- /*
- * Pipes are system-local resources, so sleeping on them
- * is considered a noninteractive wait:
- */
- prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE|TASK_NONINTERACTIVE);
+ prepare_to_wait(PIPE_WAIT(*inode), &wait, TASK_INTERRUPTIBLE);
mutex_unlock(PIPE_MUTEX(*inode));
schedule();
finish_wait(PIPE_WAIT(*inode), &wait);

/* multimedia_sim.c v0.3
*
* Dec 2002 - Miguel Freitas
*
* this is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
*
* this program is meant to simulate a dummy multimedia application and
* measure how it would perform in a loaded system. it basicaly tries to
* identify frame skipping problems that would have affected the movie
* playback.
*
* although the model of threads is heavily based on xine's architecture,
* its results should be also comparable with any other player program
* like mplayer or avifile. the idea is to measure when the player isn't
* scheduled in time for sending images at full frame rate and if X server
* would also be scheduled in time for displaying.
*
* of course one might try some tricks to improve performance like decreasing
* nice values for both XFree86 and player. however some distros don't
* ship the X reniced and modifying desktop menu entries to add "nice" and
* "sudo" commands is beyond most of users who just want to play their dvds...
*
* compile with: gcc -o multimedia_sim multimedia_sim.c -lpthread -lm
* run as: ./multimedia_sim [nice_level] [test_duration]
*
* note1: default CPU_BURNING value should simulate more or less a mpeg2-class
* decoding cpu usage. that will require, at least, a 300MHz processor.
*
* note2: a better simulation of xine's backend/frontend architecture would
* also include another thread (frontend) to receive the xshm completion
* events. i have intentionally not implemented it here.
*/

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/time.h>
#include <math.h>
#include <inttypes.h>

#define FRAME_PERIOD 1000000/30 /* NTSC period in us */
#define FRAME_SIZE 720*480*3/2 /* std resolution in yv12 format */
#define PREBUFFER_FRAMES 15 /* how many frames to "decode" ahead */
#define CPU_BURNING 8 /* reduce if your cpu isn't fast enough */

int nice_level = 0;

int decoder_running;
int video_out_running;
int server_running;

pthread_mutex_t counters_lock;
pthread_cond_t wakeup_server;
pthread_mutex_t queue_lock;
pthread_cond_t enqueued;
pthread_cond_t dequeued;
int frames_sent = 0;
int frames_received = 0;
int frames_enqueued = 0;
double start_time;

pthread_mutex_t counters_lock;
pthread_cond_t wakeup_server;


/* statistics */
double total_latency = 0.0;
double total_square = 0.0;
double max_latency = 0.0;
int frames_dropped = 0;
int dropped_in_burst = 0;
int bursts = 0;


static void *alloc_frame( void ) {
void *frame;

frame = (void *)malloc(FRAME_SIZE);
memset(frame,0,FRAME_SIZE);
return frame;
}

static void dummy_enqueue_frame( void ) {
pthread_mutex_lock( &queue_lock );
while( frames_enqueued == PREBUFFER_FRAMES )
pthread_cond_wait( &dequeued, &queue_lock );
frames_enqueued++;
pthread_cond_signal( &enqueued );
pthread_mutex_unlock( &queue_lock );
}

static void dummy_dequeue_frame( void ) {
pthread_mutex_lock( &queue_lock );
while( !frames_enqueued )
pthread_cond_wait( &enqueued, &queue_lock );
frames_enqueued--;
pthread_cond_signal( &dequeued );
pthread_mutex_unlock( &queue_lock );
}

static double get_us_time( void ) {
struct timeval tv;
double us;

gettimeofday(&tv, NULL);
us = tv.tv_sec * 1e6;
us += tv.tv_usec;

return us;
}

/* from libmpeg2, used to burn cpu cycles */
#define W1 2841 /* 2048*sqrt (2)*cos (1*pi/16) */
#define W2 2676 /* 2048*sqrt (2)*cos (2*pi/16) */
#define W3 2408 /* 2048*sqrt (2)*cos (3*pi/16) */
#define W5 1609 /* 2048*sqrt (2)*cos (5*pi/16) */
#define W6 1108 /* 2048*sqrt (2)*cos (6*pi/16) */
#define W7 565 /* 2048*sqrt (2)*cos (7*pi/16) */
static void idct_row (int16_t * block)
{
int x0, x1, x2, x3, x4, x5, x6, x7, x8;

x1 = block[4] << 11;
x2 = block[6];
x3 = block[2];
x4 = block[1];
x5 = block[7];
x6 = block[5];
x7 = block[3];

x0 = (block[0] << 11) + 128; /* for proper rounding in the fourth stage */

/* first stage */
x8 = W7 * (x4 + x5);
x4 = x8 + (W1 - W7) * x4;
x5 = x8 - (W1 + W7) * x5;
x8 = W3 * (x6 + x7);
x6 = x8 - (W3 - W5) * x6;
x7 = x8 - (W3 + W5) * x7;

/* second stage */
x8 = x0 + x1;
x0 -= x1;
x1 = W6 * (x3 + x2);
x2 = x1 - (W2 + W6) * x2;
x3 = x1 + (W2 - W6) * x3;
x1 = x4 + x6;
x4 -= x6;
x6 = x5 + x7;
x5 -= x7;

/* third stage */
x7 = x8 + x3;
x8 -= x3;
x3 = x0 + x2;
x0 -= x2;
x2 = (181 * (x4 + x5) + 128) >> 8;
x4 = (181 * (x4 - x5) + 128) >> 8;

/* fourth stage */
block[0] = (x7 + x1) >> 8;
block[1] = (x3 + x2) >> 8;
block[2] = (x0 + x4) >> 8;
block[3] = (x8 + x6) >> 8;
block[4] = (x8 - x6) >> 8;
block[5] = (x0 - x4) >> 8;
block[6] = (x3 - x2) >> 8;
block[7] = (x7 - x1) >> 8;
}


static void *decoder_loop (void *this_gen) {

int16_t *frame;
int i,j;

frame = alloc_frame();
/* dummy data */
for( i = 0; i < FRAME_SIZE/sizeof(int16_t); i++ )
frame[i] = i;

while( decoder_running )
{
/* eat some cpu cycles */
for( j = 0; j < CPU_BURNING; j++ )
for( i = 0; i < FRAME_SIZE/8/sizeof(int16_t); i+=8 )
idct_row( &frame[i] );

dummy_enqueue_frame();
}

free(frame);

pthread_exit(NULL);
}


static void *video_out_loop (void *this_gen) {

double ttf; /* time to frame */
void *frame1, *frame2;

nice(nice_level);

frame1 = alloc_frame();
frame2 = alloc_frame();

start_time = get_us_time() + (FRAME_PERIOD * PREBUFFER_FRAMES);

while( video_out_running )
{
dummy_dequeue_frame();

/* eat some cpu cycles */
/*memcpy(frame1, frame2, FRAME_SIZE);*/

ttf = start_time + (frames_sent * FRAME_PERIOD);
ttf -= get_us_time();

if( ttf > 0 )
usleep( ttf );

pthread_mutex_lock( &counters_lock );
frames_sent++;
pthread_cond_signal( &wakeup_server );
pthread_mutex_unlock( &counters_lock );
}

free(frame1);
free(frame2);

pthread_exit(NULL);
}


static void *server_loop (void *this_gen) {

double estimated;
double latency;
void *frame1, *frame2;

nice(nice_level);

frame1 = alloc_frame();
frame2 = alloc_frame();

while( server_running )
{
pthread_mutex_lock( &counters_lock );
while( frames_sent <= frames_received )
pthread_cond_wait( &wakeup_server, &counters_lock );
pthread_mutex_unlock( &counters_lock );

estimated = start_time + (frames_received * FRAME_PERIOD);
latency = (get_us_time() - estimated)/1.0e6;

if( latency > max_latency )
max_latency = latency;

frames_received++;
if( latency > FRAME_PERIOD/1.0e6 ) {
frames_dropped++;
dropped_in_burst++;
} else if (dropped_in_burst) {
dropped_in_burst = 0;
bursts++;
}

total_latency += latency;
total_square += latency * latency;

/* eat some cpu cycles */
memcpy(frame1, frame2, FRAME_SIZE);
}

if (dropped_in_burst)
bursts++;

free(frame1);
free(frame2);

pthread_exit(NULL);
}


int main(int argc, char *argv[])
{
pthread_t decoder_thread;
pthread_t video_thread;
pthread_t server_thread;
void *p;
double mean, var, stddev;
double burst_size;
double score;
int duration = 10;

if( argc > 1 ) {
nice_level = atoi(argv[1]);
printf("nice_level = %d\n", nice_level );
if( nice_level < 0 )
printf("(make sure you are root for negative nice)\n");
if( argc > 2 ) {
duration = atoi(argv[2]);
printf("duration = %d seconds\n", duration );
}
}

pthread_mutex_init (&counters_lock, NULL);
pthread_cond_init (&wakeup_server, NULL);
pthread_mutex_init (&queue_lock, NULL);
pthread_cond_init (&enqueued, NULL);
pthread_cond_init (&dequeued, NULL);

server_running = 1;
if ( pthread_create (&server_thread, NULL, server_loop, NULL) != 0) {
printf("Error creating server thread.\n");
return 1;
}

video_out_running = 1;
if ( pthread_create (&video_thread, NULL, video_out_loop, NULL) != 0) {
printf("Error creating video thread.\n");
return 1;
}

decoder_running = 1;
if ( pthread_create (&decoder_thread, NULL, decoder_loop, NULL) != 0) {
printf("Error creating decoder thread.\n");
return 1;
}

sleep(duration);

server_running = 0;
pthread_join(server_thread,&p);
video_out_running = 0;
pthread_join(video_thread, &p);

printf("[frames] received: %d dropped: %d\n", frames_received, frames_dropped );
if( bursts ) {
burst_size = (double)frames_dropped / bursts;
printf("[frames] mean dropped per burst: %lf\n", burst_size );
} else
burst_size = 1.0;

mean = total_latency / frames_received;
var = total_square / frames_received - mean*mean;
stddev = sqrt(var);
printf("[latency] mean: %lf max: %lf stddev: %lf\n", mean, max_latency, stddev);

/* score: lower is better. it tries to measure "how bad" the playback
* was to the user. it counts fraction of dropped frames, if the dropped
* frames were somewhat evenly distributed (instead of in bursts), and
* also the mean and standard deviation.
* note that the formula is actually arbitrary, i'm just trying to count
* all these factors and weight them.
*/

score = 0.0;
score += 0.90 * (double) frames_dropped / frames_received * sqrt(burst_size);
score += 0.10 * (mean + stddev)/(FRAME_PERIOD/1.0e6);
printf("score: %lf\n", score );
}