Re: [RFC][PATCH] SCHED_EDF scheduling class

From: Peter Zijlstra
Date: Tue Sep 22 2009 - 07:06:31 EST


On Tue, 2009-09-22 at 12:30 +0200, Raistlin wrote:

> Some notes about the implementation, on which we are open to comments
> and suggestions:
>
> * It implements the well-known Earliest Deadline First (EDF) algorithm

Might be good to mention here that the current implementation is fully
partitioned, but you're striving to make it global-ish.

> * It does not make any restrictive assumption about the characteristics
> of the tasks: it can handle periodic, sporadic or aperiodic tasks
> (see below)

sched_param_ex seems to only contain two of the three sporadic task
model parameters ;-)

> * We decided to put the new scheduling class in between sched_rt and
> sched_fair, for the following reasons:

I'd argue to put it above rt, since having it below renders any
guarantees void, since you have no idea about how much time is actually
available.

> - Backward compatibility and (POSIX) standard compliance require
> SCHED_FIFO/SCHED_RR tasks to run as soon as they can

In short, sod POSIX.

Simply define 'as soon as they can' to exclude EDF time ;-)

> - Since EDF is more efficient than fixed-priority in terms of
> schedulability, it will be easier for it to achieve high
> utilization even if it runs in the idle time of fixed-priority,
> than vice-versa
>
> - Real-time analysis techniques already exist to deal with exactly
> such scheduling architecture
>
> - The amount of interference coming from system SCHED_FIFO/SCHED_RR
> tasks can be accounted as system overhead
>
> - In recent kernels, the bandwidth used by the whole sched_rt
> scheduling class can be forced to stay below a certain threshold

The only two tasks that need to be the absolute highest priority are
kstopmachine and migrate, so those would need to be lifted above EDF,
the rest is not important :-)

> * It works natively on multicore platform. We have one runqueue for
> each CPU, implemented with a red-black tree, for efficient
> manipulation.
>
> * We will migrate the tasks among the runqueues trying to always have,
> on a system with m CPUs, the m earliest deadline ready tasks running
> on the CPUs.
> This is not present in the attached patch, but will be available
> soon.

Ah, ok, so now its fully paritioned, but you'll add that push-pull logic
to make it 'global-ish'.

That should be done at the root_domain level, so that cpusets still work
as expected.

> * Tasks are not migrated if a specific CPU affinity has been set.
> Affinity can be set using existing Linux mechanisms (i.e.,
> sched_setaffinity() for tasks and cpuset.cpus for task groups). Thus,
> it is sufficient that each EDF task in the system has its affinity
> set, to achieve a partitioned scheduling with very few overhead.
>
> * The API for tasks uses an additional system call, called
> sched_setscheduler_ex(...), to assign and modify the scheduling
> parameters described above (i.e., sched_runtime and sched_period).
> The existing system call sched_setscheduler() has not been extended,
> because of the binary compatibility issues that modifying its struct
> sched_param parameter would have raised for existing applications. So
> we implemented an additional syscall. The system call has the
> following prototype:
>
> #define SCHED_EDF 6
>
> struct sched_param_ex {
> int sched_priority; /* Backward compliance */
> struct timespec sched_edf_period;
> struct timespec sched_edf_runtime;
> };
>
> int sched_setscheduler_ex (pid_t pid, int policy, struct sched_param_ex *param);
>
> for the sake of consistency, also sched_setaparam_ex() and
> sched_getparam_ex() have been implemented.

There should also be a sys_sched_getscheduler_ex(), the get part of the
existing interface is what stops us from extending the sched_param bits
we have.

> * We did not add any further system call to allow periodic tasks to
> specify the end of the current instance, and the semantics of the
> existing sched_yield() system call has been extended for this
> purpose.

OK, so a wakeup after a yield, before a period elapses also raises some
warning/exception/signal thing?

> * It is integrated with the cgroups filesystem, so it's possible to
> assign a reservation to a group of tasks and make hierarchical
> scheduling.
> The cgroups interface provides two entries, cpu.edf_runtime_us and
> cpu.edf_period_us. These files are created once the cgroup filesystem
> is mounted, to get and set the group bandwidth.

Nice.

> * In case a SCHED_EDF tasks forks, parent's budget is split among
> parent and child.

Right, I guess there's really nothing else you can do here...


Haven't looked at the code yet,. hope to do so shortly.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/