Re: [patch 00/10] PI-futex: -V1

From: Ingo Molnar
Date: Sun Mar 26 2006 - 09:25:43 EST



* Esben Nielsen <simlo@xxxxxxxxxx> wrote:

> On Sun, 26 Mar 2006, Ingo Molnar wrote:
>
> > i mentioned it further down in the text - PRIO_PROTECT support (which is
> > priority ceiling) is planned for pthread mutexes. It needs no further
> > kernel changes, it's a pure userspace thing.
>
> Wouldn't this always include a call to sched_setscheduler() even for
> the fast path? And it would also involve assigning a priority to all
> locks up front.

correct. Priority-ceiling-based boosting can be called "manual PI" or
"open-coded PI". It is quite error-prone for more complex apps, because
not only does each thread have its own priority (which must be
well-designed), but every lock used by those threads must also be
assigned the maximum priority it will boost threads to. That per-lock
priority is redundant information and can be mis-assigned.

priority ceiling locks also involve two additional syscalls per critical
section, so they are significantly slower than the atomic-ops-based
PI-futex solution. [and thus they could degrade worst-case behavior
through higher locking overhead alone.]
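
to illustrate that overhead, here is a minimal sketch of such an
open-coded ceiling lock. [struct ceiling_mutex, ceiling_lock() and
ceiling_unlock() are made-up names for illustration, not anything from
the patchset.] Note the two sched_setscheduler() calls that are paid
even when the lock is uncontended:

	#include <pthread.h>
	#include <sched.h>

	/* illustrative open-coded priority ceiling lock */
	struct ceiling_mutex {
		pthread_mutex_t mutex;
		int ceiling;	/* manually assigned per lock - redundant info */
	};

	static void ceiling_lock(struct ceiling_mutex *m)
	{
		struct sched_param sp = { .sched_priority = m->ceiling };

		/* syscall #1: unconditionally boost to the ceiling.
		 * (pid 0 means the calling thread on Linux) */
		sched_setscheduler(0, SCHED_FIFO, &sp);
		pthread_mutex_lock(&m->mutex);
	}

	static void ceiling_unlock(struct ceiling_mutex *m, int own_prio)
	{
		struct sched_param sp = { .sched_priority = own_prio };

		pthread_mutex_unlock(&m->mutex);
		/* syscall #2: drop back to the thread's own priority */
		sched_setscheduler(0, SCHED_FIFO, &sp);
	}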

> There are only 2 good reasons to choose this, as far as I can see. One
> is that it is more deterministic: The "fast path" is almost as slow as
> the slow path. So you will not be surprised by a sudden increase in
> CPU use because timing is moved slightly. This is on the other hand
> something which can happen with PI.

correct. That, on the other hand, is also a disadvantage: PI is in
essence 'optional boosting on an as-needed basis', while priority
ceiling is unconditional boosting. Unconditional boosting hurts average
latencies, which are often just as important as the worst-case
latencies.

For example: there is a high-prio, a medium-prio and a low-prio task.
There is a critical-section lock, used only by the high-prio and the
low-prio task.

Under the priority ceiling logic, we manually assign 'high-prio' to the
lock [a step that is redundant information and an additional source of
coding/design bugs]. When the low-prio task takes the lock, it is
unconditionally boosted to high-prio, and it might then delay the
medium-prio task - even if no high-prio task is running right then.

If PI is used, the kernel will automatically boost the low-prio task to
high-prio, but only if the high-prio task actually tries to take that
lock. Otherwise the low-prio task remains at low prio - not delaying the
medium-prio task. So the average delays of the medium-prio task improve.

[ Furthermore, theoretically, if the application has a workflow pattern
in which the medium-prio and high-prio tasks will never run at the
same time, then priority ceiling logic would degrade the worst-case
latency of the medium-prio task too. ]

[ for completeness: PI is transitive as well, so if at such a moment the
  high-prio task preempts the medium-prio task (which preempted the
  low-prio task holding the lock) and then wants to take the lock, the
  low-prio task will be boosted to high-prio and can finish the critical
  section to hand over the lock to the high-prio task. ]

i.e. PI, as implemented by this patchset, is clearly superior to
priority ceiling logic.
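
for reference, this is how an application requests PI behavior via the
standard POSIX mutex protocol attribute, PTHREAD_PRIO_INHERIT [a
minimal sketch of the userspace side; glibc is expected to map this
attribute onto the PI-futex kernel support]:

	#include <pthread.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		pthread_mutexattr_t attr;
		pthread_mutex_t lock;
		int err;

		pthread_mutexattr_init(&attr);
		/* ask for priority inheritance: the owner is boosted only
		 * if a higher-prio thread actually blocks on the lock - no
		 * per-lock priority has to be assigned up front */
		err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
		if (err) {
			fprintf(stderr, "setprotocol: %s\n", strerror(err));
			return 1;
		}
		pthread_mutex_init(&lock, &attr);
		pthread_mutexattr_destroy(&attr);

		pthread_mutex_lock(&lock);
		/* critical section - the uncontended case stays in userspace */
		pthread_mutex_unlock(&lock);

		pthread_mutex_destroy(&lock);
		return 0;
	}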

furthermore, there's a usability advantage to PI too: programmers often
find it more natural to assign priorities to a given 'workflow'
(task/thread) than to every lock - i.e. PI is more natural, and hence
easier to design for.

but nevertheless, the PRIO_PROTECT API for POSIX mutexes does exist, and
we can (and will) support it from userspace. All that was needed was to
make sure that setscheduler() is fully PI-aware: it must not decrease a
task's priority to below that of the highest-priority waiter, and it
must 'remember' the setscheduler() priority [and restore it] after PI
effects wear off. This is implemented by our PI code anyway, so we get
correct priority ceiling (PRIO_PROTECT) behavior 'for free'.
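
for completeness, the POSIX side of PRIO_PROTECT looks like this [a
sketch; make_protect_mutex() is a made-up helper, the attribute calls
are the standard POSIX ones]:

	#include <pthread.h>

	/* priority ceiling via the standard POSIX attributes: the
	 * ceiling must be assigned per lock, up front - the redundant
	 * step criticized above */
	static int make_protect_mutex(pthread_mutex_t *m, int ceiling)
	{
		pthread_mutexattr_t attr;
		int err;

		pthread_mutexattr_init(&attr);
		err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
		if (!err)
			err = pthread_mutexattr_setprioceiling(&attr, ceiling);
		if (!err)
			err = pthread_mutex_init(m, &attr);
		pthread_mutexattr_destroy(&attr);
		return err;
	}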

Ingo