Re: Mutex vs semaphores scheduler bug

From: Török Edwin
Date: Tue Oct 20 2009 - 15:02:13 EST


On 2009-10-17 18:32, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 00:44 +0100, David Howells wrote:
>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>>> The problem appears to be that rwsem doesn't allow lock-stealing
>> With good reason. rwsems can be read or write locked for a long time - so if
>> readers can jump the queue on read-locked rwsems, then writer starvation is a
>> real possibility. I carefully implemented it so that it is a strict FIFO to
>> avoid certain problems I was having.
>
> Well, it kinda sucks that rwsem is slower than a mutex.
>
> What about allowing writer stealing when the next contending task is a
> writer?
>

Also, can the scheduler be improved so that it doesn't unnecessarily wake up
processes that can't take the semaphore anyway?

AFAICT (in Linux 2.6.31.3) this is what happens:
the semaphore fails to be acquired -> rwsem.c:rwsem_down_failed_common
-> if nobody holds the semaphore, wake up the process(es) that can acquire it
-> loop calling schedule() until woken

The thread holding the semaphore then releases it (no wakeups!), and the
scheduler will schedule the next process.
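
For reference, this is roughly the slow path described above, paraphrased and
condensed from lib/rwsem.c in 2.6.31 (treat it as a sketch, not an exact quote):

/*
 * Sketch of rwsem_down_failed_common(): the failing task queues itself,
 * wakes the front of the queue if nobody currently holds the lock, then
 * loops in schedule() until a waker hands it the lock by clearing
 * waiter->task.
 */
static struct rw_semaphore *
rwsem_down_failed_common(struct rw_semaphore *sem,
			 struct rwsem_waiter *waiter,
			 signed long adjustment)
{
	struct task_struct *tsk = current;
	signed long count;

	set_task_state(tsk, TASK_UNINTERRUPTIBLE);

	/* queue ourselves on the semaphore's wait list */
	spin_lock_irq(&sem->wait_lock);
	waiter->task = tsk;
	get_task_struct(tsk);
	list_add_tail(&waiter->list, &sem->wait_list);

	/* we're no longer actively locking; if nobody holds the lock,
	   wake the process(es) at the front of the queue */
	count = rwsem_atomic_update(adjustment, sem);
	if (!(count & RWSEM_ACTIVE_MASK))
		sem = __rwsem_do_wake(sem, 0);

	spin_unlock_irq(&sem->wait_lock);

	/* wait to be given the lock: loop calling schedule() until woken */
	for (;;) {
		if (!waiter->task)
			break;
		schedule();
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
	}

	tsk->state = TASK_RUNNING;
	return sem;
}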

There are several things that are suboptimal here IMHO:
1. The task calls schedule(), but does that prevent the scheduler from
scheduling it again while waiter->task is still non-NULL? If not, there are
useless wakeups (the task gets scheduled, sees that waiter->task is still
non-NULL, and calls schedule() again).

2. The process at the front of the queue is woken only if there is
contention, which introduces unnecessary latency for that task while it
waits for the semaphore.

3. Because of 2), a task on another CPU waiting for the semaphore is only
woken when schedule() is called, i.e. when another thread tries to acquire
the lock and fails (see the sketch of the wake-up side just after this list).
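
The wake-up side lives in __rwsem_do_wake() in lib/rwsem.c; the per-waiter
hand-off it performs looks roughly like this (paraphrased and condensed, the
wrapper name rwsem_grant_to_front() is made up for illustration, and the
count bookkeeping that actually grants the lock is omitted):

/*
 * Sketch of the hand-off done by __rwsem_do_wake(): clearing waiter->task
 * is what lets the "for (;;) { ... schedule(); }" loop above break out.
 * Must be called with sem->wait_lock held.
 */
static void rwsem_grant_to_front(struct rw_semaphore *sem)
{
	struct rwsem_waiter *waiter;
	struct task_struct *tsk;

	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
	list_del(&waiter->list);
	tsk = waiter->task;
	/*
	 * Don't touch 'waiter' after ->task is cleared: it lives on the
	 * sleeper's stack and may vanish as soon as the sleeper sees NULL.
	 */
	smp_mb();
	waiter->task = NULL;
	wake_up_process(tsk);
	put_task_struct(tsk);
}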

Suggested solution:
schedule() should know that scheduling this process is useless because it
won't get the lock, so it shouldn't attempt to schedule it (maybe by setting
a flag in the task state).
When a thread releases a semaphore, it should also try to wake the other
threads that were waiting on it (if any).

With both of these changes the scheduler will know that when it schedules
the task it will be able to acquire the semaphore (it was woken by
wake_up_process() in __rwsem_do_wake(), which would also clear the
do-not-wake flag). Also, the thread will be scheduled as soon as another
thread (on another CPU) releases the semaphore, instead of having to wait
for a timeslice to elapse.
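
A minimal sketch of the second suggestion, assuming a helper exists that
hands the lock to the head of sem->wait_list; up_write_and_wake() and
rwsem_wake_front() below are made-up names, not existing kernel API:

/*
 * Hypothetical sketch only: on release, immediately kick the head of the
 * wait queue instead of leaving it to be noticed on the next contended
 * acquire.  rwsem_wake_front() is a made-up helper standing in for "hand
 * the lock to the first waiter on sem->wait_list".
 */
static inline void up_write_and_wake(struct rw_semaphore *sem)
{
	unsigned long flags;

	up_write(sem);				/* release as today */

	spin_lock_irqsave(&sem->wait_lock, flags);
	if (!list_empty(&sem->wait_list))
		rwsem_wake_front(sem);		/* hypothetical helper */
	spin_unlock_irqrestore(&sem->wait_lock, flags);
}

The point is only to move the wake-up from the next contended acquire to the
release itself; the actual hand-off logic would stay in the existing wake path.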

Here is a snippet of an ftrace sched_switch trace:
<...>-27515 [000] 44338.386867: 27515:120:R + [000] 27522:120:D <...>
<...>-27515 [000] 44338.386868: 27515:120:R + [000] 27511:120:D <...>
<...>-27515 [000] 44338.386869: 27515:120:R + [001] 27523:120:D <...>
<...>-27515 [000] 44338.386875: 27515:120:R + [003] 27512:120:D <...>
<...>-27515 [000] 44338.386877: 27515:120:R + [000] 27517:120:S <...>
<...>-27515 [000] 44338.386905: 27515:120:D ==> [000] 27517:120:R <...>
<...>-27517 [000] 44338.386907: 27517:120:S ==> [000] 27522:120:R <...>
<...>-27522 [000] 44338.386915: 27522:120:D ==> [000] 27511:120:R <...>
<...>-27511 [000] 44338.386917: 27511:120:R + [001] 27523:120:D <...>
<...>-27511 [000] 44338.386918: 27511:120:D ==> [000] 0:140:R <idle>
<idle>-0 [000] 44338.386973: 0:140:R ==> [000] 27522:120:R <...>

If I interpret this correctly, 27515 wakes a bunch of readers (27511, 27523,
27512, 27517), then switches to 27517. However, 27517 schedules away again
(!) after 2us, switching to 27522, which runs for 8us and then switches to
27511; 27511 again runs only for a short while before the CPU finally goes
idle. All this scheduling ping-pong looks like a waste: none of the tasks
that were woken up did useful work.

Another situation is this scheduler ping-pong with md4_raid10: notice how
27961 wakes up md4_raid10 and switches to it, md4_raid10 switches back to
27961, which then switches to idle (2us later). md4_raid10 could have gone
directly to idle...:

md4_raid10-982 [000] 44746.579830: 982:115:S ==> [000] 27961:120:R lt-clamd
lt-clamd-27964 [003] 44746.579892: 27964:120:R + [001] 27966:120:D lt-clamd
lt-clamd-27964 [003] 44746.580065: 27964:120:D ==> [003] 0:140:R <idle>
lt-clamd-27961 [000] 44746.580069: 27961:120:R + [003] 27964:120:D lt-clamd
<idle>-0 [003] 44746.580073: 0:140:R ==> [003] 27964:120:R lt-clamd
lt-clamd-27961 [000] 44746.580912: 27961:120:R + [002] 27960:120:S lt-clamd
lt-clamd-27961 [000] 44746.580945: 27961:120:D + [000] 982:115:S md4_raid10
lt-clamd-27961 [000] 44746.580946: 27961:120:D ==> [000] 982:115:R md4_raid10
md4_raid10-982 [000] 44746.580948: 982:115:S ==> [000] 27961:120:D lt-clamd
lt-clamd-27961 [000] 44746.580950: 27961:120:D ==> [000] 0:140:R <idle>
lt-clamd-27964 [003] 44746.581054: 27964:120:R + [000] 27961:120:D lt-clamd
<idle>-0 [000] 44746.581064: 0:140:R ==> [000] 27961:120:R lt-clamd
lt-clamd-27964 [003] 44746.581092: 27964:120:R + [002] 27960:120:S lt-clamd
lt-clamd-27964 [003] 44746.581125: 27964:120:D + [000] 982:115:S md4_raid10
lt-clamd-27964 [003] 44746.581128: 27964:120:D ==> [003] 27965:120:R lt-clamd
lt-clamd-27961 [000] 44746.581128: 27961:120:R ==> [000] 982:115:R md4_raid10


Best regards,
--Edwin