Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

From: Dongsheng Yang
Date: Mon May 05 2014 - 03:32:17 EST


Hi all,
Thanks for your reply, and sorry for the late response.

On 04/23/2014 03:18 AM, Peter Zijlstra wrote:
> On Tue, Apr 22, 2014 at 10:10:52AM -0700, bsegall@xxxxxxxxxx wrote:
> > This is all expected behavior, and the somewhat less than useful trace
> > events are expected. A task setting p->state to TASK_RUNNING without
> > locks is fine if and only if p == current. The standard deschedule loop
> > is basically:
> >
> > 	while (1) {
> > 		set_current_state(TASK_(UN)INTERRUPTIBLE);
> > 		if (should_still_sleep)
> > 			schedule();
> > 		else
> > 			break;
> > 	}
> > 	set_current_state(TASK_RUNNING);
> >
> > which can produce this in a race.
> >
> > The only problem this causes is a wasted check_preempt_curr call in the
> > racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
> > Note that even if you did recheck in ttwu_do_wakeup you could still race
> > and get an "inaccurate" trace event.

Yes, even if I recheck it in ttwu_do_wakeup(), I could still race.

For now I cannot come up with a good way to avoid this race, but I
think the cost of the race itself is not too high.
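
To make the race concrete, the interleaving looks roughly like this
(an illustration only, following the deschedule loop quoted above, not
a real trace):

    CPU0 (task p, descheduling)              CPU1 (waker)
    ---------------------------              ------------
    set_current_state(TASK_INTERRUPTIBLE);
                                             try_to_wake_up(p)
                                               sees p->state != TASK_RUNNING
    should_still_sleep turns false,
    so p leaves the loop and does
    set_current_state(TASK_RUNNING);
                                               takes rq->lock, finds p still
                                               on the run queue, so it only
                                               does a wasted check_preempt_curr()
                                               and emits sched:sched_wakeup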

Regarding the sched:sched_wakeup event, we can see the bug information
I mentioned in the first mail of this thread:

[root@yds-PC linux]# perf sched latency|tail
bash:25186 | 0.410 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
bash:25174 | 0.407 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
bash:25268 | 0.367 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
bash:25279 | 0.358 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
bash:25238 | 0.286 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
-----------------------------------------------------------------------------------------
TOTAL: | 1344.539 ms | 9604 |
---------------------------------------------------
INFO: 0.557% state machine bugs (51 out of 9161)

It is caused by the issue we discussed in this thread: ttwu_do_wakeup()
can be called to wake up a task which is already on the run queue.
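
The call path for that case, as I read try_to_wake_up(), is:

	try_to_wake_up()
	  -> ttwu_remote()          /* p->on_rq is still set */
	       -> ttwu_do_wakeup()  /* fires sched:sched_wakeup anyway */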

There are two solutions in my mind:
* Add a new trace event, such as sched:sched_enqueue. Then we can trace
that event and get the timestamp when a task is enqueued and starts
waiting for a CPU. In this way, we can calculate the latency as
(sched_in_time - sched_enqueue_time) in `perf sched latency`. (A sketch
of such an event follows the list.)

* Move the current trace point from ttwu_do_wakeup() to
ttwu_activate(). Currently sched:sched_wakeup tells the user very
little. When we get a sched:sched_wakeup:
a) We cannot say the task was inserted into the run queue; the event
also fires for a task which is already on_rq and only has its
task->state changed to TASK_RUNNING.
b) We cannot say task->state changed from TASK_(UN)INTERRUPTIBLE to
TASK_RUNNING; sometimes task->state has already been changed to
TASK_RUNNING on another cpu.
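
For the first option, the new tracepoint could look something like the
sketch below. It is only illustrative: the event name and fields are
made up, and it would sit next to the existing events in
include/trace/events/sched.h, with a trace_sched_enqueue(p) call added
in ttwu_activate():

	TRACE_EVENT(sched_enqueue,

		TP_PROTO(struct task_struct *p),

		TP_ARGS(p),

		TP_STRUCT__entry(
			__array(char,	comm,	TASK_COMM_LEN)
			__field(pid_t,	pid)
		),

		TP_fast_assign(
			memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
			__entry->pid	= p->pid;
		),

		TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
	);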

I prefer the second one; as it stands, sched_wakeup tells the user
almost nothing. A rough sketch of the move follows.
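
This is an untested sketch against how ttwu_activate() currently looks
in kernel/sched/core.c; only the trace_sched_wakeup() line is new:

	static void
	ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
	{
		activate_task(rq, p, en_flags);
		p->on_rq = 1;

		/* moved here from ttwu_do_wakeup(): only fires when p is
		 * really (re)inserted into the run queue */
		trace_sched_wakeup(p, true);

		/* if a worker is waking up, notify workqueue */
		if (p->flags & PF_WQ_WORKER)
			wq_worker_waking_up(p, cpu_of(rq));
	}

With the tracepoint there, the event can no longer fire for a task that
is already on_rq, so both a) and b) above should go away.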

> > Heck, even if the ttwu is
> > _necessary_ because p is currently trying to take rq->lock to
> > deschedule, you won't get a matching sched_switch event, because the
> > ttwu is running before schedule is.
> >
> > You could sorta fix this I guess by tracking every write to p->state
> > with trace events, but that would be a somewhat different change, and
> > might be considered too expensive for all I know (and the trace events
> > could /still/ be resolved in a different order across cpus compared to
> > p->state's memory).
> Ah, you're saying that a second task could try a spurious wakeup between
> set_current_state() and schedule(). Yes, that'll trigger this indeed.
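
As for tracking every write to p->state: I guess that would mean a
tracepoint wrapped around set_current_state() and friends, something
like this purely hypothetical helper (trace_sched_state_change is a
made-up event name, not an existing tracepoint):

	#define set_current_state_traced(state)				\
	do {								\
		trace_sched_state_change(current, (state));		\
		set_current_state(state);				\
	} while (0)

As noted above, that is a different change, and it could well be too
expensive to have enabled by default.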



