Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.

From: Atish Patra
Date: Wed Nov 01 2017 - 02:10:20 EST




On 10/31/2017 03:48 AM, Mike Galbraith wrote:
> On Tue, 2017-10-31 at 09:20 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
>>> Currently, multiple tasks can wake up on the same cpu from the
>>> select_idle_sibling() path if they wake up simultaneously and last
>>> ran on the same llc. This happens because an idle cpu is not updated
>>> until its idle task is scheduled out. Any task waking during that
>>> period may select that cpu as its wakeup candidate.
>>>
>>> Introduce a per-cpu variable that is set as soon as a cpu is
>>> selected for wakeup for any task. This prevents other tasks from
>>> selecting the same cpu again. Note: this does not close the race
>>> window, but narrows it to the access of the per-cpu variable. If two
>>> wakee tasks access the per-cpu variable at the same time, they may
>>> still select the same cpu, but the window is considerably smaller.
>> The very most important question; does it actually help? What
>> benchmarks, give what numbers?

Here are the numbers from one of the OLTP configurations on an 8-socket
x86 machine:

kernel     txn/minute (normalized)    user/sys
baseline   1.0                        80/5
pcpu       1.021                      84/5

The throughput gain is small and close to the run-to-run variation.
The schedstat data (added for testing in patch 2/2) indicates that the
race condition being addressed does occur many times, but apparently not
often enough to produce a significant throughput change.
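
For illustration (2/2 is not in this thread), the counter is conceptually
along the following lines; the names here are made up for the sketch and
are not taken from either patch:

/* Sketch only -- hypothetical names; this would sit in kernel/sched/fair.c. */

/* Set when a waker picks this idle cpu, cleared once the cpu schedules. */
static DEFINE_PER_CPU(int, claimed_for_wakeup);

/* Count wakeups whose chosen cpu was already claimed by a concurrent waker. */
static inline void account_claim_collision(int cpu)
{
	if (per_cpu(claimed_for_wakeup, cpu))
		schedstat_inc(cpu_rq(cpu)->sis_claim_collisions); /* hypothetical rq field */
}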

None of the other benchmarks I tested (TPC-C, hackbench, schbench, swingbench) showed any regression.

I will let Joel post numbers from Android benchmarks.
> I played with something ~similar (cmpxchg() idle cpu reservation)

I had an atomic version earlier as well. Peter's suggestion of a per-cpu
variable seems to perform slightly better than the atomic one, so this
patch carries the per-cpu version.
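
Roughly, the difference between the two flavours, reusing the claim flag
sketched above (again just a sketch, not the code from either version):

/*
 * Per-cpu flavour: plain test then store. Two wakers can still race
 * between the test and the store, but the window shrinks to a couple of
 * instructions instead of the whole idle->running transition.
 */
static bool try_claim_cpu_percpu(int cpu)
{
	if (per_cpu(claimed_for_wakeup, cpu))
		return false;
	per_cpu(claimed_for_wakeup, cpu) = 1;
	return true;
}

/*
 * Atomic flavour: cmpxchg() closes the window completely, at the cost of
 * an atomic in the select_idle_sibling() fast path on every wakeup.
 */
static bool try_claim_cpu_atomic(int cpu)
{
	return cmpxchg(per_cpu_ptr(&claimed_for_wakeup, cpu), 0, 1) == 0;
}

The cheaper plain store is presumably why the per-cpu version measured
slightly better than the atomic one here.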

> a while back in the context of schbench, and it did help that,

Do you have the schbench configuration somewhere so I can test with it?
I tried various configurations but did not see any improvement or
regression.

> but for generic fast mover benchmarks, the added overhead had the
> expected effect, it shaved throughput a wee bit (rob Peter, pay Paul,
> repeat).

Which benchmark? Is it hackbench or something else? I have not found any
regression yet in my testing. I would be happy to test any other
benchmark or a different hackbench configuration.

Regards,
Atish
> I still have the patch lying about in my rubbish heap, but didn't
> bother to save any of the test results.
>
> -Mike