Re: [PATCH 00/10] steal tasks to improve CPU utilization

From: Steven Sistare
Date: Thu Oct 25 2018 - 08:23:19 EST


On 10/25/2018 7:31 AM, Valentin Schneider wrote:
>
> On 24/10/2018 20:27, Steven Sistare wrote:
> [...]
>> Hi Valentin,
>>
>> Asymmetric systems could maintain a separate bitmap for misfits; set a bit
>> when a CPU goes on CPU, clear it going off. When a fast CPU goes new idle,
>> it would first search the misfits mask, then search cfs_overload_cpus.
>> The misfits logic would be conditionalized with CONFIG or sched feat static
>> branches so symmetric systems do not incur extra overhead.
>
> That sounds reasonable - besides, misfit already introduces a
> sched_asym_cpucapacity static key. I'll try to play around with that.
>
>>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>>> not such a big deal - were were piggy-backing this on idle_balance() just
>>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>>> anyway.
>>
>> Agreed.
>>
>>> Another thing - in your test cases, what is the most prevalent cause of
>>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>>> that fails to find a task (e.g. because the imbalance is not deemed big
>>> enough), or is it the idle migration cost logic that prevents
>>> load_balance() from running to completion?
>>
>> The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
>> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
>> so idle_balance bails at the top:
>> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> ...
>> goto out
>>
>> For other tests, we get past that clause but bail from a domain:
>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> ...
>> break;
>>
>>> In the first case, try_steal() makes perfect sense to me. In the second
>>> case, I'm not sure if we really want to pull something if we know (well,
>>> we *think*) we're about to resume the execution of some other task.
>>
>> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go
>> off CPU, particularly for chatty workloads like hackbench. The performance
>> data bear this out. For the higher loads, the average timeslice for
>> hackbench
>>
>
> Thanks for the explanation. AIUI the big difference here is that try_steal()
> is considerably cheaper than load_balance(), so the rq->avg_idle concerns
> matter less (or at least, on a considerably smaller scale).

Right.

>> Perhaps I could skip try_steal() if avg_idle is very small, although with
>> hackbench I have seen average time slice as small as 10 microsec under
>> high load and preemptions. I'll run some experiments.
>
> That might be a safe thing to do. In the same department, maybe we could
> skip try_steal() if we bail out of idle_balance() because
> !(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
> are decoupled, they should express the same thing here.

I tried that in an earlier version of my code:

new_tasks = idle_balance(rq, rf);
if (new_tasks == 0 && rq->rd->overload)
new_tasks = try_steal(rq, rf);

but I did not see any performance improvement vs without the overload check,
so I omitted it for simplicity.

- Steve

>>>> We could merge the stealing code into the idle_balance() code to get a
>>>> union of the two, but IMO that would be less readable.
>>>>
>>>> We could remove the core and socket levels from idle_balance()
>>>
>>> I understand that as only doing load_balance() at DIE level in
>>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>>> those misfit migrations are done at DIE level), is that correct?
>>
>> Correct.
>>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>>> level, which could cause issues there.
>>
>> We could keep idle_balance for this level and fall back to stealing as in
>> my patch, or you could extend the misfits bitmap to also include CPUs
>> with reduced memory bandwidth and active tasks. (if I understand the asymmetry
>> correctly).
>>
>
> It's mostly Âarch asymmetry, so by "asymmetry at MC level" I meant "we'll
> see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
> to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
> in some way or another), that could work.
>
>>>> and let
>>>> stealing handle those levels. I think that makes sense after stealing
>>>> performance is validated on more architectures, but we would still have
>>>> two different mechanisms.
>>>>
>>>> - Steve
>>>
>>> I'll try out those patches on top of the misfit series to see how the
>>> whole thing behaves.
>>
>> Very good, thanks.
>>
>> - Steve
>>