Re: [PATCH 00/10] steal tasks to improve CPU utilization

From: Valentin Schneider
Date: Thu Oct 25 2018 - 07:31:18 EST



On 24/10/2018 20:27, Steven Sistare wrote:
[...]
> Hi Valentin,
>
> Asymmetric systems could maintain a separate bitmap for misfits; set a bit
> when a CPU goes on CPU, clear it going off. When a fast CPU goes new idle,
> it would first search the misfits mask, then search cfs_overload_cpus.
> The misfits logic would be conditionalized with CONFIG or sched feat static
> branches so symmetric systems do not incur extra overhead.
>

That sounds reasonable - besides, misfit already introduces a
sched_asym_cpucapacity static key. I'll try to play around with that.

>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>> not such a big deal - were were piggy-backing this on idle_balance() just
>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>> anyway.
>
> Agreed.
>
>> Another thing - in your test cases, what is the most prevalent cause of
>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>> that fails to find a task (e.g. because the imbalance is not deemed big
>> enough), or is it the idle migration cost logic that prevents
>> load_balance() from running to completion?
>
> The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
> so idle_balance bails at the top:
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> ...
> goto out
>
> For other tests, we get past that clause but bail from a domain:
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> ...
> break;
>
>> In the first case, try_steal() makes perfect sense to me. In the second
>> case, I'm not sure if we really want to pull something if we know (well,
>> we *think*) we're about to resume the execution of some other task.
>
> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go
> off CPU, particularly for chatty workloads like hackbench. The performance
> data bear this out. For the higher loads, the average timeslice for
> hackbench
>

Thanks for the explanation. AIUI the big difference here is that try_steal()
is considerably cheaper than load_balance(), so the rq->avg_idle concerns
matter less (or at least, on a considerably smaller scale).

> Perhaps I could skip try_steal() if avg_idle is very small, although with
> hackbench I have seen average time slice as small as 10 microsec under
> high load and preemptions. I'll run some experiments.
>

That might be a safe thing to do. In the same department, maybe we could
skip try_steal() if we bail out of idle_balance() because
!(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
are decoupled, they should express the same thing here.

>>> We could merge the stealing code into the idle_balance() code to get a
>>> union of the two, but IMO that would be less readable.
>>>
>>> We could remove the core and socket levels from idle_balance()
>>
>> I understand that as only doing load_balance() at DIE level in
>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>> those misfit migrations are done at DIE level), is that correct?
>
> Correct.
>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>> level, which could cause issues there.
>
> We could keep idle_balance for this level and fall back to stealing as in
> my patch, or you could extend the misfits bitmap to also include CPUs
> with reduced memory bandwidth and active tasks. (if I understand the asymmetry
> correctly).
>

It's mostly Âarch asymmetry, so by "asymmetry at MC level" I meant "we'll
see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
in some way or another), that could work.

>>> and let
>>> stealing handle those levels. I think that makes sense after stealing
>>> performance is validated on more architectures, but we would still have
>>> two different mechanisms.
>>>
>>> - Steve
>>
>> I'll try out those patches on top of the misfit series to see how the
>> whole thing behaves.
>
> Very good, thanks.
>
> - Steve
>