Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

From: Valentin Schneider
Date: Wed Oct 31 2018 - 14:48:26 EST




On 31/10/2018 15:43, Steven Sistare wrote:
> On 10/29/2018 3:34 PM, Valentin Schneider wrote:
[...]
>> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
>> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>>
>> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
>> can potentially steal the non-running task from r_2. Sometime later the task
>> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
>> and r_2->nr_running == 1.
>>
>> IOW we've swapped their role in that example, and the whole thing can
>> repeat.
>>
>> The shorter the period of those tasks, the more we'll migrate them
>> between rq's, hence why I wonder if we shouldn't have some sort of
>> throttling.
>
> Stealing is still the right move in this scenario. Idle cycles become useful
> cycles. The only cost is the CPU time to dequeue from a remote rq and
> enqueue on the local rq. Earlier we discussed skipping try_steal() if avg_idle
> is very small, on the order of 10 usec. I think that type of throttling would
> cover your scenario. I will add it in my next version.
>

Sounds good to me. Out of curiosity, how did you establish this 10 Âsec
threshold? I guess we just want something very tiny but still big enough
to broadly cover try_steal() worst case exec times.

[...]
>> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
>> sched feature and the exec_start threshold.
>>
>> The first one should be sidestepped in the stealing case since we won't
>> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>>
>> We might want to sidestep it when we are doing balancing within an LLC
>> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
>> in such cases.
>>
>> In any case, I think it would make sense to add some LLC conditions to
>> task_hot() so that
>> - regular load_balance() can also benefit from them
>
> This is probably a good idea (lower threshold for task_hot within LLC).
> I would rather see it done as a separate patch, with a separate performance
> evaluation, as it will affect all workloads, even those that do not steal.

Agreed.

> A load balancing migration when !task_hot() may be performed even when
> the dst CPU already has a task to run, so the migration may or may not
> improve utilization. By contrast, a newly idle CPU that does not find
> work goes idle and definitely wastes cycles. Note how
> migrate_degrades_locality() chooses migration regardless of preferred node
> when the dst is idle:
>
> /* Leaving a core idle is often worse than degrading locality. */
> if (env->idle != CPU_NOT_IDLE)
> return -1;
>
> I apply the same principle in can_migrate_task_llc().
>
[...]
>> Right, so my line of thinking was that by not doing a load_balance() and
>> taking a shortcut (stealing a task), we may end up just postponing a
>> load_balance() to after we've stolen a task. I guess in those cases
>> there's no magic trick to be found and we just have to deal with it.
>
> In the current code I call idle_balance/load_balance first and then try_steal.
> If idle_balance fails because of cost, then it has effectively postponed itself,
> independently of stealing. The next successful call to load_balance will
> correct any imbalance caused by stealing.
>
>> And then there's some of the logic like we have in update_sd_pick_busiest()
>> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
>> then if such tasks are waiting to be run and a LITTLE frees itself up,
>> I *think* it's okay to steal it.
>
> Should be OK to steal. If a BIG subsequently goes idle, load_balance will move
> the task to the BIG, or the BIG may steal it when we support misfit stealing.
>
> Questions for you, Valentin:
>
> - Should misfit stealing be a separate patch, after my series? I prefer that,
> so we get stealing in peoples hands as soon as possible. I think separating
> it is OK because stealing should not cause any regression for misfits, as
> my code still calls idle_balance/load_balance, which handles misfits.
>

I don't have any objections, Quentin (added on cc) might want to disable
stealing when !overutilized for Energy Aware Scheduling, but then that
also depends on what gets in first :)

> - Who should implement misfit stealing -- you, me, someone else? I have no
> preference.

It would make more sense if I did that, as it's easy for me to test it -
unless you're really curious about Arm stuff and want to have some fun :)

>
> - Steve
>