Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

From: Steven Sistare
Date: Wed Oct 31 2018 - 11:44:26 EST


On 10/29/2018 3:34 PM, Valentin Schneider wrote:
> On 26/10/2018 19:28, Steven Sistare wrote:
>> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> [...]
>>>
>>> I was thinking that perhaps we could have scenarios where some rq's
>>> keep stealing tasks off of each other and we end up circulating tasks
>>> between CPUs. Now, that would only happen if we had a handful of tasks
>>> with a very tiny period, and I'm not familiar with (real) such hyperactive
>>> workloads similar to those generated by hackbench where that could happen.
>>
>> That will not happen with the current code, as it only steals if nr_running > 1.
>> The src loses a task, the dst gains it and has nr_running == 1, so it will not
>> be re-stolen.
>
> That's indeed fine, I was thinking of something like this:
>
> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>
> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
> can potentially steal the non-running task from r_2. Sometime later the task
> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
> and r_2->nr_running == 1.
>
> IOW we've swapped their role in that example, and the whole thing can
> repeat.
>
> The shorter the period of those tasks, the more we'll migrate them
> between rq's, hence why I wonder if we shouldn't have some sort of
> throttling.

Stealing is still the right move in this scenario. Idle cycles become useful
cycles. The only cost is the CPU time to dequeue from a remote rq and
enqueue on the local rq. Earlier we discussed skipping try_steal() if avg_idle
is very small, on the order of 10 usec. I think that type of throttling would
cover your scenario. I will add it in my next version.

>> If we modify the code to handle misfits, we may steal when src nr_running == 1,
>> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
>> and never slow from fast, so no tug of war.
>>
>>> In short, I wonder if we should have task_hot() in there. Drawing a
>>> parallel with load_balance(), even if load-balancing is happening between
>>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>>> with adding a task_hot() check in here?
>>
>> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
>> and it reduced steals and overall performance.
>
> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
> sched feature and the exec_start threshold.
>
> The first one should be sidestepped in the stealing case since we won't
> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>
> We might want to sidestep it when we are doing balancing within an LLC
> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
> in such cases.
>
> In any case, I think it would make sense to add some LLC conditions to
> task_hot() so that
> - regular load_balance() can also benefit from them

This is probably a good idea (lower threshold for task_hot within LLC).
I would rather see it done as a separate patch, with a separate performance
evaluation, as it will affect all workloads, even those that do not steal.
A load balancing migration when !task_hot() may be performed even when
the dst CPU already has a task to run, so the migration may or may not
improve utilization. By contrast, a newly idle CPU that does not find
work goes idle and definitely wastes cycles. Note how
migrate_degrades_locality() chooses migration regardless of preferred node
when the dst is idle:

/* Leaving a core idle is often worse than degrading locality. */
if (env->idle != CPU_NOT_IDLE)
return -1;

I apply the same principle in can_migrate_task_llc().

> - task stealing has at least some sort of throttling
>
>
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

Agreed, but these are all magic numbers anyway :)

>>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>>> investigate this task bouncing, but I didn't really see any of it. That was
>>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>>> with more CPUs.
>>>
>>> ----->8-----
>>>
>>> activations: # of task activations (task starts running)
>>> cpu_migrations: # of activations where cpu != prev_cpu
>>> % stats are percentiles
>>>
>>> - STEAL:
>>>
>>> | stat | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count | 2005.000000 | 2005.000000 |
>>> | mean | 16.244888 | 290.608479 |
>>> | std | 38.963138 | 253.003528 |
>>> | min | 0.000000 | 3.000000 |
>>> | 50% | 3.000000 | 239.000000 |
>>> | 75% | 8.000000 | 436.000000 |
>>> | 90% | 45.000000 | 626.000000 |
>>> | 99% | 188.960000 | 1073.000000 |
>>> | max | 369.000000 | 1417.000000 |
>>>
>>> - NO_STEAL:
>>>
>>> | stat | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count | 2005.000000 | 2005.000000 |
>>> | mean | 15.260848 | 297.860848 |
>>> | std | 46.331890 | 253.210813 |
>>> | min | 0.000000 | 3.000000 |
>>> | 50% | 3.000000 | 252.000000 |
>>> | 75% | 7.000000 | 444.000000 |
>>> | 90% | 32.600000 | 643.600000 |
>>> | 99% | 214.880000 | 1127.520000 |
>>> | max | 467.000000 | 1547.000000 |
>>>
>>> ----->8-----
>>>
>>> Otherwise, my only other concern at the moment is that since stealing
>>> doesn't care about load, we could steal a task that would cause a big
>>> imbalance, which wouldn't have happened with a call to load_balance().
>>>
>>> I don't think this can be triggered with a symmetrical workload like
>>> hackbench, so I'll go explore something else.
>>
>> The dst is about to go idle with zero load, so stealing can only improve the
>> instantaneous balance between src and dst. For longer term average load, we
>> still rely on periodic load_balance to make adjustments.
>
> Right, so my line of thinking was that by not doing a load_balance() and
> taking a shortcut (stealing a task), we may end up just postponing a
> load_balance() to after we've stolen a task. I guess in those cases
> there's no magic trick to be found and we just have to deal with it.

In the current code I call idle_balance/load_balance first and then try_steal.
If idle_balance fails because of cost, then it has effectively postponed itself,
independently of stealing. The next successful call to load_balance will
correct any imbalance caused by stealing.

> And then there's some of the logic like we have in update_sd_pick_busiest()
> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
> then if such tasks are waiting to be run and a LITTLE frees itself up,
> I *think* it's okay to steal it.

Should be OK to steal. If a BIG subsequently goes idle, load_balance will move
the task to the BIG, or the BIG may steal it when we support misfit stealing.

Questions for you, Valentin:

- Should misfit stealing be a separate patch, after my series? I prefer that,
so we get stealing in peoples hands as soon as possible. I think separating
it is OK because stealing should not cause any regression for misfits, as
my code still calls idle_balance/load_balance, which handles misfits.

- Who should implement misfit stealing -- you, me, someone else? I have no
preference.

- Steve