Re: wake_wide mechanism clarification

From: Joel Fernandes
Date: Thu Aug 03 2017 - 19:49:02 EST


Hi Michael,

Thanks for your reply.

On Wed, Aug 2, 2017 at 1:26 AM, Michael Wang <yun.wang@xxxxxxxxxxxxxxxx> wrote:
> Hi, Joel
>
> On 07/29/2017 10:13 AM, Joel Fernandes wrote:
>> +Michael Wang on his current email address (old one bounced). (my
>> reply was to Mike Galbraith but I also meant to CC Michael Wang for
>> the discussion). Thanks
>
> Just back from vacation and saw this long long discussion...
>
> I think the guys explained the idea well: wake_wide() just tries to filter
> out the cases that may congest the waker's domain, as much as possible
> without side effects (ideally).
>
> The factor at the very beginning was just a static number, picked through
> a great deal of practical testing; Peter Zijlstra suggested we bring in the
> domain size and make it a flexible factor, which in practice works out
> not that badly.

Yeah making it flexible might be better.

>
> So the simple answer is that we used llc_size since there was no better
> option at that time :-)

:-)

>
> But things change very fast and new features can introduce new cases.
> I'm pretty sure that if we redo the testing the results will be very
> different; however, the idea itself still makes sense to me, at least in theory.
>
> Recently I've also been thinking about the scheduler issue: CFS tries to
> find a general solution for all these cases, and the obvious outcome is
> that all the cases suffer some damage while the scheduler itself gets
> bloated trying to achieve the goal of 'taking care of all'.

My team is definitely seeing some weirdness with wake_wide() itself
not working correctly. We're collecting more data and will do a more
thorough analysis of the behavior. I think it can be improved, and I
will update as soon as we have something.
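
For anyone following along, here is a minimal standalone sketch of the
heuristic as I currently understand it from this thread (illustrative only,
not the literal fair.c source; waker_flips/wakee_flips stand in for the
per-task flip counters and 'factor' for sd_llc_size):

/* Returns 1 to wake wide, 0 to wake affine. */
static int wake_wide_sketch(unsigned int waker_flips,
                            unsigned int wakee_flips,
                            unsigned int factor)
{
        unsigned int master = waker_flips;
        unsigned int slave = wakee_flips;

        if (master < slave) {           /* treat whoever flips more as master */
                unsigned int tmp = master;

                master = slave;
                slave = tmp;
        }

        /* Wide wakeup only when both flip counts are large enough. */
        if (slave < factor || master < slave * factor)
                return 0;               /* affine wakeup */
        return 1;                       /* wide wakeup */
}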

thanks,

-Joel

>
> Regards,
> Michael Wang
>
>
>
>>
>> On Sat, Jul 29, 2017 at 1:01 AM, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>>> Hi Mike,
>>>
>>> I have spent some time understanding the email thread and
>>> previous discussions. Unfortunately the second condition we are
>>> checking in wake_wide() still doesn't make sense to me (mentioned
>>> below) :-(
>>>
>>> On Fri, Jun 30, 2017 at 10:02 AM, Mike Galbraith
>>> <umgwanakikbuti@xxxxxxxxx> wrote:
>>>> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
>>>>> On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
>>>>>
>>>>>> That makes sense that we multiply slave's flips by a factor because
>>>>>> its low, but I still didn't get why the factor is chosen to be
>>>>>> llc_size instead of something else for the multiplication with slave
>>>>>> (slave * factor).
>>>>
>>>>> Yeah I don't know why llc_size was chosen...
>>>>
>>>> static void update_top_cache_domain(int cpu)
>>>> {
>>>>         struct sched_domain_shared *sds = NULL;
>>>>         struct sched_domain *sd;
>>>>         int id = cpu;
>>>>         int size = 1;
>>>>
>>>>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>>>         if (sd) {
>>>>                 id = cpumask_first(sched_domain_span(sd));
>>>>                 size = cpumask_weight(sched_domain_span(sd));
>>>>                 sds = sd->shared;
>>>>         }
>>>>
>>>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>         per_cpu(sd_llc_size, cpu) = size;
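>>>>
>>>> (IOW, per the above, sd_llc_size ends up being the number of CPUs that
>>>> share the waking CPU's last-level cache, and that is what wake_wide()
>>>> uses as 'factor'.)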
>>>>
>>>> The goal of wake_wide() was to approximate when pulling would be a futile
>>>> consolidation effort and counterproductive to scaling. 'course with
>>>> ever-increasing socket size, any 1:N waker is ever more likely to run
>>>> out of CPU for its one and only self (slamming into the scaling wall)
>>>> before it needs to turn its minions loose to conquer the world.
>>>
>>> Actually, the original question was why we have the second condition
>>> as "master < slave * factor" instead of "master < factor"; that's
>>> what didn't make sense to me. Why don't we return 0 from wake_wide() if
>>> master < factor?
>>>
>>> In fact, as the factor is set to llc_size, the condition that makes
>>> sense to me is:
>>>
>>> if ((master + slave) < llc_size)
>>>         return 0;
>>>
>>> In other words, if the master flips and the slave flips together exceed
>>> llc_size, then we are most likely waking up too many tasks as affine
>>> and should switch to wide to prevent overloading.
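>>>
>>> Just to make the comparison concrete, a toy version of what I'm proposing
>>> (hypothetical, not a patch; master/slave are the flip counts as above):
>>>
>>> static int wake_wide_sum_idea(unsigned int master, unsigned int slave,
>>>                               unsigned int llc_size)
>>> {
>>>         if (master + slave < llc_size)
>>>                 return 0;       /* few distinct wakees in total: stay affine */
>>>         return 1;               /* combined flips exceed the LLC: go wide */
>>> }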
>>>
>>> Digging further into the original patch from Michael Wang (I also CC'd
>>> him), this was the code (before you had changed it to master/slave):
>>>
>>> wakee->nr_wakee_switch > factor &&
>>> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>>
>>> To explain the second condition above, Michael Wang said the following in [1]:
>>>
>>> "Furthermore, if waker also has a high 'nr_wakee_switch', imply that multiple
>>> tasks rely on it, then waker's higher latency will damage all of them, pull
>>> wakee seems to be a bad deal."
>>>
>>> Again, I didn't follow why the second condition couldn't just be
>>> waker->nr_wakee_switch > factor, or (waker->nr_wakee_switch +
>>> wakee->nr_wakee_switch) > factor, based on the above explanation from
>>> Michael Wang that I quoted, and why he's instead doing the whole
>>> multiplication there that I was talking about earlier:
>>> "factor * wakee->nr_wakee_switch".
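>>>
>>> For reference, spelling the original check out as a helper (this just
>>> restates the two conditions quoted above, it is not the actual patch):
>>>
>>> static int wake_wide_original(unsigned int waker_switch,
>>>                               unsigned int wakee_switch,
>>>                               unsigned int factor)
>>> {
>>>         if (wakee_switch > factor &&
>>>             waker_switch > factor * wakee_switch)
>>>                 return 1;       /* wake wide */
>>>         return 0;               /* wake affine */
>>> }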
>>>
>>> Rephrasing my question another way: why are we taking the ratio of
>>> master/slave instead of their sum when comparing against the factor?
>>> I am surely missing something here.
>>>
>>> Just taking an example:
>>>
>>> Say we have llc_size = 3 and 3 masters M1, M2 and M3. M1 has 8
>>> slaves, M2 has 4 slaves and M3 has 4 slaves. Only 1 slave is common
>>> to all 3 masters. Also, to make it a bit more interesting, let s8
>>> wake up some random task T0. The master/slave relationships look
>>> like this:
>>>
>>> M1 ---> s1 s2 s3 s4 s5 s6 s7 s8
>>> M2 ---> s8 s9 s10 s11 s15
>>> M3 ---> s8 s12 s13 s14 s16
>>>
>>> s8 ---> T0
>>>
>>>
>>> Let's consider the case of M1 waking up s8. As per the above diagram,
>>> M1 has 8 flips and s8 has 4 flips.
>>>
>>> With llc_size = 3, the condition (slave < factor) would be FALSE, so
>>> we would turn to the (master < slave * factor) condition. This would
>>> be TRUE (8 < 4 * 3), so wake_wide() would return 0 and s8 would be
>>> woken up affine with respect to M1's core.
>>>
>>> So basically, it seems the heuristic is saying (with the help of the
>>> second condition, master < slave * factor) that it's a good idea for
>>> s8 to be woken up affine with respect to M1's core. Why is it a good
>>> idea to do that? It seems to me M1 already has several tasks it is
>>> waking as affine, so waking s8 affine as well could be harmful, and
>>> it may be a better choice to wake it up elsewhere.
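>>>
>>> To make the arithmetic in this example explicit, here is a tiny
>>> standalone toy (not kernel code) plugging in the numbers above:
>>>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>>         unsigned int master = 8, slave = 4, factor = 3; /* M1, s8, llc_size */
>>>
>>>         /* slave < factor is 4 < 3 -> false; master < slave * factor is
>>>          * 8 < 12 -> true, so the heuristic picks the affine path. */
>>>         if (slave < factor || master < slave * factor)
>>>                 printf("wake_wide() == 0 -> affine wakeup\n");
>>>         else
>>>                 printf("wake_wide() == 1 -> wide wakeup\n");
>>>         return 0;
>>> }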
>>>
>>> Thanks for your help!
>>>
>>> -Joel
>>>
>>> [1] https://lkml.org/lkml/2013/7/4/20
>>>
>>>
>>>>
>>>> Something else to consider: network interrupt waking multiple workers
>>>> at high frequency. If the waking CPU is idle, do you really want to
>>>> place a worker directly in front of a tattoo artist, or is it better
>>>> off nearly anywhere but there?
>>>>
>>>> If the box is virtual, with no topology exposed (or real but ancient)
>>>> to let select_idle_sibling() come to the rescue, two workers can even
>>>> get tattooed simultaneously (see sync wakeup).
>>>>
>>>> -Mike