Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Li, Aubrey
Date: Thu Jul 25 2019 - 17:43:03 EST


On 2019/7/25 22:30, Aaron Lu wrote:
> On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
>> The granularity period of util_avg seems too large to decide task priority
>> during pick_task(). At least in my case, cfs_prio_less() always picked the
>> core max task, so pick_task() eventually picked idle, which makes this change
>> not very helpful for my case.
>>
>> <idle>-0 [057] dN.. 83.716973: __schedule: max: sysbench/2578
>> ffff889050f68600
>> <idle>-0 [057] dN.. 83.716974: __schedule:
>> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>> <idle>-0 [057] dN.. 83.716975: __schedule:
>> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>> <idle>-0 [057] dN.. 83.716975: cfs_prio_less: picked
>> sysbench/2578 util_avg: 20 527 -507 <======= here===
>> <idle>-0 [057] dN.. 83.716976: __schedule: pick_task cookie
>> pick swapper/5/0 ffff889050f68600
>
> I tried a different approach based on vruntime with 3 patches following.
>
> When the two tasks are on the same CPU, nothing changes: I still walk
> the two sched entities up until they are in the same group (cfs_rq) and
> then do the vruntime comparison.
>
> When the two tasks are on different threads of the same core, the root
> level sched_entities to which the two tasks belong will be used to do
> the comparison.
>
> An ugly illustration for the cross CPU case:
>
>    cpu0         cpu1
>  /   |  \     /   |  \
> se1 se2 se3  se4 se5 se6
>     /  \            /   \
>   se21 se22       se61  se62
>
> Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> task B's se is se61. To compare priority of task A and B, we compare
> priority of se2 and se6. The smaller vruntime wins.
>
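
The comparison rule described above can be modelled in a few lines. This is
only a rough user-space sketch, not the patch code: SchedEntity, root_se() and
cross_cpu_prio_less() are hypothetical names, the vruntime values are made up,
and it assumes both root level vruntimes are already relative to a common core
min vruntime and ignores u64 wraparound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedEntity:
    """Toy stand-in for a sched entity (hypothetical, not the kernel struct)."""
    name: str
    vruntime: int
    parent: Optional["SchedEntity"] = None


def root_se(se: SchedEntity) -> SchedEntity:
    """Walk up the parent links until the root level entity is reached."""
    while se.parent is not None:
        se = se.parent
    return se


def cross_cpu_prio_less(a: SchedEntity, b: SchedEntity) -> bool:
    """Return True if a has lower priority than b.

    Only the root level entities are compared, and the smaller vruntime wins.
    Assumes the two root vruntimes share a common baseline.
    """
    return root_se(a).vruntime > root_se(b).vruntime


# Illustrative values only: task A is se21 under se2 (cpu0),
# task B is se61 under se6 (cpu1).
se2 = SchedEntity("se2", vruntime=100)
se6 = SchedEntity("se6", vruntime=250)
se21 = SchedEntity("se21", vruntime=9999, parent=se2)
se61 = SchedEntity("se61", vruntime=1, parent=se6)

# se2 has the smaller vruntime, so task A wins even though se61's own
# vruntime is far smaller than se21's.
a_wins = cross_cpu_prio_less(se61, se21)
```

The leaf vruntimes are deliberately skewed in the other direction to show that
only the root level entities matter in the cross CPU case.
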
> To make this work, the root level ses on both CPUs should have a common
> cfs_rq min vruntime, which I call the core cfs_rq min vruntime.
>
> This is mostly done in patch2/3.
>
> Test:
> 1 wrote a cpu intensive program that does nothing but while(1) in
> main(), let's call it cpuhog;
> 2 start 2 cgroups, with one cgroup's cpuset binding to cpu2 and the
> other binding to cpu3. cpu2 and cpu3 are smt siblings on the test VM;
> 3 enable cpu.tag for the two cgroups;
> 4 start one cpuhog task in each cgroup;
> 5 kill both cpuhog tasks after 10 seconds;
> 6 check each cgroup's cpu usage.
>
> If the tasks are scheduled fairly, then each cgroup's cpu usage should be
> around 5s.
>
> With v3, the cpu usage of the two cgroups is sometimes 3s and 7s,
> sometimes 1s and 9s.
>
> With the 3 patches applied, the numbers are mostly around 5s, 5s.
>
> Another test is starting two cgroups simultaneously with cpu.tag set,
> with one cgroup running: will-it-scale/page_fault1_processes -t 16 -s 30,
> the other running: will-it-scale/page_fault2_processes -t 16 -s 30.
> With v3, like I said last time, the later started page_fault processes
> can't start running. With the 3 patches applied, both run at the
> same time with each CPU having a relatively fair score:
>
> output line of 16 page_fault1 processes in 1 second interval:
> min:105225 max:131716 total:1872322
>
> output line of 16 page_fault2 processes in 1 second interval:
> min:86797 max:110554 total:1581177
>
> Note the values of min and max: the smaller the gap is, the better the
> fairness is.
>
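
One simple way to put a number on that gap is the ratio of the smaller value
to the larger one, where 1.0 means perfectly even. This is just an illustrative
metric, not something the test tools report: fairness_ratio() is a hypothetical
helper, and the cpuhog figures are the approximate seconds quoted above.

```python
def fairness_ratio(lo: float, hi: float) -> float:
    """Return lo / hi; 1.0 means both sides got the same share."""
    if hi <= 0:
        raise ValueError("hi must be positive")
    return lo / hi


# min/max from the will-it-scale output lines quoted above.
page_fault1 = fairness_ratio(105225, 131716)
page_fault2 = fairness_ratio(86797, 110554)

# Approximate cpuhog cpu usage in seconds: the two v3 runs (3s/7s and
# 1s/9s) and the patched run (about 5s/5s).
cpuhog_v3_runs = [fairness_ratio(3, 7), fairness_ratio(1, 9)]
cpuhog_patched = fairness_ratio(5, 5)

print(f"page_fault1: {page_fault1:.2f}  page_fault2: {page_fault2:.2f}")
print("cpuhog v3:", [f"{r:.2f}" for r in cpuhog_v3_runs],
      f"patched: {cpuhog_patched:.2f}")
```
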
> Aubrey,
>
> I haven't been able to run your workload yet...
>

No worry, let me try to see how it works.

Thanks,
-Aubrey