Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and

From: Yuan,Zhaoxiong
Date: Fri Apr 30 2021 - 02:38:54 EST


> On 2021/4/19, 5:57 PM, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx> wrote:

> On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
>> On a 128-core AMD machine, 8 cores are in nohz_full mode and the
>> others are used for housekeeping. When many housekeeping CPUs are
>> idle, ftrace shows a large amount of time spent in the loop that
>> searches for the nearest busy housekeeping CPU.
>>
>>  9)               |  get_nohz_timer_target() {
>>  9)               |    housekeeping_test_cpu() {
>>  9)   0.390 us    |      housekeeping_get_mask.part.1();
>>  9)   0.561 us    |    }
>>  9)   0.090 us    |    __rcu_read_lock();
>>  9)   0.090 us    |    housekeeping_cpumask();
>>  9)   0.521 us    |    housekeeping_cpumask();
>>  9)   0.140 us    |    housekeeping_cpumask();
>>
>>  ...
>>
>>  9)   0.500 us    |    housekeeping_cpumask();
>>  9)               |    housekeeping_any_cpu() {
>>  9)   0.090 us    |      housekeeping_get_mask.part.1();
>>  9)   0.100 us    |      sched_numa_find_closest();
>>  9)   0.491 us    |    }
>>  9)   0.100 us    |    __rcu_read_unlock();
>>  9) + 76.163 us   |  }
>>
>> for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
>>     for_each_cpu_and(i, sched_domain_span(sd),
>>                      housekeeping_cpumask(HK_FLAG_TIMER))
>> expands to:
>>     for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>>                  housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
>> As a result, housekeeping_cpumask() is invoked on every iteration.
>> housekeeping_cpumask() returns a constant value, so there is no need
>> to invoke it each time. In my testing, this patch reduces the
>> worst-case search time from ~76us to ~16us.
>>
>> Similarly, the find_new_ilb() function has the same problem.
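
The re-evaluation described above can be reproduced outside the kernel.
The following is only a userspace sketch under stated assumptions: hk_cpumask(),
next_and(), for_each_and() and the 64-bit masks are hypothetical stand-ins,
not the kernel's definitions. It counts how often the mask lookup runs when
it is passed as a macro argument versus looked up once before the loop.

```c
#include <stdint.h>

#define NR_CPUS 64  /* hypothetical CPU count for this sketch */

static uint64_t hk_mask = 0xffULL;  /* hypothetical: CPUs 0..7 are housekeepers */
static unsigned int hk_calls;       /* how many times hk_cpumask() has run */

/* Stand-in for housekeeping_cpumask(): always returns the same pointer. */
static const uint64_t *hk_cpumask(void)
{
	hk_calls++;
	return &hk_mask;
}

/* Stand-in for cpumask_next_and(): next CPU after n that is set in both masks. */
static int next_and(int n, const uint64_t *a, const uint64_t *b)
{
	uint64_t both = *a & *b;

	for (n++; n < NR_CPUS; n++)
		if (both & (1ULL << n))
			return n;
	return NR_CPUS;
}

/* Same shape as the expansion quoted above: the masks sit in the loop header. */
#define for_each_and(i, m1, m2) \
	for ((i) = -1; (i) = next_and((i), (m1), (m2)), (i) < NR_CPUS;)

/* Mask call passed as a macro argument: re-evaluated on every iteration. */
static unsigned int calls_unhoisted(uint64_t span)
{
	int i;

	hk_calls = 0;
	for_each_and(i, &span, hk_cpumask())
		(void)i;
	return hk_calls;
}

/* Mask looked up once before the loop and reused. */
static unsigned int calls_hoisted(uint64_t span)
{
	const uint64_t *hk;
	int i;

	hk_calls = 0;
	hk = hk_cpumask();
	for_each_and(i, &span, hk)
		(void)i;
	return hk_calls;
}
```

With a span covering all 64 CPUs, calls_unhoisted(~0ULL) counts 9 lookups
(8 matching CPUs plus the terminating one) and calls_hoisted(~0ULL) counts 1.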

> Would it not make sense to mark housekeeping_cpumask() __pure instead?

> After marking housekeeping_cpumask() __pure and testing again, the results
> show that the large time cost of the loop that searches for the nearest busy
> housekeeping CPU is still there.
>
> Disassembling get_nohz_timer_target() with objdump -D vmlinux gives:
> ffffffff810b96c0 <get_nohz_timer_target>:
> ffffffff810b96c0: e8 db 7f 94 00 callq ffffffff81a016a0 <__fentry__>
> ffffffff810b96c5: 41 57 push %r15
> ffffffff810b96c7: 41 56 push %r14
> ffffffff810b96c9: 41 55 push %r13
> ffffffff810b96cb: 41 54 push %r12
> ffffffff810b96cd: 55 push %rbp
> ffffffff810b96ce: 53 push %rbx
> ffffffff810b96cf: 48 83 ec 08 sub $0x8,%rsp
> ffffffff810b96d3: 65 8b 1d 56 5a f5 7e mov %gs:0x7ef55a56(%rip),%ebx # f130 <cpu_number>
> ffffffff810b96da: 41 89 dc mov %ebx,%r12d
> ffffffff810b96dd: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> ffffffff810b96e2: 4c 63 f3 movslq %ebx,%r14
> ffffffff810b96e5: 48 c7 c5 40 0b 02 00 mov $0x20b40,%rbp
> ffffffff810b96ec: 4a 8b 04 f5 20 77 13 mov -0x7dec88e0(,%r14,8),%rax
> ffffffff810b96f3: 82
> ffffffff810b96f4: 49 89 ed mov %rbp,%r13
> ffffffff810b96f7: 4c 01 e8 add %r13,%rax
> ffffffff810b96fa: 48 8b 88 90 09 00 00 mov 0x990(%rax),%rcx
> ffffffff810b9701: 48 39 88 88 09 00 00 cmp %rcx,0x988(%rax)
> ffffffff810b9708: 0f 84 ce 00 00 00 je ffffffff810b97dc <get_nohz_timer_target+0x11c>
> ffffffff810b970e: 48 83 c4 08 add $0x8,%rsp
> ffffffff810b9712: 44 89 e0 mov %r12d,%eax
> ffffffff810b9715: 5b pop %rbx
> ffffffff810b9716: 5d pop %rbp
> ffffffff810b9717: 41 5c pop %r12
> ffffffff810b9719: 41 5d pop %r13
> ffffffff810b971b: 41 5e pop %r14
> ffffffff810b971d: 41 5f pop %r15
> ffffffff810b971f: c3 retq
> ffffffff810b9720: be 01 00 00 00 mov $0x1,%esi
> ffffffff810b9725: 89 df mov %ebx,%edi
> ffffffff810b9727: e8 74 87 02 00 callq ffffffff810e1ea0 <housekeeping_test_cpu>
> ffffffff810b972c: 84 c0 test %al,%al
> ffffffff810b972e: 75 b2 jne ffffffff810b96e2 <get_nohz_timer_target+0x22>
> ffffffff810b9730: e8 0b ea 03 00 callq ffffffff810f8140 <__rcu_read_lock>
> ffffffff810b9735: 48 c7 c5 40 0b 02 00 mov $0x20b40,%rbp
> ffffffff810b973c: 48 63 d3 movslq %ebx,%rdx
> ffffffff810b973f: c7 44 24 04 ff ff ff movl $0xffffffff,0x4(%rsp)
> ffffffff810b9746: ff
> ffffffff810b9747: 48 89 e8 mov %rbp,%rax
> ffffffff810b974a: 48 03 04 d5 20 77 13 add -0x7dec88e0(,%rdx,8),%rax
> ffffffff810b9751: 82
> ffffffff810b9752: 4c 8b a8 d8 09 00 00 mov 0x9d8(%rax),%r13
> ffffffff810b9759: 4d 85 ed test %r13,%r13
> ffffffff810b975c: 0f 84 d3 00 00 00 je ffffffff810b9835 <get_nohz_timer_target+0x175>
> ffffffff810b9762: 41 be ff ff ff ff mov $0xffffffff,%r14d
> ffffffff810b9768: 4d 8d a5 38 01 00 00 lea 0x138(%r13),%r12
> ffffffff810b976f: 45 89 f7 mov %r14d,%r15d
> ffffffff810b9772: bf 01 00 00 00 mov $0x1,%edi
> ffffffff810b9777: e8 f4 86 02 00 callq ffffffff810e1e70 <housekeeping_cpumask>
> ffffffff810b977c: 44 89 ff mov %r15d,%edi
> ffffffff810b977f: 48 89 c2 mov %rax,%rdx
> ffffffff810b9782: 4c 89 e6 mov %r12,%rsi
> ffffffff810b9785: e8 b6 ea 79 00 callq ffffffff81858240 <cpumask_next_and>
> ffffffff810b978a: 3b 05 b4 4e 3e 01 cmp 0x13e4eb4(%rip),%eax # ffffffff8249e644 <nr_cpu_ids>
> ffffffff810b9790: 41 89 c7 mov %eax,%r15d
> ffffffff810b9793: 0f 83 84 00 00 00 jae ffffffff810b981d <get_nohz_timer_target+0x15d>
> ffffffff810b9799: 44 39 fb cmp %r15d,%ebx
> ffffffff810b979c: 74 d4 je ffffffff810b9772 <get_nohz_timer_target+0xb2>
> ffffffff810b979e: 49 63 c7 movslq %r15d,%rax
> ffffffff810b97a1: 48 89 ea mov %rbp,%rdx
> ffffffff810b97a4: 48 03 14 c5 20 77 13 add -0x7dec88e0(,%rax,8),%rdx
> ffffffff810b97ab: 82
> ffffffff810b97ac: 48 8b 82 90 09 00 00 mov 0x990(%rdx),%rax
> ffffffff810b97b3: 48 39 82 88 09 00 00 cmp %rax,0x988(%rdx)
> ffffffff810b97ba: 75 13 jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b97bc: 8b 42 04 mov 0x4(%rdx),%eax
> ffffffff810b97bf: 85 c0 test %eax,%eax
> ffffffff810b97c1: 75 0c jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b97c3: 48 8b 82 20 0c 00 00 mov 0xc20(%rdx),%rax
> ffffffff810b97ca: 48 85 c0 test %rax,%rax
> ffffffff810b97cd: 74 a3 je ffffffff810b9772 <get_nohz_timer_target+0xb2>
> ffffffff810b97cf: e8 1c 33 04 00 callq ffffffff810fcaf0 <__rcu_read_unlock>
> ffffffff810b97d4: 45 89 fc mov %r15d,%r12d
> ffffffff810b97d7: e9 32 ff ff ff jmpq ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97dc: 8b 50 04 mov 0x4(%rax),%edx
> ffffffff810b97df: 85 d2 test %edx,%edx
> ffffffff810b97e1: 0f 85 27 ff ff ff jne ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97e7: 48 8b 80 20 0c 00 00 mov 0xc20(%rax),%rax
> ffffffff810b97ee: 48 85 c0 test %rax,%rax
> ffffffff810b97f1: 0f 85 17 ff ff ff jne ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97f7: e8 44 e9 03 00 callq ffffffff810f8140 <__rcu_read_lock>
> ffffffff810b97fc: 4e 03 2c f5 20 77 13 add -0x7dec88e0(,%r14,8),%r13
> ffffffff810b9803: 82
> ffffffff810b9804: 89 5c 24 04 mov %ebx,0x4(%rsp)
> ffffffff810b9808: 41 89 df mov %ebx,%r15d
> ffffffff810b980b: 4d 8b ad d8 09 00 00 mov 0x9d8(%r13),%r13
> ffffffff810b9812: 4d 85 ed test %r13,%r13
> ffffffff810b9815: 0f 85 47 ff ff ff jne ffffffff810b9762 <get_nohz_timer_target+0xa2>
> ffffffff810b981b: eb 12 jmp ffffffff810b982f <get_nohz_timer_target+0x16f>
> ffffffff810b981d: 4d 8b 6d 00 mov 0x0(%r13),%r13
> ffffffff810b9821: 4d 85 ed test %r13,%r13
> ffffffff810b9824: 0f 85 3e ff ff ff jne ffffffff810b9768 <get_nohz_timer_target+0xa8>
> ffffffff810b982a: 44 8b 7c 24 04 mov 0x4(%rsp),%r15d
> ffffffff810b982f: 41 83 ff ff cmp $0xffffffff,%r15d
> ffffffff810b9833: 75 9a jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b9835: bf 01 00 00 00 mov $0x1,%edi
> ffffffff810b983a: e8 91 86 02 00 callq ffffffff810e1ed0 <housekeeping_any_cpu>
> ffffffff810b983f: 41 89 c7 mov %eax,%r15d
> ffffffff810b9842: eb 8b jmp ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b9844: 66 90 xchg %ax,%ax
> ffffffff810b9846: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
> ffffffff810b984d: 00 00 00
>
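> The disassembly shows that the __pure annotation has no effect here: the
> callq to housekeeping_cpumask at ffffffff810b9777 is still inside the loop
> that the jumps at ffffffff810b979c and ffffffff810b97cd branch back to.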
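
One possible reason, offered as an assumption and not verified against this
build: a function marked pure may still read global memory, so the compiler
can only reuse its result while it can show that memory is unchanged. A call
it cannot see through between two pure calls (in the loop above,
cpumask_next_and()) may be enough to force a fresh call. The sketch below
uses hypothetical names (flags_word, pick_mask(), opaque_call()) to show why
such a reuse would be unsafe in general.

```c
/* Hypothetical globals standing in for state a mask lookup might read. */
static unsigned long flags_word = 1UL;
static const unsigned long mask_a = 0xffUL;
static const unsigned long mask_b = ~0UL;

/* pure: no side effects, but the result depends on flags_word in memory. */
__attribute__((pure))
static const unsigned long *pick_mask(unsigned long flag)
{
	return (flags_word & flag) ? &mask_a : &mask_b;
}

/* Stand-in for an external call whose body the compiler cannot see. */
__attribute__((noinline))
static void opaque_call(void)
{
	flags_word = 0UL;
}

/* Returns 1 if pick_mask() gives a different answer after the call. */
static int result_changes_across_call(void)
{
	const unsigned long *before, *after;

	flags_word = 1UL;
	before = pick_mask(1UL);
	opaque_call();
	after = pick_mask(1UL);
	return before != after;
}
```

Here the second pick_mask() call must not be folded into the first, because
the intervening call changed the memory it reads. Hoisting the lookup by hand,
as the patch does, avoids relying on the compiler to prove otherwise.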

So far the __pure annotation has not helped in our tests. Should the patch be merged into mainline?

Thanks,
Yuan ZhaoXiong