Re: tree rcu: call_rcu scalability problem?

From: Nick Piggin
Date: Wed Sep 02 2009 - 08:28:03 EST


On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> Hi Paul,
>
> I'm testing out scalability of some vfs code paths, and I'm seeing
> a problem with call_rcu. This is a 2s8c opteron system, so nothing
> crazy.
>
> I'll show you the profile results for 1-8 threads:
>
> 1:
> 29768 total 0.0076
> 15550 default_idle 48.5938
> 1340 __d_lookup 3.6413
> 954 __link_path_walk 0.2559
> 816 system_call_after_swapgs 8.0792
> 680 kmem_cache_alloc 1.4167
> 669 dput 1.1946
> 591 __call_rcu 2.0521
>
> 2:
> 56733 total 0.0145
> 20074 default_idle 62.7313
> 3075 __call_rcu 10.6771
> 2650 __d_lookup 7.2011
> 2019 dput 3.6054
>
> 4:
> 98889 total 0.0253
> 21759 default_idle 67.9969
> 10994 __call_rcu 38.1736
> 5185 __d_lookup 14.0897
> 4475 dput 7.9911
>
> 8:
> 170391 total 0.0437
> 31815 __call_rcu 110.4688
> 12958 dput 23.1393
> 10417 __d_lookup 28.3071
>
> Of course there are other scalability factors involved too, but
> __call_rcu is taking 54 times more CPU to do 8 times the amount
> of work from 1-8 threads, or a factor of 6.7 slowdown.
>
> This is with tree RCU.
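
For reference, the arithmetic behind that 6.7 figure can be written as a small C sketch. The helper name slowdown_per_op is made up for illustration, and it assumes (as the quoted text does) that the total amount of work grows linearly with the number of threads.

```c
#include <assert.h>

/*
 * Per-operation slowdown relative to the single-threaded run.
 *
 * samples_1: profile samples for the function with 1 thread
 * samples_n: profile samples for the function with nthreads threads
 *
 * Assumption of this sketch: nthreads threads do nthreads times the work,
 * so the CPU ratio divided by nthreads is the extra cost per operation.
 */
double slowdown_per_op(long samples_1, long samples_n, int nthreads)
{
	assert(samples_1 > 0 && nthreads > 0);
	return ((double)samples_n / (double)samples_1) / (double)nthreads;
}
```

With the __call_rcu sample counts above, slowdown_per_op(591, 31815, 8) comes to about 6.7; the 2- and 4-thread profiles give roughly 2.6 and 4.65 by the same measure.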

It seems like well over half of the cost (about 57% of the samples) is here:
/* Add the callback to our list. */
*rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

That is, in loading the pointer to the next tail pointer, if I'm reading the
profile correctly. I can't see why that should be a problem, though...
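
To make the two lines above concrete, here is a minimal userspace sketch of a tail-pointer callback list. It is not the kernel code: struct cb_head, struct cb_queue, cb_queue_init and cb_enqueue are hypothetical names, and the queue keeps a single tail pointer where the real per-CPU structure keeps an array of them (nxttail[]).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a callback head: next pointer first, then the function. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical single-segment queue with one tail pointer. */
struct cb_queue {
	struct cb_head *list;   /* first queued callback, or NULL */
	struct cb_head **tail;  /* points at the last ->next (or at list) */
	long qlen;              /* number of queued callbacks */
};

void cb_queue_init(struct cb_queue *q)
{
	assert(q != NULL);
	q->list = NULL;
	q->tail = &q->list;
	q->qlen = 0;
}

void cb_enqueue(struct cb_queue *q, struct cb_head *head,
		void (*func)(struct cb_head *head))
{
	assert(q != NULL && head != NULL);
	head->func = func;
	head->next = NULL;

	/* Add the callback to the list. */
	*q->tail = head;        /* load the tail pointer, then store through it */
	q->tail = &head->next;  /* the new element's ->next becomes the tail */
	q->qlen++;
}
```

The first of the two stores has to load q->tail before it can write through it; that load corresponds to the instruction flagged in the profile.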

ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
697 0.2172 :ffffffff8107dee0: push %r12
228 0.0710 :ffffffff8107dee2: push %rbp
133 0.0414 :ffffffff8107dee3: mov %rdx,%rbp
918 0.2860 :ffffffff8107dee6: push %rbx
316 0.0985 :ffffffff8107dee7: mov %rsi,0x8(%rdi)
257 0.0801 :ffffffff8107deeb: movq $0x0,(%rdi)
1660 0.5172 :ffffffff8107def2: mfence
27730 8.6394 :ffffffff8107def5: pushfq
13153 4.0979 :ffffffff8107def6: pop %r12
903 0.2813 :ffffffff8107def8: cli
2562 0.7982 :ffffffff8107def9: mov %gs:0xde68,%eax
1784 0.5558 :ffffffff8107df01: cltq
:ffffffff8107df03: mov 0x60(%rdx,%rax,8),%rbx
:ffffffff8107df08: pushfq
3494 1.0886 :ffffffff8107df09: pop %rdx
896 0.2792 :ffffffff8107df0a: cli
2655 0.8272 :ffffffff8107df0b: mov 0xd0(%rbp),%rcx
1800 0.5608 :ffffffff8107df12: cmp (%rbx),%rcx
21 0.0065 :ffffffff8107df15: je ffffffff8107df32 <__call_rcu+0x52
:ffffffff8107df17: mov 0x40(%rbx),%rax
81 0.0252 :ffffffff8107df1b: mov %rcx,(%rbx)
3 9.3e-04 :ffffffff8107df1e: mov %rax,0x38(%rbx)
:ffffffff8107df22: mov 0x48(%rbx),%rax
:ffffffff8107df26: mov %rax,0x40(%rbx)
:ffffffff8107df2a: mov 0x50(%rbx),%rax
:ffffffff8107df2e: mov %rax,0x48(%rbx)
:ffffffff8107df32: push %rdx
1194 0.3720 :ffffffff8107df33: popfq
9518 2.9654 :ffffffff8107df34: pushfq
4179 1.3020 :ffffffff8107df35: pop %rdx
1277 0.3979 :ffffffff8107df36: cli
2546 0.7932 :ffffffff8107df37: mov 0xc8(%rbp),%rax
1748 0.5446 :ffffffff8107df3e: cmp %rax,0x8(%rbx)
5 0.0016 :ffffffff8107df42: je ffffffff8107df57 <__call_rcu+0x77
:ffffffff8107df44: movb $0x1,0x19(%rbx)
2 6.2e-04 :ffffffff8107df48: movb $0x0,0x18(%rbx)
:ffffffff8107df4c: mov 0xc8(%rbp),%rax
:ffffffff8107df53: mov %rax,0x8(%rbx)
921 0.2869 :ffffffff8107df57: push %rdx
151 0.0470 :ffffffff8107df58: popfq
183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
2 6.2e-04 :ffffffff8107df60: mov %rdi,0x50(%rbx)
18 0.0056 :ffffffff8107df64: mov 0xd0(%rbp),%rdx
940 0.2929 :ffffffff8107df6b: mov 0xc8(%rbp),%rax
15 0.0047 :ffffffff8107df72: cmp %rax,%rdx
1 3.1e-04 :ffffffff8107df75: je ffffffff8107dfb0 <__call_rcu+0xd0
787 0.2452 :ffffffff8107df77: mov 0x58(%rbx),%rax
58 0.0181 :ffffffff8107df7b: inc %rax
2 6.2e-04 :ffffffff8107df7e: mov %rax,0x58(%rbx)
1679 0.5231 :ffffffff8107df82: movslq 0x4988fb(%rip),%rdx # ffff
40 0.0125 :ffffffff8107df89: cmp %rdx,%rax
5 0.0016 :ffffffff8107df8c: jg ffffffff8107dfd7 <__call_rcu+0xf7
588 0.1832 :ffffffff8107df8e: mov 0xe0(%rbp),%rdx
84 0.0262 :ffffffff8107df95: mov 0x51f924(%rip),%rax # ffff
5 0.0016 :ffffffff8107df9c: cmp %rax,%rdx
505 0.1573 :ffffffff8107df9f: js ffffffff8107dfc8 <__call_rcu+0xe8
17580 5.4771 :ffffffff8107dfa1: push %r12
1671 0.5206 :ffffffff8107dfa3: popfq
24201 7.5399 :ffffffff8107dfa4: pop %rbx
1367 0.4259 :ffffffff8107dfa5: pop %rbp
377 0.1175 :ffffffff8107dfa6: pop %r12
:ffffffff8107dfa8: retq
:ffffffff8107dfa9: nopl 0x0(%rax)
:ffffffff8107dfb0: mov %rbp,%rdi
:ffffffff8107dfb3: callq ffffffff813be930 <_spin_lock_irqs
12 0.0037 :ffffffff8107dfb8: mov %rbp,%rdi
:ffffffff8107dfbb: mov %rax,%rsi
:ffffffff8107dfbe: callq ffffffff8107d8e0 <rcu_start_gp>
:ffffffff8107dfc3: jmp ffffffff8107df77 <__call_rcu+0x97
:ffffffff8107dfc5: nopl (%rax)
:ffffffff8107dfc8: mov $0x1,%esi
10 0.0031 :ffffffff8107dfcd: mov %rbp,%rdi
:ffffffff8107dfd0: callq ffffffff8107dd50 <force_quiescent
1 3.1e-04 :ffffffff8107dfd5: jmp ffffffff8107dfa1 <__call_rcu+0xc1
451 0.1405 :ffffffff8107dfd7: mov $0x7fffffffffffffff,%rdx
411 0.1280 :ffffffff8107dfe1: xor %esi,%esi
:ffffffff8107dfe3: mov %rbp,%rdi
:ffffffff8107dfe6: mov %rdx,0x60(%rbx)
317 0.0988 :ffffffff8107dfea: callq ffffffff8107dd50 <force_quiescent
4510 1.4051 :ffffffff8107dfef: jmp ffffffff8107dfa1 <__call_rcu+0xc1
:ffffffff8107dff1: nopw %cs:0x0(%rax,%rax,1)

