Re: tree rcu: call_rcu scalability problem?

From: Paul E. McKenney
Date: Wed Sep 02 2009 - 11:59:20 EST


On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > Hi Paul,
> >
> > I'm testing out scalability of some vfs code paths, and I'm seeing
> > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > crazy.
> >
> > I'll show you the profile results for 1-8 threads:
> >
> > 1:
> > 29768 total 0.0076
> > 15550 default_idle 48.5938
> > 1340 __d_lookup 3.6413
> > 954 __link_path_walk 0.2559
> > 816 system_call_after_swapgs 8.0792
> > 680 kmem_cache_alloc 1.4167
> > 669 dput 1.1946
> > 591 __call_rcu 2.0521
> >
> > 2:
> > 56733 total 0.0145
> > 20074 default_idle 62.7313
> > 3075 __call_rcu 10.6771
> > 2650 __d_lookup 7.2011
> > 2019 dput 3.6054
> >
> > 4:
> > 98889 total 0.0253
> > 21759 default_idle 67.9969
> > 10994 __call_rcu 38.1736
> > 5185 __d_lookup 14.0897
> > 4475 dput 7.9911

Four threads run on one socket, but eight threads run on two sockets,
I take it?

> > 8:
> > 170391 total 0.0437
> > 31815 __call_rcu 110.4688
> > 12958 dput 23.1393
> > 10417 __d_lookup 28.3071
> >
> > Of course there are other scalability factors involved too, but
> > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > of work from 1-8 threads, or a factor of 6.7 slowdown.
> >
> > This is with tree RCU.
>
> It seems like nearly 2/3 of the cost is here:
> /* Add the callback to our list. */
> *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

Hmmm... That certainly is not the first line of code in call_rcu() that
would come to mind...

> In loading the pointer to the next tail pointer. If I'm reading the profile
> correctly. Can't see why that should be a problem though...

The usual diagnosis would be false sharing.

Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
expect interference from force_quiescent_state(), except that it should
run only every few clock ticks. So this seems quite unlikely.

Could you please try padding the beginning and end of struct rcu_data
with a few hundred bytes and rerunning? Just in case there is a shared
per-CPU variable either before or after rcu_data in your memory layout?

Thanx, Paul

> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> 697 0.2172 :ffffffff8107dee0: push %r12
> 228 0.0710 :ffffffff8107dee2: push %rbp
> 133 0.0414 :ffffffff8107dee3: mov %rdx,%rbp
> 918 0.2860 :ffffffff8107dee6: push %rbx
> 316 0.0985 :ffffffff8107dee7: mov %rsi,0x8(%rdi)
> 257 0.0801 :ffffffff8107deeb: movq $0x0,(%rdi)
> 1660 0.5172 :ffffffff8107def2: mfence
> 27730 8.6394 :ffffffff8107def5: pushfq
> 13153 4.0979 :ffffffff8107def6: pop %r12
> 903 0.2813 :ffffffff8107def8: cli
> 2562 0.7982 :ffffffff8107def9: mov %gs:0xde68,%eax
> 1784 0.5558 :ffffffff8107df01: cltq
> :ffffffff8107df03: mov 0x60(%rdx,%rax,8),%rbx
> :ffffffff8107df08: pushfq
> 3494 1.0886 :ffffffff8107df09: pop %rdx
> 896 0.2792 :ffffffff8107df0a: cli
> 2655 0.8272 :ffffffff8107df0b: mov 0xd0(%rbp),%rcx
> 1800 0.5608 :ffffffff8107df12: cmp (%rbx),%rcx
> 21 0.0065 :ffffffff8107df15: je ffffffff8107df32 <__call_rcu+0x52
> :ffffffff8107df17: mov 0x40(%rbx),%rax
> 81 0.0252 :ffffffff8107df1b: mov %rcx,(%rbx)
> 3 9.3e-04 :ffffffff8107df1e: mov %rax,0x38(%rbx)
> :ffffffff8107df22: mov 0x48(%rbx),%rax
> :ffffffff8107df26: mov %rax,0x40(%rbx)
> :ffffffff8107df2a: mov 0x50(%rbx),%rax
> :ffffffff8107df2e: mov %rax,0x48(%rbx)
> :ffffffff8107df32: push %rdx
> 1194 0.3720 :ffffffff8107df33: popfq
> 9518 2.9654 :ffffffff8107df34: pushfq
> 4179 1.3020 :ffffffff8107df35: pop %rdx
> 1277 0.3979 :ffffffff8107df36: cli
> 2546 0.7932 :ffffffff8107df37: mov 0xc8(%rbp),%rax
> 1748 0.5446 :ffffffff8107df3e: cmp %rax,0x8(%rbx)
> 5 0.0016 :ffffffff8107df42: je ffffffff8107df57 <__call_rcu+0x77
> :ffffffff8107df44: movb $0x1,0x19(%rbx)
> 2 6.2e-04 :ffffffff8107df48: movb $0x0,0x18(%rbx)
> :ffffffff8107df4c: mov 0xc8(%rbp),%rax
> :ffffffff8107df53: mov %rax,0x8(%rbx)
> 921 0.2869 :ffffffff8107df57: push %rdx
> 151 0.0470 :ffffffff8107df58: popfq
> 183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
> 995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
> 2 6.2e-04 :ffffffff8107df60: mov %rdi,0x50(%rbx)
> 18 0.0056 :ffffffff8107df64: mov 0xd0(%rbp),%rdx
> 940 0.2929 :ffffffff8107df6b: mov 0xc8(%rbp),%rax
> 15 0.0047 :ffffffff8107df72: cmp %rax,%rdx
> 1 3.1e-04 :ffffffff8107df75: je ffffffff8107dfb0 <__call_rcu+0xd0
> 787 0.2452 :ffffffff8107df77: mov 0x58(%rbx),%rax
> 58 0.0181 :ffffffff8107df7b: inc %rax
> 2 6.2e-04 :ffffffff8107df7e: mov %rax,0x58(%rbx)
> 1679 0.5231 :ffffffff8107df82: movslq 0x4988fb(%rip),%rdx # ffff
> 40 0.0125 :ffffffff8107df89: cmp %rdx,%rax
> 5 0.0016 :ffffffff8107df8c: jg ffffffff8107dfd7 <__call_rcu+0xf7
> 588 0.1832 :ffffffff8107df8e: mov 0xe0(%rbp),%rdx
> 84 0.0262 :ffffffff8107df95: mov 0x51f924(%rip),%rax # ffff
> 5 0.0016 :ffffffff8107df9c: cmp %rax,%rdx
> 505 0.1573 :ffffffff8107df9f: js ffffffff8107dfc8 <__call_rcu+0xe8
> 17580 5.4771 :ffffffff8107dfa1: push %r12
> 1671 0.5206 :ffffffff8107dfa3: popfq
> 24201 7.5399 :ffffffff8107dfa4: pop %rbx
> 1367 0.4259 :ffffffff8107dfa5: pop %rbp
> 377 0.1175 :ffffffff8107dfa6: pop %r12
> :ffffffff8107dfa8: retq
> :ffffffff8107dfa9: nopl 0x0(%rax)
> :ffffffff8107dfb0: mov %rbp,%rdi
> :ffffffff8107dfb3: callq ffffffff813be930 <_spin_lock_irqs
> 12 0.0037 :ffffffff8107dfb8: mov %rbp,%rdi
> :ffffffff8107dfbb: mov %rax,%rsi
> :ffffffff8107dfbe: callq ffffffff8107d8e0 <rcu_start_gp>
> :ffffffff8107dfc3: jmp ffffffff8107df77 <__call_rcu+0x97
> :ffffffff8107dfc5: nopl (%rax)
> :ffffffff8107dfc8: mov $0x1,%esi
> 10 0.0031 :ffffffff8107dfcd: mov %rbp,%rdi
> :ffffffff8107dfd0: callq ffffffff8107dd50 <force_quiescent
> 1 3.1e-04 :ffffffff8107dfd5: jmp ffffffff8107dfa1 <__call_rcu+0xc1
> 451 0.1405 :ffffffff8107dfd7: mov $0x7fffffffffffffff,%rdx
> 411 0.1280 :ffffffff8107dfe1: xor %esi,%esi
> :ffffffff8107dfe3: mov %rbp,%rdi
> :ffffffff8107dfe6: mov %rdx,0x60(%rbx)
> 317 0.0988 :ffffffff8107dfea: callq ffffffff8107dd50 <force_quiescent
> 4510 1.4051 :ffffffff8107dfef: jmp ffffffff8107dfa1 <__call_rcu+0xc1
> :ffffffff8107dff1: nopw %cs:0x0(%rax,%rax,1)
>
>