Re: [tip:locking/core] refcount_t: Introduce a special purpose refcount type

From: Ingo Molnar
Date: Wed Feb 15 2017 - 04:05:03 EST



* Reshetova, Elena <elena.reshetova@xxxxxxxxx> wrote:

> > Subject: refcount: Out-of-line everything
> > From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Date: Fri Feb 10 16:27:52 CET 2017
> >
> > Linus asked to please make this real C code.
>
> Perhaps a completely stupid question, but I am going to ask anyway since this
> is the only way I can learn. What real difference does it make? Or are we
> talking more about styling, etc.? Is it because of size concerns? This way it
> is certainly done differently than the rest of atomic and kref...

It's mostly about generated code size.

This inline function is way too large to be inlined:

static inline __refcount_check
bool refcount_add_not_zero(unsigned int i, refcount_t *r)
{
        unsigned int old, new, val = atomic_read(&r->refs);

        for (;;) {
                if (!val)
                        return false;

                if (unlikely(val == UINT_MAX))
                        return true;

                new = val + i;
                if (new < val)
                        new = UINT_MAX;

                old = atomic_cmpxchg_relaxed(&r->refs, val, new);
                if (old == val)
                        break;

                val = old;
        }

        REFCOUNT_WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");

        return true;
}
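
(Reconstructing from the disassembly below - the 'lea 0x1(%rcx)' implies
i == 1 - the test site was presumably a trivial wrapper along these lines;
the exact test code isn't shown here:)

bool test(refcount_t *r)
{
        return refcount_add_not_zero(1, r);
}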

When used from such a call site, this inline function generates this much code
on an x86-64 defconfig:

00000000000084d0 <test>:
84d0: 8b 0f mov (%rdi),%ecx
84d2: 55 push %rbp
84d3: 48 89 e5 mov %rsp,%rbp

84d6: 85 c9 test %ecx,%ecx |
84d8: 74 30 je 850a <test+0x3a> |
84da: 83 f9 ff cmp $0xffffffff,%ecx |
84dd: be ff ff ff ff mov $0xffffffff,%esi |
84e2: 75 04 jne 84e8 <test+0x18> |
84e4: eb 1d jmp 8503 <test+0x33> |
84e6: 89 c1 mov %eax,%ecx |
84e8: 8d 51 01 lea 0x1(%rcx),%edx |
84eb: 89 c8 mov %ecx,%eax |
84ed: 39 ca cmp %ecx,%edx |
84ef: 0f 42 d6 cmovb %esi,%edx |
84f2: f0 0f b1 17 lock cmpxchg %edx,(%rdi) |
84f6: 39 c8 cmp %ecx,%eax |
84f8: 74 09 je 8503 <test+0x33> |
84fa: 85 c0 test %eax,%eax |
84fc: 74 0c je 850a <test+0x3a> |
84fe: 83 f8 ff cmp $0xffffffff,%eax |
8501: 75 e3 jne 84e6 <test+0x16> |
8503: b8 01 00 00 00 mov $0x1,%eax |

8508: 5d pop %rbp
8509: c3 retq
850a: 31 c0 xor %eax,%eax
850c: 5d pop %rbp
850d: c3 retq


(I've annotated the fastpath impact with '|'. Out-of-line code generally
doesn't count, as it's rarely executed and doesn't pollute the hot I$ lines.)

It's ~50 bytes of fastpath code per usage site (0x8508 - 0x84d6 == 0x32 == 50
bytes). It might be better in some cases, but not by much.

This is way above any sane inlining threshold. The 'unconditionally good' inlining
threshold is at 1-2 instructions and below ~10 bytes of code.

So for example refcount_set() and refcount_read() can stay inlined:

static inline void refcount_set(refcount_t *r, unsigned int n)
{
        atomic_set(&r->refs, n);
}

static inline unsigned int refcount_read(const refcount_t *r)
{
        return atomic_read(&r->refs);
}


... because they compile into a single instruction with 2-5 bytes of I$ overhead.
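
The larger primitives, by contrast, are better off declared in the header and
defined once out of line - roughly this shape (a sketch of the pattern; see
PeterZ's patch for the real thing):

/* include/linux/refcount.h: declaration only */
extern __must_check bool refcount_add_not_zero(unsigned int i, refcount_t *r);

/* lib/refcount.c: a single out-of-line copy of the body */
bool refcount_add_not_zero(unsigned int i, refcount_t *r)
{
        /* ... same cmpxchg loop as quoted above ... */
}
EXPORT_SYMBOL(refcount_add_not_zero);

That turns the ~50 bytes of inlined fastpath at every usage site into a
single 5-byte 'call' instruction.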

Thanks,

Ingo