Re: tip -ENOBOOT - bisected to locking/refcounts, x86/asm: Implement fast refcount overflow protection

From: Kees Cook
Date: Wed Aug 30 2017 - 15:19:19 EST


On Wed, Aug 30, 2017 at 10:55 AM, Mike Galbraith <efault@xxxxxx> wrote:
> On Wed, 2017-08-30 at 10:32 -0700, Kees Cook wrote:
>> On Wed, Aug 30, 2017 at 10:13 AM, Mike Galbraith <efault@xxxxxx> wrote:
>> > On Wed, 2017-08-30 at 09:35 -0700, Kees Cook wrote:
>> >> On Tue, Aug 29, 2017 at 10:02 PM, Mike Galbraith <efault@xxxxxx> wrote:
>> >> > On Tue, 2017-08-29 at 11:41 -0700, Kees Cook wrote:
>> >> >> Can you also test with 14afee4b6092 ("net: convert sock.sk_wmem_alloc
>> >> >> from atomic_t to refcount_t") reverted (instead of ARCH_HAS_REFCOUNT
>> >> >> disabled)?
>> >> >
>> >> > Nogo.
>> >>
>> >> Thanks for checking!
>> >>
>> >> > [ 44.901930] WARNING: CPU: 5 PID: 0 at net/netlink/af_netlink.c:374 netlink_sock_destruct+0x82/0xa0
>> >>
>> >> This is so odd if 14afee4b6092 is reverted. What is line 374 for you
>> >> in net/netlink/af_netlink.c?
>> >
>> > 374 WARN_ON(atomic_read(&sk->sk_rmem_alloc));
>> >
>> > That line is unchanged by 14afee4b6092.
>>
>> Uuuuhmm. Wow, now I'm really baffled. I thought you were getting the
>> warn from the next line with the refcount usage... I will keep
>> digging. Thanks!
>
> I just double checked freshly pulled tip (rapidly moving target), and
> it's definitely nogo with CONFIG_ARCH_HAS_REFCOUNT=y and 14afee4b6092
> reverted.

Okay, thanks for double-checking. I'd be curious to see if it would
help to revert any of these too:

fb5c2c17a556 net: convert packet_fanout.sk_ref from atomic_t to refcount_t
b4217b82893c net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t
c122e14df2d6 net: convert net.passive from atomic_t to refcount_t
edcb691871b2 net: convert inet_frag_queue.refcnt from atomic_t to refcount_t
717d1e993ad8 net: convert fib_rule.refcnt from atomic_t to refcount_t
8c9814b97002 net: convert unix_address.refcnt from atomic_t to refcount_t
433cea4d9bbb net: convert netpoll_info.refcnt from atomic_t to refcount_t
7658b36f1b31 net: convert in_device.refcnt from atomic_t to refcount_t
8851ab526791 net: convert ip_mc_list.refcnt from atomic_t to refcount_t
41c6d650f653 net: convert sock.sk_refcnt from atomic_t to refcount_t
14afee4b6092 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
2638595afccf net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t
633547973ffc net: convert sk_buff.users from atomic_t to refcount_t
53869cebce4b net: convert nf_bridge_info.use from atomic_t to refcount_t
6343944bc105 net: convert neigh_params.refcnt from atomic_t to refcount_t
9f23743017d1 net: convert neighbour.refcnt from atomic_t to refcount_t
1cc9a98b59ba net: convert inet_peer.refcnt from atomic_t to refcount_t

What I find so strange about this is that it's tripping a warning in
an atomic_t state, and works fine with FULL or "none". x86-refcount
has almost identical behavior to "none" (though obviously, this seems
like those differences must be the source of the bug), but FULL is
much more aggressive. So some silent x86-refcount issue is causing a
side-effect in af_netlink... Something very subtle is happening here.
(So subtle I can't reproduce it yet, sadly...)

-Kees

--
Kees Cook
Pixel Security