Re: [BUG] kernel softlockup due to sidtab_search_context running for a long time because of too many sidtab context nodes

From: yangjihong
Date: Sat Dec 16 2017 - 05:29:18 EST


>On 12/15/2017 08:56 AM, Stephen Smalley wrote:
>> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I am doing stress testing on a 3.10 kernel (CentOS 7.4),
>>>>>>>>>>> constantly starting numbers of docker containers with SELinux
>>>>>>>>>>> enabled, and after about 2 days the kernel hits a softlockup panic:
>>>>>>>>>>> <IRQ>  [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>>> [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>>> [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>>> [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>>> [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>>> <EOI>  [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>>> [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>>>>> [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>>>>>> [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>>>>>> [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>>>>>
>>>>>>>>>>> My opinion:
>>>>>>>>>>> when a docker container starts, it mounts an overlay filesystem
>>>>>>>>>>> with a different SELinux context; the mount points look like:
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>>
>>>>>>>>>>> sidtab_search_context checks whether the context is already in the
>>>>>>>>>>> sidtab list; if it is not found, a new node is generated and inserted
>>>>>>>>>>> into the list. As the number of containers increases, so does the
>>>>>>>>>>> number of context nodes; in our testing the final number of nodes
>>>>>>>>>>> reached 300,000+, and a single sidtab_context_to_sid call took
>>>>>>>>>>> 100-200ms, which leads to the system softlockup.
>>>>>>>>>>>
>>>>>>>>>>> Is this an SELinux bug? When a filesystem is unmounted, why is its
>>>>>>>>>>> context node not deleted? I cannot find the relevant function to
>>>>>>>>>>> delete the node in sidtab.c.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading and looking forward to your reply.
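
[For illustration, here is a simplified sketch of the reverse lookup the report describes; it is an approximation for readability, not the verbatim 3.10 source, and it assumes the existing struct sidtab / struct sidtab_node / context_cmp() definitions from security/selinux/ss/. Every bucket and every chained node has to be visited before a context can be declared new, so mapping the Nth unique context costs O(N) comparisons.]

/*
 * Simplified sketch of the reverse lookup described above.  The table
 * is hashed on SID, not on context, so a context-to-SID lookup must
 * walk everything.
 */
static u32 sidtab_search_context(struct sidtab *s, struct context *context)
{
        int i;
        struct sidtab_node *cur;

        for (i = 0; i < SIDTAB_SIZE; i++) {
                for (cur = s->htable[i]; cur; cur = cur->next) {
                        /* full MLS/MCS context comparison on every node */
                        if (context_cmp(&cur->context, context))
                                return cur->sid;
                }
        }
        return 0;       /* not found; caller allocates a new SID and node */
}
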
>>>>>>>>>> So, does docker just keep allocating a unique category set for
>>>>>>>>>> every new container, never reusing them even if the container
>>>>>>>>>> is destroyed?
>>>>>>>>>> That would be a bug in docker IMHO. Or are you creating an
>>>>>>>>>> unbounded number of containers and never destroying the older
>>>>>>>>>> ones?
>>>>>>>>> You can't reuse the security context. A process in ContainerA
>>>>>>>>> sends a labeled packet to MachineB. ContainerA goes away and
>>>>>>>>> its context is recycled in ContainerC. MachineB responds some
>>>>>>>>> time later, again with a labeled packet. ContainerC gets
>>>>>>>>> information intended for ContainerA, and uses the information
>>>>>>>>> to take over the Elbonian government.
>>>>>>>> Docker isn't using labeled networking (nor is anything else by
>>>>>>>> default; it is only enabled if explicitly configured).
>>>>>>> If labeled networking weren't an issue we'd have full security
>>>>>>> module stacking by now. Yes, it's an edge case. If you want to
>>>>>>> use labeled NFS or a local filesystem that gets mounted in each
>>>>>>> container (don't tell me that nobody would do that) you've got
>>>>>>> the same problem.
>>>>>> Even if someone were to configure labeled networking, Docker is
>>>>>> not presently relying on that or SELinux network enforcement for
>>>>>> any security properties, so it really doesn't matter.
>>>>> True enough. I can imagine a use case, but as you point out, it
>>>>> would be a very complex configuration and coordination exercise
>>>>> using SELinux.
>>>>>
>>>>>> And if they wanted
>>>>>> to do that, they'd have to coordinate category assignments across
>>>>>> all systems involved, for which no facility exists AFAIK. If you
>>>>>> have two docker instances running on different hosts, I'd wager
>>>>>> that they can hand out the same category sets today to different
>>>>>> containers.
>>>>>>
>>>>>> With respect to labeled NFS, that's also not the default for nfs
>>>>>> mounts, so again it is a custom configuration and Docker isn't
>>>>>> relying on it for any guarantees today. For local filesystems,
>>>>>> they would normally be context-mounted or using genfscon rather
>>>>>> than xattrs in order to be accessible to the container, thus no
>>>>>> persistent storage of the category sets.
>>>> Well Kubernetes and OpenShift do set the labels to be the same
>>>> within a project, and they can manage across nodes. But yes we are
>>>> not using labeled networking at this point.
>>>>> I know that is the intended configuration, but I see people do all
>>>>> sorts of stoopid things for what they believe are good reasons.
>>>>> Unfortunately, lots of people count on containers to provide
>>>>> isolation, but create "solutions" for data sharing that defeat it.
>>>>>
>>>>>> Certainly docker could provide an option to not reuse category
>>>>>> sets, but making that the default is not sane and just guarantees
>>>>>> exhaustion of the SID and context space (just create and tear down
>>>>>> lots of containers every day or more frequently).
>>>>> It seems that Docker might have a similar issue with UIDs, but it
>>>>> takes longer to run out of UIDs than sidtab entries.
>>>>>
>>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>>>>>> use of /sys/fs/selinux/user (sel_write_user ->
>>>>>>>>>> security_get_user_sids) entirely, which is what triggered this
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> We cannot currently delete a sidtab node because we have no
>>>>>>>>>> way of knowing if there are any lingering references to the
>>>>>>>>>> SID.
>>>>>>>>>> Fixing that would require reference-counted SIDs, which goes
>>>>>>>>>> beyond just SELinux since SIDs/secids are returned by LSM
>>>>>>>>>> hooks and cached in other kernel data structures.
>>>>>>>>> You could delete a sidtab node. The code already deals with
>>>>>>>>> unfindable SIDs. The issue is that eventually you run out of
>>>>>>>>> SIDs.
>>>>>>>>> Then you are forced to recycle SIDs, which leads to the
>>>>>>>>> overthrow of the Elbonian government.
>>>>>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>>>>>> aren't reference counted and we can't know whether it is still
>>>>>>>> in use somewhere in the kernel. Doing so prematurely would lead
>>>>>>>> to the SID being remapped to the unlabeled context, and then
>>>>>>>> likely to undesired denials.
>>>>>>> I would suggest that if you delete a sidtab node and someone
>>>>>>> comes along later and tries to use it that denial is exactly what
>>>>>>> you would desire. I don't see any other rational action.
>>>>>> Yes, if we know that the SID wasn't in use at the time we tore it
>>>>>> down.
>>>>>> But if we're just randomly deleting sidtab entries based on age
>>>>>> or something (since we have no reference count), we'll almost
>>>>>> certainly encounter situations where a SID hasn't been accessed in
>>>>>> a long time but is still being legitimately cached somewhere.
>>>>>> Just a file that hasn't been accessed in a while might have that
>>>>>> SID still cached in its inode security blob, or anywhere else.
>>>>>>
>>>>>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>>>>>> negative case; there was an earlier optimization for the
>>>>>>>>>> positive case by adding a cache to sidtab_context_to_sid()
>>>>>>>>>> prior to calling it. It's a reverse lookup in the sidtab.
>>>>>>>>> This seems like a bad idea.
>>>>>>>> Not sure what you mean, but it can certainly be changed to at
>>>>>>>> least use a hash table for these reverse lookups.
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
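
[To make the hash-table suggestion above concrete, here is a rough sketch of what a reverse index keyed on a hash of the context could look like. The names context_hash(), rev_htable, SIDTAB_REV_BUCKETS and sidtab_reverse_lookup() are invented for illustration and are not existing kernel symbols; the point is only that hashing the context first confines the comparison to one bucket instead of the whole table.]

/* Hypothetical reverse index; none of these symbols exist today. */
#define SIDTAB_REV_BUCKETS 1024

struct sidtab_rev_node {
        struct sidtab_node *node;       /* entry in the real sidtab */
        struct sidtab_rev_node *next;
};

static struct sidtab_rev_node *rev_htable[SIDTAB_REV_BUCKETS];

static u32 sidtab_reverse_lookup(struct context *context)
{
        u32 bucket = context_hash(context) % SIDTAB_REV_BUCKETS;
        struct sidtab_rev_node *cur;

        /* only contexts hashing to this bucket are compared */
        for (cur = rev_htable[bucket]; cur; cur = cur->next)
                if (context_cmp(&cur->node->context, context))
                        return cur->node->sid;

        return 0;       /* not present; fall back to allocating a new entry */
}

[New entries would be added to both indexes under the existing sidtab lock, so the cost of the negative case stays roughly constant even with hundreds of thousands of contexts.]
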
>>> Thanks for the reply and discussion.
>>> I think the docker container is only one case. Is it possible that a
>>> similar path, through some means of attack, could trigger a constantly
>>> growing SID list, eventually leading to a system panic?
>>>
>>> I think the issue is that it takes too long to search for a SID node when
>>> the SID list is too large. If we can optimize the nodes' data structure
>>> (e.g. a tree structure) or the search algorithm so that traversing all
>>> nodes takes very little time even with many nodes, that might solve the
>>> problem.
>>> Or, sidtab.c could provide a "delete_sidtab_node" interface, so that when
>>> a filesystem is unmounted its SID node is deleted. Because the SID is
>>> useless once the fs is unmounted, deleting it would keep the size of the
>>> SID list under control.
>>>
>>> Thanks for reading and looking forward to your reply.
>> We cannot safely delete entries in the sidtab without first adding
>> reference counting of SIDs, which goes beyond just SELinux since they
>> are cached in other kernel data structures and returned by LSM hooks.
>> That's a non-trivial undertaking.
>>
>> Far more practical in the near term would be to introduce a hash table
>> or other mechanism for efficient reverse lookups in the sidtab. Are
>> you offering to implement that or just requesting it?
>>
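[Purely as an illustration of why reference counting is the prerequisite for safe deletion, and why it reaches beyond SELinux, a refcounted entry might look like the hypothetical sketch below. The refcnt field, sid_put() and sidtab_remove_node() are invented names; every kernel location that stores a secid (inode security blobs, audit records, caches filled via LSM hooks) would have to take and drop such references.]

/* Hypothetical only: no such refcounting exists today. */
struct sidtab_node {
        u32 sid;
        struct context context;
        refcount_t refcnt;              /* held by every cached user of the SID */
        struct sidtab_node *next;
};

static void sid_put(struct sidtab *s, struct sidtab_node *node)
{
        /* last user gone: the entry could now be unhashed and freed */
        if (refcount_dec_and_test(&node->refcnt))
                sidtab_remove_node(s, node);    /* hypothetical helper */
}
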
Because I'm not very familiar with the overall architecture of SELinux, I may not be able to offer to implement it, sorry.
Please tell me what I can do if I can help.
If there is any progress (i.e. once a solution or optimization method is determined), could you please let me know? Thanks!

>> Independent of that, docker should support reuse of category sets when
>> containers are deleted, at least as an option and probably as the
>> default.
>>
>>
>Docker does reuse categories of containers that are removed, by default.

Thanks for reading and looking forward to your reply.
Best wishes!