Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

From: John Johansen
Date: Fri Feb 09 2024 - 12:33:48 EST

Next message: Deucher, Alexander: "RE: [PATCH v6 0/6] AMD Pstate Fixes And Enhancements"
Previous message: Conor Dooley: "Re: [PATCH V2] dt-bindings: media: convert Mediatek consumer IR to the json-schema"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2/6/24 20:40, Neeraj Upadhyay wrote:

Gentle ping.
John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. Approach taken to fix the scalability issue is valid/correct?

Hi Neeraj,

I know your patchset has been waiting on review for a long time.
Unfortunately I have been very, very busy lately. I will try to
get to it this weekend, but I can't promise that I will be able
to get the review fully done.

john

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:

Problem Statement
=================

Nginx performance testing with Apparmor enabled (with nginx
running in unconfined profile), on kernel versions 6.1 and 6.5
show significant drop in throughput scalability, when Nginx
workers are scaled to higher number of CPUs across various
L3 cache domains.

Below is one sample data on the throughput scalability loss,
based on results on AMD Zen4 system with 96 CPUs with SMT
core count 2; so, overall, 192 CPUs:

Config Cache Domains apparmor=off apparmor=on
scaling eff (%) scaling eff (%)
8C16T 1 100% 100%
16C32T 2 95% 94%
24C48T 3 94% 93%
48C96T 6 92% 88%
96C192T 12 85% 68%

If we look at above data, there is a significant drop in
scaling efficiency, when we move to 96 CPUs/192 SMT threads.

Perf tool shows most of the contention coming from below
6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
6.22% nginx [kernel.vmlinux] [k] apparmor_file_open

The majority of the CPU cycles is found to be due to memory contention
in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
kref_put() operations on label.

Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
from 6.7 alleviates the issue to an extent, but not completely:

Config Cache Domains apparmor=on apparmor=on (patched)
scaling eff (%) scaling eff (%)
8C16T 1 100% 100%
16C32T 2 97% 93%
24C48T 3 94% 92%
48C96T 6 88% 88%
96C192T 12 65% 79%

This adverse impact gets more pronounced when we move to >192 CPUs.
The memory contention and impact increases with high frequency label
update operations and labels are marked stale more frequently.

Label Refcount Management
=========================

Apparmor uses label objects (struct aa_label) to manage refcounts for
below set of objects:

- Applicable profiles
- Namespaces (unconfined profile)
- Other non-profile references

These label references are acquired on various apparmor lsm hooks,
on operations such as file open, task kill operations, socket bind,
and other file, socket, misc operations which use current task's cred,
when the label for the current cred, has been marked stale. This is
done to check these operations against the set of allowed operations
for the task performing them.

Use Percpu refcount for ref management?
=======================================

The ref put operations (percpu_ref_put()) in percpu refcount,
in active mode, do not check whether ref count has dropped to
0. The users of the percpu_ref need to explicitly invoke
a percpu_ref_kill() operation, to drop the initial reference,
at shutdown paths. After the percpu_ref_kill() operation, ref
switches to atomic mode and any new percpu_ref_put() operation
checks for the drop to 0 case and invokes the release operation
on that label.

Labels are marked stale is situations like profile removal,
profile updates. For a namespace, the unconfined label reference
is dropped when the namespace is destroyed. These points
are potential shutdown points for labels. However, killing
the percpu ref from these points has few issues:

- The label could still be referenced by tasks, which are
still holding the reference to the now stale label.
Killing the label ref while these operations are in progress
will make all subsequent ref-put operations on the stale label
to be atomic, which would still result in memory contention.
Also, any new reference to the stale label, which is acquired
with the elevated refcount will have atomic op contention.

- The label is marked stale using a non-atomic write operation.
It is possible that new operations do not observe this flag
and still reference it for quite some time.

- Explicitly tracking the shutdown points might not be maintainable
at a per label granularity, as there can be various paths where
label reference could get dropped, such as, before the label has
gone live - object initialization error paths. Also, tracking
the shutdown points for labels which reference other labels -
subprofiles, merged labels requires careful analysis, and adds
heavy burden on ensuring the memory contention is not introduced
by these ref kill points.

Proposed Solution
=================

One potential solution to the refcount scalability problem is to
convert the label refcount to a percpu refcount, and manage
the initial reference from kworker context. The kworker
keeps an extra reference to the label and periodically scans
labels and release them if their refcount drops to 0.

Below is the sequence of operations, which shows the refcount
management with this approach:

1. During label initialization, the percpu ref is initialized in
atomic mode. This is done to ensure that, for cases where the
label hasn't gone live (->ns isn't assigned), mostly during
initialization error paths.

2. Labels are switched to percpu mode at various points -
when a label is added to labelset tree, when a unconfined profile
has been assigned a namespace.

3. As part of the initial prototype, only the in tree labels
are managed by the kworker. These labels are added to a lockless
list. The unconfined labels invoke a percpu_ref_kill() operation
when the namespace is destroyed.

4. The kworker does a periodic scan of all the labels in the
llist. It does below sequence of operations:

a. Enqueue a dummy node to mark the start of scan. This dummy
node is used as start point of scan and ensures that we
there is no additional synchronization required with new
label node additions to the llist. Any new labels will
be processed in next run of the kworker.

SCAN START PTR
|
v
+----------+ +------+ +------+ +------+
| | | | | | | |
| head ------> dummy|--->|label |--->| label|--->NULL
| | | node | | | | |
+----------+ +------+ +------+ +------+

New label addition:

SCAN START PTR
|
v
+----------+ +------+ +------+ +------+ +------+
| | | | | | | | | |
| head |--> label|--> dummy|--->|label |--->| label|--->NULL
| | | | | node | | | | |
+----------+ +------+ +------+ +------+ +------+

b. Traverse through the llist, starting from dummy->next.
If the node is a dummy node, mark it free.
If the node is a label node, do,

i) Switch the label ref to atomic mode. The ref switch wait
for the existing percpu_ref_get() and percpu_ref_put()
operations to complete, by waiting for a RCU grace period.

Once the switch is complete, from this point onwards, any
percpu_ref_get(), percpu_ref_put() operations use
atomic operations.

ii) Drop the initial reference, which was taken while adding
the label node to the llist.

iii) Use a percpu_ref_tryget() increment operation on the
ref, to see if we dropped the last ref count. if we
dropped the last count, we remove the node from the llist.

All of these operations are done inside a RCU critical
section, to avoid race with the release operations,
which can potentially trigger, as soon as we drop
the initial ref count.

iv) If we didn't drop the last ref, switch back the counter
to percpu mode.

Using this approach, to move the atomic refcount manipulation out of the
contended paths, there is a significant scalability improvement seen on
nginx test, and scalability efficiency is close to apparmor-off case.

Config Cache Domains apparmor=on (percpuref)
scaling eff (%)
8C16T 1 100%
16C32T 2 96%
24C48T 3 94%
48C96T 6 93%
96C192T 12 90%

Limitations
===========

1. Switching to percpu refcount increases memory size overhead, as
percpu memory is allocated for all labels.

2. Deferring labels reclaim could potentially result in memory
pressure, when there are high frequency of label update operations.

3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
These can impact energy efficiency, due to back to back hurry
callbacks. Using deferrable workqueue partly mitigates this.
However, deferring kworker can delay reclaims.

4. Back to back label switches can delay other percpu users, as
there is a single global switch spinlock used by percpu refcount
lib.

5. Long running kworker can delay other use cases like system suspend.
This is mitigated using freezable workqueue and litming node
scans to a max count.

6. There is a window where label operates is atomic mode, when its
counter is being checked for zero. This can potentially result
in high memory contention, during this window which spans RCU
grace period (plus callback execution). For example, when
scanning label corresponding to unconfined profile, all
applications which use unconfined profile would be using
atomic ref increment and decrement operations.

There are a few apparoaches which were tried to mitigate this issue:

a. At a lower time interval, check if scanned label's counter
has changed since the start of label scan. If there is a change
in count, terminate the switch to atomic mode. Below shows the
apparoch using rcuwait.

static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
{
WRITE_ONCE(aa_atomic_switch_complete, true);
rcuwait_wake_up(&aa_reclaim_rcuwait);
}

rcuwait_init(&aa_reclaim_rcuwait);
percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);

atomic_count = percpu_ref_count_read(&label->count);
do {
rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
(switch_complete = READ_ONCE(aa_atomic_switch_complete)),
TASK_IDLE,
msecs_to_jiffies(5));
if (percpu_ref_count_read(&label->count) != atomic_count)
break;
} while (!READ_ONCE(switch_complete));

However, this approach does not work, as percpu refcount lib does not
allow termination of an ongoing switch operation. Also, the counter
can return to the original value with set of get() and put() operations
before we check the current value.

b. Approaches to notify the reclaim kworker from ref get and put operations
can potentially disturb cache line state between the various CPU
contexts, which are referncing the label, and can potentially impact
scalability again.

c. Swith the label to an immortal percpu ref, while the scan operates
on the current counter.

Below is the sequence of operations to do this:

1. Ensure that both immortal ref and label ref are in percpu mode.
Reinit the immortal ref in percpu mode.

Swap percpu and atomic counters of label refcount and immortal ref
percpu-ref
+-------------------+
+-------+ | percpu-ctr-addr1 |
| label | --------->|-------------------| +----------------+
+-------+ | data |--->| Atomic counter1|
+-------------------+ +----------------+
+-------+ +-------------------+
|ImmLbl |---------->| percpu-ctr-addr2 | +----------------+
+-------+ |-------------------|--->| Atomic counter2|
| data | +----------------+
+-------------------+

label ->percpu-ctr-addr = percpu-ctr-addr2
ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
label ->data->count = Atomic counter2
ImmLbl ->data->count = Atomic counter1
2. Check the counters collected in immortal label, by switch it
to atomic mode.

3. If the count is 0, do,
a. Switch immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from new percpu ref get
and get operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in immortal ref to atomic
counter of label percpu refcount.
d. Kill immortal ref, for reinit on next iteration.
e. Switch label percpu ref to atomic mode.
f. If the counter is 1, drop the initial ref.

4. If the count is not 0, re-swap the counters.
a. Switch immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from new percpu ref get
and get operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in immortal ref to atomic
counter of label percpu refcount.
d. Kill immortal ref, for reinit on next iteration.

Using this approach, we ensure that, label ref users do not switch
to atomic mode, while there are active references on the label.
However, this approach requires multiple percpu ref mode switches
and adds high overhead and complexity to the scanning code.

Extended/Future Work
====================

1. Look for ways to fix the limitations, as described in the "Limitations"
section.

2. Generalize the approach to percpu rcuref, which is used for contexts
where release path uses RCU grace period for release operations. Patch
7 creates an initial prototype for this.

3. Explore hazard pointers for scalable refcounting of labels.

Highly appreciate any feedback/suggestions on the design approach.

The patches of this patchset introduce following changes:

1. Documentation of Apparmor Refcount management.

2. Switch labels to percpu refcount in atomic mode.

Use percpu refcount for apparmor labels. Initial patch to init
the percpu ref in atomic mode, to evaluate the potential
impact of percpuref on top of kref based implementation.

3. Switch unconfined namespaces refcount to percpu mode.

Switch unconfined ns labels to percpu mode, and kill the
initial refcount from namespace destroy path.

4. Add infrastructure to reclaim percpu labels.

Add a label reclaim infrastructure for labels which are
in percpu mode, for managing their inital refcount.

5. Switch intree labels to percpu mode.

Use label reclaim infrastruture to manage intree labels.

6. Initial prototype for optimizing ref switch.

Prototype for reducing the time window when a label
scan switches the label ref to atomic mode.

7. percpu-rcuref: Add basic infrastructure.

Prototype for Percpu refcounts for objects, which protect
their object reclaims using RCU grace period.

8. Switch labels to percpu rcurefcount in unmanaged mode.

Use percpu rcuref for labels. Start with unmanaged/atomic
mode.

9. Switch unconfined and in tree labels to managed ref mode.

Use percpu mode with manager worker for unconfined and intree
labels.

------------------------------------------------------------------------

b/Documentation/admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++
b/Documentation/admin-guide/LSM/index.rst | 1
b/Documentation/admin-guide/kernel-parameters.txt | 8 +
b/include/linux/percpu-rcurefcount.h | 115 ++++++++++++++++
b/include/linux/percpu-refcount.h | 2
b/lib/Makefile | 2
b/lib/percpu-rcurefcount.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
b/lib/percpu-refcount.c | 93 +++++++++++++
b/security/apparmor/include/label.h | 16 +-
b/security/apparmor/include/policy.h | 8 -
b/security/apparmor/include/policy_ns.h | 24 +++
b/security/apparmor/label.c | 11 +
b/security/apparmor/lsm.c | 145 ++++++++++++++++++++
b/security/apparmor/policy_ns.c | 6
include/linux/percpu-refcount.h | 2
lib/percpu-refcount.c | 93 -------------
security/apparmor/include/label.h | 17 +-
security/apparmor/include/policy.h | 56 +++----
security/apparmor/include/policy_ns.h | 24 ---
security/apparmor/label.c | 11 -
security/apparmor/lsm.c | 325 ++++++++++++----------------------------------
security/apparmor/policy_ns.c | 8 -
22 files changed, 1237 insertions(+), 417 deletions(-)

base-commit: ab27740f7665

Next message: Deucher, Alexander: "RE: [PATCH v6 0/6] AMD Pstate Fixes And Enhancements"
Previous message: Conor Dooley: "Re: [PATCH V2] dt-bindings: media: convert Mediatek consumer IR to the json-schema"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]