Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set

From: Chris Riches
Date: Mon Oct 16 2023 - 13:41:40 EST


Hi Paul, thanks for the reply.


> I think the basic approach is good, but I think we can simply things
> slightly by using a resettable boolean as opposed to an integer flag
> for the ACK.  I've pasted in a quick, untested patch (below) to better
> demonstrate the idea, let me know what you think.

The simplified patch doesn't look quite right. I had some trouble
getting it to apply (seems tabs were changed into spaces, even when I
downloaded the raw email). While typing it out manually, I noticed that
the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
begin with, then ack will be false to begin with, and so no ACK will be
sent even if there is an error. Handling this case is why I used the
three-state system to begin with. I think we could still use a boolean
with a condition of just (err || ack), with the caveat that we wouldn't
easily be able to send an early ACK on an error (not that the current
patch needs this - just thinking of reusability).


> Regardless of any other issues, I think you brought up a good point
> about there being socket buffer contention when the queues are
> full(ish) and an audit daemon connects to the kernel and while I'm not
> sure that this patch will fully resolve that issue, I do think it
> would be good to have (I'm doubtful if it can be fully resolved
> without some really awful hacks).

That roughly lines up with what I was seeing - could you elaborate any
further on what "awful hacks" would be needed to fully resolve this?
I'm intrigued as to where else the contention could be coming from that
wouldn't be covered by this patch.


> The ENOBUFS errors are coming from the netlink layer and are likely a
> sign of extreme load.  I'm not very familiar with the audit userspace
> code, but it might be helpful to see if you can increase the socket
> buffer size for the audit daemon.

I believe we tried increasing the buffer size in the toy repro and it
didn't make much difference - perhaps doing it in the real audit daemon
might help.


> I'm also not surprised to hear that as you increase the number of CPUs
> the problem goes away.  With more CPUs the system is able to schedule
> more threads simultaneously to maintain the kernel's audit queues and
> execute the audit daemon to drain both the netlink socket buffer and
> audit queues.

My intuition told me the opposite - that more parallel activity would
increase the chance of the socket buffer being crammed full before the
one thread could return to the audit daemon and allow it to start
processing events. Does the size/number of audit queues perhaps scale
with the number of CPUs?


> I suggest bringing this up with the audit userspace developer if you
> haven't already.  While we can, and should, improve things on the
> kernel side (e.g. the patch we are discussing), it also sounds like
> the userspace side has room for improvement as well.

That sounds like a good idea - can you point me to who that is? I was
under the impression that this was the mailing list for both the kernel
and userspace sides.


Thanks,
Chris