Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg

From: Yang Shi
Date: Tue Oct 24 2017 - 20:35:00 EST




On 10/23/17 10:42 PM, Amir Goldstein wrote:
On Tue, Oct 24, 2017 at 7:12 AM, Yang Shi <yang.s@xxxxxxxxxxxxxxx> wrote:


On 10/22/17 1:24 AM, Amir Goldstein wrote:

On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@xxxxxxxxxxxxxxx> wrote:



On 10/19/17 8:14 PM, Amir Goldstein wrote:


On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@xxxxxxxxxxxxxxx>
wrote:


We observed some misbehaved user applications might consume significant
amount of fsnotify slabs silently. It'd better to account those slabs
in
kmemcg so that we can get heads up before misbehaved applications use
too
much memory silently.



In what way do they misbehave? create a lot of marks? create a lot of
events?
Not reading events in their queue?



It looks both a lot marks and events. I'm not sure if it is the latter
case.
If I knew more about the details of the behavior, I would elaborated more
in
the commit log.


If you are not sure, do not refer to user application as "misbehaved".
Is updatedb(8) a misbehaved application because it produces a lot of
access
events?


Should be not. It sounds like our in-house applications. But, it is a sort
of blackbox to me.


If you know which process is "misbehaving" you can look at
ls -l /proc/<pid>/fd |grep notify
and see the anonymous inotify/fanotify file descriptors

then you can look at /proc/<pid>/fdinfo/<fd> file of those
file descriptors to learn more about the fanotify flags etc.

Thanks for the hints.


...



But I think there is another problem, not introduced by your change, but
could
be amplified because of it - when a non-permission event allocation fails,
the
event is silently dropped, AFAICT, with no indication to listener.
That seems like a bug to me, because there is a perfectly safe way to deal
with
event allocation failure - queue the overflow event.


I'm not sure if such issue could be amplified by the accounting since once
the usage exceeds the limit any following kmem allocation would fail. So, it
might fail at fsnotify event allocation, or other places, i.e. fork, open
syscall, etc. So, in most cases the generator even can't generate new event
any more.


To be clear, I did not mean that kmem limit would cause a storm of dropped
events. I meant if you have a listener outside memcp watching a single file
for access/modifications and you have many containers each with its own
limited memcg, then event drops probability goes to infinity as you run more
of those kmem limited containers with event producers.

The typical output from my LTP test is filesystem dcache allocation error or
fork error due to kmem limit is reached.

And that should be considered a success result of the test.
The only failure case is when producer touches the file and event is
not delivered
nor an overflow event delivered.
You can probably try to reduce allocation failure for fork and dentry by:
1. pin dentry cache of subject file on test init by opening the file
2. set the low kmem limit after forking

Then you should probably loop the test enough times
in some of the times, producer may fail to access the file
in others if will succeed and produce events properly
and many some times, producer will access the file and event
will be dropped, so event count is lower than access count.

I still have not very direct test result shows the patch works as expected. But, I did get some prove from running fanotify test *without* your overflow event patch.

Running fanotify without your overflow patch, sometimes I can see:

[ 122.222455] SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
[ 122.224198] cache: fanotify_event_info(110:fsnotify), object size: 56, buffer size: 88, default order: 0, min order: 0
[ 122.226578] node 0: slabs: 11, objs: 506, free: 0
[ 122.227655] node 1: slabs: 0, objs: 0, free: 0
[ 122.229251] SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
[ 122.230912] cache: fanotify_event_info(110:fsnotify), object size: 56, buffer size: 88, default order: 0, min order: 0
[ 122.233266] node 0: slabs: 11, objs: 506, free: 0
[ 122.234337] node 1: slabs: 0, objs: 0, free: 0

The slub oom information is printed a couple of times, it should mean neither the event is delivered nor the overflow event is delivered.

With your overflow patch, I've never seen such message.

Regards,
Yang





I am not going to be the one to determine if fixing this alleged bug is a
prerequisite for merging your patch, but I think enforcing memory limits
on
event allocation could amplify that bug, so it should be fixed.

The upside is that with both your accounting fix and ENOMEM = overlflow
fix, it going to be easy to write a test that verifies both of them:
- Run a listener in memcg with limited kmem and unlimited (or very
large) event queue
- Produce events inside memcg without listener reading them
- Read event and expect an OVERFLOW even

This is a simple variant of LTP tests inotify05 and fanotify05.


I tried to test your patch with LTP, but it sounds not that easy to setup a
scenario to make fsnotify event allocation just hit the kmem limit, since
the limit may be hit before a new event is allocated, for example allocating
dentry cache in open syscall may hit the limit.

So, it sounds the overflow event might be not generated by the producer in
most cases.


Right. not as simple, but maybe still possible as I described above.
Assuming that my patch is not buggy...

Thanks,
Amir.