[RFC][PATCH 00/10] On inode::i_count and the usage vs reference count issue

From: Peter Zijlstra
Date: Fri Feb 24 2017 - 14:05:36 EST


(my apologies if this arrives a second time; I seem to have
fat-fingered my send command the first time and things reached
neither me nor the list).

Hi all,

So I'm not entirely happy with these patches; but I don't really know
fs/inode.c as well as some of you and I figured I'd reached a point where I
need feedback (or maybe I'm well past that, we'll see).

So the kernel has recently grown a reference count type, refcount_t; this
thing is fairly strict with semantics, such that it can give 'helpful' warnings
when people 'accidentally' violate the rules and create bugs.

The one at the core of this patch set is that refcount_t assumes 0 means 'free'
or 'freeing'.

The problem is that inode::i_count is _not_ a reference count, it is a usage
count (for lack of a better name), it counts how many active users of the inode
are out there. But 0 users is a perfectly fine state for an inode to be in,
it'll just sit in the cache waiting for a new user (or reclaim).

Now refcount_t has no operations to increment once we've hit 0, because if you
assume 0 means 'free', increment from 0 means use-after-free, and that's a bad
thing.
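
To illustrate the rule, here is a minimal userspace sketch in C11 atomics. It
is not the kernel's refcount_t implementation; the model_ref type and
model_ref_inc_not_zero() are hypothetical names made up for this example, and
overflow/saturation handling is left out. The only point is that an increment
is refused once the count has reached 0:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a strict reference count. */
struct model_ref {
	atomic_uint count;
};

/*
 * Take a reference unless the count is already 0.
 * Returns false when the object must be treated as free/freeing.
 */
static bool model_ref_inc_not_zero(struct model_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0)
			return false;	/* 0 means 'free': refuse to resurrect */
		/* on failure the cmpxchg reloads 'old' and we retry */
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

	return true;
}
```

A usage count like i_count wants exactly the 0 -> 1 step that this helper
refuses, which is the mismatch described above.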

So what this patch-set attempts is doing a +1 bias on the usage-count to turn
it into an actual reference count, where the extra reference is the pointer the
cache itself has to the object.

This then results in the need to do something like: dec_and_lock at the 2->1
transition instead of the usual 1->0; for this purpose we introduce
refcount_dec_unless().
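
As a rough illustration of the biased scheme, the sketch below models it in
userspace C11. The names model_dec_unless() and model_put_fast() are
hypothetical and the signature is an assumption; it need not match
refcount_dec_unless() as proposed in the patches. The count is "users + 1",
the +1 being the cache's own pointer, so the last user going away is the 2->1
transition:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model: decrement *count unless it currently equals 'unless'.
 * Returns true if the decrement happened. The caller is assumed to hold a
 * reference, so the count is never 0 here.
 */
static bool model_dec_unless(atomic_uint *count, unsigned int unless)
{
	unsigned int old = atomic_load(count);

	do {
		if (old == unless)
			return false;
		/* on failure the cmpxchg reloads 'old' and we retry */
	} while (!atomic_compare_exchange_weak(count, &old, old - 1));

	return true;
}

/*
 * Fast path of a put on a +1 biased count. Returns true if the reference
 * was dropped locklessly; false means we are at the 2->1 transition and the
 * caller has to take the lock before doing the final decrement.
 */
static bool model_put_fast(atomic_uint *count)
{
	return model_dec_unless(count, 2);
}
```

With two users plus the cache (count == 3), the first put succeeds locklessly
and leaves 2; the second put returns false and leaves the count untouched, so
the slow path can redo the decrement under the lock.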

So far, it sounds fairly sensible; _except_ for the wee little problem that a
fair amount of code looks at the value of i_count. Some of this is fine, e.g.
the evict path verifies it is indeed 0. But other places look at !0 values and
those are suspect.

To make matters worse; once i_count is a refcount, it appears trivial to avoid
inode_hash_lock for lookups (yay RCU!) and looking at i_count becomes even more
of a problem because then holding i_lock will not in fact stabilize it anymore.

So I've 'ignored' (by assuming they were already broken) the i_count
observers and done that RCU conversion -- even though I have no idea what
workload would hit the global inode_hash_lock hard enough for it to matter
(see, maybe I'm well past the point where I could've used feedback).


There's a number of options here:

- I'm not completely insane, and these patches can be made to work.

- We decide usage-counts are useful and try and support them in refcount_t;
  this has the down-side that people can more easily write bad code (by doing
  increments from 0 that should not have happened).

- We decide usage-counts need their own type (urgh, more...).

- None of the above, we keep i_count as is and let people hunt and convert
  actual refcounts.


I'm ok with all those; I just figured it'd be 'fun' to convert something
non-trivial. FWIW, this boots and builds a kernel (but that's about all the
testing it's had).