Re: [PATCH] Convert struct pid count to refcount_t

From: Jann Horn
Date: Wed Mar 27 2019 - 21:00:14 EST


On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > struct pid's count is an atomic_t field used as a refcount. Use
> > refcount_t for it which is basically atomic_t but does additional
> > checking to prevent use-after-free bugs. No change in behavior if
> > CONFIG_REFCOUNT_FULL=n.
> >
> > Cc: keescook@xxxxxxxxxxxx
> > Cc: kernel-team@xxxxxxxxxxx
> > Cc: kernel-hardening@xxxxxxxxxxxxxxxxxx
> > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..2095c7da644d 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -37,7 +37,7 @@
> > #include <linux/init_task.h>
> > #include <linux/syscalls.h>
> > #include <linux/proc_ns.h>
> > -#include <linux/proc_fs.h>
> > +#include <linux/refcount.h>
> > #include <linux/sched/task.h>
> > #include <linux/idr.h>
> >
> > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > return;
> >
> > ns = pid->numbers[pid->level].ns;
> > - if ((atomic_read(&pid->count) == 1) ||
> > - atomic_dec_and_test(&pid->count)) {
> > + if ((refcount_read(&pid->count) == 1) ||
> > + refcount_dec_and_test(&pid->count)) {
>
> Why is this (and the original code) safe in the face of a race against
> get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> don't see this code pattern anywhere else in the kernel.

Semantically, it doesn't make a difference whether you do this or
leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
refcount_read(), then you have the only reference to "struct pid", and
therefore you want to free it. If you don't read a 1, you have to
atomically drop a reference. If someone else is concurrently dropping
a reference too, that may leave you holding the last reference (the
case where refcount_dec_and_test() returns true), and then you still
have to take care of freeing it.
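
To illustrate, here is a rough userspace sketch of the two variants using
C11 atomics. It is not the kernel's refcount_t code; struct obj,
obj_get(), obj_put_fastpath() and obj_put_plain() are hypothetical names,
and "freeing" is only recorded in a flag so the result can be inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pid; only the counter matters here. */
struct obj {
	atomic_int count;
	bool freed;	/* set instead of really freeing the object */
};

static void obj_init(struct obj *o)
{
	assert(o != NULL);
	atomic_init(&o->count, 1);	/* the creator holds one reference */
	o->freed = false;
}

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->count, 1);
}

/*
 * Variant with the extra read, mirroring the pattern in put_pid():
 * reading 1 means the caller holds the only reference, so the atomic
 * decrement is skipped. Returns true if the caller dropped the last
 * reference.
 */
static bool obj_put_fastpath(struct obj *o)
{
	if (atomic_load(&o->count) == 1 ||
	    atomic_fetch_sub(&o->count, 1) == 1) {
		o->freed = true;
		return true;
	}
	return false;
}

/* Variant with only the atomic decrement-and-test. */
static bool obj_put_plain(struct obj *o)
{
	if (atomic_fetch_sub(&o->count, 1) == 1) {
		o->freed = true;
		return true;
	}
	return false;
}
```

Both variants report the final put to the caller in the same way. The only
visible difference is that the fast path leaves the counter at 1 when it
frees, while the plain variant takes it down to 0.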

My guess is that the goal is to make the "drop last reference"
case a little bit faster by avoiding the cacheline dirtying and the
atomic op, at the expense of an extra memory op and branch every time
we drop a non-final reference. But that's a pretty low-level
optimization, and forking by itself isn't exactly fast... I think the
clean thing to do would be to either move this detail into the
refcount implementation (if it turns out to actually be valuable in at
least a microbenchmark), or just get rid of it. Given the overhead of
fork()/clone(), I would be surprised if you could actually measure
this effect here.

Eric, can you remember the rationale for doing it that way in commit
92476d7fc0326a409ab1d3864a04093a6be9aca7? Am I guessing correctly?