Re: [PATCH] Convert struct pid count to refcount_t

From: Joel Fernandes
Date: Thu Mar 28 2019 - 10:37:43 EST


On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > CONFIG_REFCOUNT_FULL=n.
> > > > >
> > > > > Cc: keescook@xxxxxxxxxxxx
> > > > > Cc: kernel-team@xxxxxxxxxxx
> > > > > Cc: kernel-hardening@xxxxxxxxxxxxxxxxxx
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > > > > [...]
> > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > --- a/kernel/pid.c
> > > > > +++ b/kernel/pid.c
> > > > > @@ -37,7 +37,7 @@
> > > > > #include <linux/init_task.h>
> > > > > #include <linux/syscalls.h>
> > > > > #include <linux/proc_ns.h>
> > > > > -#include <linux/proc_fs.h>
> > > > > +#include <linux/refcount.h>
> > > > > #include <linux/sched/task.h>
> > > > > #include <linux/idr.h>
> > > > >
> > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > return;
> > > > >
> > > > > ns = pid->numbers[pid->level].ns;
> > > > > - if ((atomic_read(&pid->count) == 1) ||
> > > > > - atomic_dec_and_test(&pid->count)) {
> > > > > + if ((refcount_read(&pid->count) == 1) ||
> > > > > + refcount_dec_and_test(&pid->count)) {
> > > >
> > > > Why is this (and the original code) safe in the face of a race against
> > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > don't see this code pattern anywhere else in the kernel.
> > >
> > > Semantically, it doesn't make a difference whether you do this or
> > > leave out the "refcount_read(&pid->count) == 1". If you read a 1 from
> > > refcount_read(), then you have the only reference to "struct pid", and
> > > therefore you want to free it. If you don't get a 1, you have to
> > > atomically drop a reference, which, if someone else is concurrently
> > > also dropping a reference, may leave you with the last reference (in
> > > the case where refcount_dec_and_test() returns true), in which case
> > > you still have to take care of freeing it.
> >
> > Also, based on Kees comment, I think it appears to me that get_pid and
> > put_pid can race in this way in the original code right?
> >
> > get_pid put_pid
> >
> > atomic_dec_and_test returns 1
>
> This can't happen. get_pid() can only be called on an existing
> reference. If you are calling get_pid() on an existing reference, and
> someone else is dropping another reference with put_pid(), then when
> both functions start running, the refcount must be at least 2.

Sigh, you are right. Ok. I was quite tired last night when I wrote this.
Obviously, I should have waited a bit and thought it through.

Kees, can you describe the race you had in mind in more detail?

> > atomic_inc
> > kfree
> >
> > deref pid /* boom */
> > -------------------------------------------------
> >
> > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > not test for pid->count == 1 as condition for freeing, but rather just do
> > atomic_dec_and_test. So something like the following diff. (And I see a
> > similar pattern used in drivers/net/mac.c)
>
> get_pid() can only be called when you already have a refcounted
> reference; in other words, when the reference count is at least one.
> The lifetime management of struct pid differs from the lifetime
> management of most other objects in the kernel; the usual patterns
> don't quite apply here.
>
> Look at put_pid(): When the refcount has reached zero, there is no RCU
> grace period (unlike most other objects with RCU-managed lifetimes).
> Instead, free_pid() has an RCU grace period *before* it invokes
> delayed_put_pid() to drop a reference; and free_pid() is also the
> function that removes a PID from the namespace's IDR, and it is used
> by __change_pid() when a task loses its reference on a PID.
>
> In other words: Most refcounted objects with RCU guarantee that the
> object waits for a grace period after its refcount has reached zero;
> and during the grace period, the refcount is zero and you're not
> allowed to increment it again.

Can you give an example of this "most refcounted objects with RCU" use case?
I could not find any good examples of it. I want to document this pattern
and possibly submit it to Documentation/RCU.

> But for struct pid, the guarantee is
> instead that there is an RCU grace period after it has been removed
> from the IDRs and the task, and during the grace period, refcounting
> is guaranteed to still work normally.

Ok, thanks. In scrappy but simple pseudocode form, I think the struct pid
flow is something like this (with "pid" replaced by "data"):

get_data:
	atomic_inc(&data->refcount);

some_user_of_data:
	rcu_read_lock();
	/* From X, obtain a ptr to data using rcu_dereference(). */
	get_data(data);
	rcu_read_unlock();

free_data:
	/* Remove all references to data in all places in X. */
	call_rcu(put_data);

put_data:
	if (atomic_dec_and_test(&data->refcount))
		free(data);

create_data:
	data = alloc(..);
	atomic_set(&data->refcount, 1);
	/* Set pointers to data in X. */
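
For anyone who wants to poke at the counting half of this flow, here is a
minimal userspace sketch in C11. It is a model under assumptions, not kernel
code: struct data, data_create(), data_get() and data_put() are hypothetical
names, C11 atomics stand in for atomic_t, and the RCU parts (rcu_read_lock(),
call_rcu()) are left out entirely, so it only shows the refcount rules.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pid; not the kernel's definition. */
struct data {
	atomic_int refcount;
	int payload;
};

/* create_data: the creator holds the initial reference. */
static struct data *data_create(int payload)
{
	struct data *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	atomic_init(&d->refcount, 1);
	d->payload = payload;
	return d;
}

/*
 * get_data: only legal while a reference is known to exist, so the
 * count observed before the increment must be at least 1.
 */
static struct data *data_get(struct data *d)
{
	int old = atomic_fetch_add(&d->refcount, 1);

	assert(old >= 1);
	return d;
}

/* put_data: returns true if this call dropped the last reference and freed d. */
static bool data_put(struct data *d)
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1) {
		free(d);
		return true;
	}
	return false;
}
```

In the real flow the final data_put() from free_data would run from an RCU
callback, which is what keeps data_get() safe inside the read-side section.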

> > put_pid to avoid such a race.
> >
> > ---8<-----------------------
> >
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 8cb86d377ff5..3d79834e3180 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
> >
> > static inline struct pid *get_pid(struct pid *pid)
> > {
> > - if (pid)
> > - refcount_inc(&pid->count);
> > + if (!pid || !refcount_inc_not_zero(&pid->count))
> > + return NULL;
> > return pid;
> > }
>
> Nope, this is wrong. Once the refcount is zero, the object goes away,
> refcount_inc_not_zero() makes no sense here.

Yeah ok, I think what you meant here is that references to the object from
all places go away before the grace period starts, so a get_pid on an object
with refcount of zero is impossible since there's no way to *get* to that
object after the grace-period ends.

So, yes you are right that refcount_inc is all that's needed.
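
To make the difference between the two increments concrete, here is a rough
userspace sketch using C11 atomics. ref_inc() and ref_inc_not_zero() are
hypothetical names that only model the idea behind refcount_inc() and
refcount_inc_not_zero(); they are not the kernel implementations and do none
of the saturation checking that refcount_t does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Unconditional increment: valid only when a reference already exists. */
static void ref_inc(atomic_int *count)
{
	int old = atomic_fetch_add(count, 1);

	assert(old >= 1);
}

/* Conditional increment: refuses to revive a count that has reached zero. */
static bool ref_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is updated to the current value and we retry. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}
```

The conditional form only earns its cost when a lookup can still find an
object whose count is already zero, which per the above cannot happen for
struct pid.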

Also, a note to the onlooker: the original patch I sent is not wrong; it
still applies and is correct. Here we are just discussing possible issues
with the *existing* code.
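
For reference, the check in put_pid() that started this subthread can be
modelled in userspace roughly as below. This is a sketch under assumptions:
ref_put_needs_free() is a hypothetical name, C11 atomics stand in for the
kernel primitives, and the caller is assumed to hold a reference on entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true when the caller held the last reference and must free. */
static bool ref_put_needs_free(atomic_int *count)
{
	int seen = atomic_load(count);

	/* The caller owns a reference, so the count cannot be below 1. */
	assert(seen >= 1);

	/*
	 * Fast path: reading 1 means ours is the only reference, and nobody
	 * else can take a new one because a get needs an existing reference.
	 * The count is left at 1; the object is about to be freed anyway.
	 */
	if (seen == 1)
		return true;

	/* Slow path: drop our reference; true if it turned out to be the last. */
	return atomic_fetch_sub(count, 1) == 1;
}
```

Dropping the fast path and keeping only the decrement gives the same answer,
which matches Jann's point that the read is semantically redundant.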

thanks!

- Joel